Please help me with the following python pandas practice! Will
rate good answers!
Q3.
From what you see in the histogram above, does the convenience
sample give us an accurate picture of points for the full
population of NBA players? Would you expect it to, in general?
Select from the following answers.
1. Yes. The sample is large enough, so it is an accurate
representation of the population.
2. No. The sample is too small, so it won't give us an accurate
representation of the population.
3. No. But this was just an unlucky sample; normally this would
give us an accurate representation of the population.
4. No. This type of sample doesn't give us an accurate
representation of the population.
Producing simple random samples
Often it's useful to take random samples even when we have a
larger dataset available. One reason is that it can help us
understand how inaccurate other samples are.
DataFrames provide the method `sample()` for producing simple
random samples. Note that its default is to sample **without**
replacement.
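As a minimal sketch of how `sample()` behaves, the snippet below draws a simple random sample from a small made-up DataFrame. The `toy` table and its values are illustrative assumptions, not the NBA data; passing `replace=False` only spells out the default.

```python
import pandas as pd

# Hypothetical toy table standing in for a larger dataset.
toy = pd.DataFrame(
    {
        "Points": [782, 680, 1154, 966, 646, 2, 4, 22],
        "Salary": [23500000, 23410988, 23180790, 22458401,
                   21436271, 29843, 29843, 14409],
    },
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
)

# Draw 3 rows uniformly at random. replace=False is the default,
# so no row can appear twice in the sample.
toy_sample = toy.sample(n=3, replace=False)

# With replace=True the same row may be drawn more than once.
toy_resample = toy.sample(n=3, replace=True)

print(toy_sample)
```

Each run of this snippet produces a different sample, which is the same behavior you rely on when re-running a sampling cell several times.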
Q4.
Produce a simple random sample *without replacement* of size 67
from `full_data`. Run your analysis on it again, and store the
resulting array of mean `'Points'` and mean `'Salary'` in
`my_small_stats`.
code: my_small_stats = ...
Run the cell containing `my_small_stats` several times to get
new samples and new sample means.
Are your results similar to those in the small sample we
provided you? Do things change a lot across separate samples?
Select from the following answers.
1. The results are very different from the small sample, and
don't change at all across separate samples.
2. The results are not at all different from the small sample,
and change a bit across separate samples.
3. The results are somewhat different from the small sample, and
change a bit across separate samples.
4. The results are not at all different from the small sample,
and don't change at all across separate samples.
Q5.
Similarly, create a simple random sample *without replacement*
of size 175 from `full_data` and store an array of the sample's
mean `'Points'` and mean `'Salary'` in `my_large_stats`.
code: my_large_stats = ...
Run the cell containing `my_large_stats` many times.
Do the histograms and mean statistics seem to change more or
less across samples of this size than across samples of size 67?
And for which variable are the sample means and histograms closer
to their true values – `'Points'` or `'Salary'`? Assign either 1,
2, 3, 4, or 5 to the variable `sampling_q5` below.
Is this what you expected to see?
1. The statistics change *less* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Points'` than they are for `'Salary'`.
2. The statistics change *less* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Salary'` than they are for `'Points'`.
3. The statistics change *more* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Points'` than they are for `'Salary'`.
4. The statistics change *more* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Salary'` than they are for `'Points'`.
5. The statistics change an *equal amount* across samples of
this size as across smaller samples. The statistics for `'Points'`
and `'Salary'` are *equally close* to their true values.
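One way to check an intuition about sample size is to repeat the sampling many times and measure how much the sample mean moves around. The sketch below does this on a synthetic, right-skewed population. The `population` values, the seed, and the helper `spread_of_sample_means` are all assumptions for illustration; substitute `full_data` and the column of interest to examine the real data.

```python
import numpy as np
import pandas as pd

# Seeded generator so the illustration is reproducible (assumed seed).
rs = np.random.RandomState(0)

# Synthetic stand-in for a population of 492 players (assumed values).
population = pd.DataFrame({"Points": rs.exponential(scale=500, size=492)})

def spread_of_sample_means(df, column, n, repetitions=500):
    """Standard deviation of the sample mean over repeated SRSs of size n."""
    means = [
        df.sample(n=n, replace=False, random_state=rs)[column].mean()
        for _ in range(repetitions)
    ]
    return np.std(means)

small_spread = spread_of_sample_means(population, "Points", n=67)
large_spread = spread_of_sample_means(population, "Points", n=175)

print("Spread of sample means, n=67: ", small_spread)
print("Spread of sample means, n=175:", large_spread)
```

Comparing the two printed spreads gives a numerical counterpart to what the histograms show when the sampling cell is re-run by hand.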
Homework 4: Simulation, Sampling, and Hypothesis Testing
code:
    # please don't change this cell, but do make sure to run it
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
1. Sampling with NBA Data
In this question, we'll use our familiar player and salary data from the 2015-16 NBA season to get some practice with sampling.
Run the cells below to load the player and salary data, which come from different DataFrames, and to merge them into a single DataFrame, indexed by player.
code:
    player_data = pd.read_csv("data/player_data.csv").set_index('Name')
    salary_data = pd.read_csv("data/salary_data.csv").set_index('PlayerName')
    full_data = salary_data.merge(player_data, left_index=True, right_index=True)
    full_data
Output (492 rows x 10 columns):

| | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points |
|---|---|---|---|---|---|---|---|---|---|---|
| Kobe Bryant | 23500000 | 36 | LAL | 35 | 199 | 197 | 47 | 7 | 128 | 782 |
| Amare Stoudemire | 23410988 | 32 | TOT | 59 | 329 | 45 | 29 | 38 | 78 | 680 |
| Joe Johnson | 23180790 | 33 | BRK | 80 | 384 | 292 | 59 | 14 | 137 | 1154 |
| Carmelo Anthony | 22458401 | 30 | NYK | 40 | 264 | 122 | 40 | 17 | 89 | 966 |
| Dwight Howard | 21436271 | 29 | HOU | 41 | 431 | 50 | 28 | 53 | 115 | 646 |
| … | … | … | … | … | … | … | … | … | … | … |
| Sim Bhullar | 29843 | 22 | SAC | 3 | 1 | 1 | 0 | 1 | 0 | 2 |
| David Stockton | 29843 | 23 | SAC | 3 | 2 | 9 | 2 | 0 | 0 | 4 |
| David Wear | 29843 | 24 | SAC | 2 | 2 | 1 | 0 | 0 | … | … |
| Andre Dawkins | 29843 | 23 | MIA | 4 | 4 | 2 | 2 | 1 | … | … |
| Vander Blue | 14409 | 22 | LAL | 2 | 9 | 8 | 3 | 0 | 6 | 22 |
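The merge above joins two tables on their indexes. A minimal sketch of the same pattern on two tiny made-up tables may help show what survives the join; the names `toy_salary` and `toy_players` and all their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical salary table, indexed by player name.
toy_salary = pd.DataFrame(
    {"PlayerName": ["Ann", "Bo", "Cy"], "Salary": [100, 200, 300]}
).set_index("PlayerName")

# Hypothetical player table, also indexed by player name.
toy_players = pd.DataFrame(
    {"Name": ["Bo", "Cy", "Di"], "Points": [10, 20, 30]}
).set_index("Name")

# merge() defaults to an inner join, so only names present in
# both indexes ("Bo" and "Cy") appear in the result.
toy_full = toy_salary.merge(toy_players, left_index=True, right_index=True)

print(toy_full)
```

Because the join is an inner join, a player who appears in only one of the two tables is silently dropped, which is worth keeping in mind when counting rows after a merge.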
We'll start by creating a function called `compute_statistics` that takes as input a DataFrame with two columns, `'Points'` and `'Salary'`, and then:
• draws a histogram of `'Points'`,
• draws a histogram of `'Salary'`, and
• returns a two-element array containing the mean `'Points'` and mean `'Salary'`.
Run the cell below to define the `compute_statistics` function, and a helper function called `histograms`. Don't worry about how this code works, and please don't change anything.
code:
    # Don't change this cell, just run it.
    def histograms(df):
        points = df.get('Points').values
        salaries = df.get('Salary').values

        a = plt.figure(1)
        plt.hist(points, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 2500, 50))
        plt.title('Distribution of Points')
        s = plt.figure(2)
        plt.hist(salaries, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 3.5 * 10**7, 2.5 * 10**6))
        plt.title('Distribution of Salaries')

    def compute_statistics(points_and_salary_data, draw=True):
        if draw:
            histograms(points_and_salary_data)
        points = np.average(points_and_salary_data.get('Points').values)
        salary = np.average(points_and_salary_data.get('Salary').values)
        avg_points_salary_array = np.array([points, salary])
        return avg_points_salary_array
We can use this `compute_statistics` function to show the distribution of `'Points'` and `'Salary'` and compute their means, for any collection of players.
Run the next cell to show these distributions and compute the means for all NBA players. Notice that the array containing the mean `'Points'` and mean `'Salary'` values is displayed before the histograms, and the numbers are given in scientific notation.
code:
    full_stats = compute_statistics(full_data)
    full_stats
Output: array([5.00071138e+02, 4.26977577e+06])
[Figure: histogram "Distribution of Points", x-axis from 0 to 2500]
[Figure: histogram "Distribution of Salaries", x-axis from 0 to 3.0e7]
Next, we'll compare the distribution of points in our convenience sample with the distribution of points for all players in our dataset.
code:
    # just run this cell, don't change it
    def compare_points(first, second, first_title, second_title):
        """Compare the points in two DataFrames."""
        bins = np.arange(0, 2500, 50)
        first.plot(kind='hist', y='Points', bins=bins, density=True, ec='w', color='blue', alpha=0.5)
        plt.title('Points Distribution for ' + first_title)
        second.plot(kind='hist', y='Points', bins=bins, density=True, ec='w', color='blue', alpha=0.5)
        plt.title('Points Distribution for ' + second_title)

    compare_points(full_data, convenience_sample, 'All Players', 'Convenience Sample')
[Figure: histogram "Points Distribution for All Players", x-axis: Points (0 to 2500), y-axis: Frequency]
Simple random sampling
A more principled approach is to sample uniformly at random from the players. If we ensure that each player is selected at most once, this is a simple random sample without replacement, sometimes abbreviated to "simple random sample" or "SRS". Imagine writing down each player's name on a card, putting the cards in a hat, and shuffling the hat. To sample, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
We've produced two samples of `salary_data` in this way: `small_srs_salary.csv` and `large_srs_salary.csv` contain, respectively, a sample of size 67 and a larger sample of size 150.
Now we'll run the same analyses on the small simple random sample, the large simple random sample, and the convenience sample. The `load_data` function below loads a salary table and merges it with `player_data`. The subsequent code draws the histograms and computes the means for `'Points'` and `'Salary'`.
code:
    # Don't change this cell, but do run it.
    def load_data(salary_file):
        return player_data.merge(pd.read_csv(salary_file), left_index=True, right_on='PlayerName')

    small_srs_data = load_data('data/small_srs_salary.csv')
    large_srs_data = load_data('data/large_srs_salary.csv')

    small_stats = compute_statistics(small_srs_data, draw=False);
    large_stats = compute_statistics(large_srs_data, draw=False);
    convenience_stats = compute_statistics(convenience_sample, draw=False);

    print('Full data stats:', full_stats)
    print('Small SRS stats:', small_stats)
    print('Large SRS stats:', large_stats)
    print('Convenience sample stats:', convenience_stats)

    color_dict = {
        'small simple random': 'blue',
        'large simple random': 'green',
        'convenience': 'orange'
    }

    plt.subplots(3, 2, figsize=(15, 15), dpi=100)
    i = 1

    for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
        plt.subplot(3, 2, i)
        i += 2
        plt.hist(df.get('Points'), density=True, alpha=0.5, color=color_dict[name], ec='w',
                 bins=np.arange(0, 2500, 50));
        plt.title(f'Points histogram for {name} sample')

    i = 2
    for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
        plt.subplot(3, 2, i)
        i += 2
        plt.hist(df.get('Salary'), density=True, alpha=0.5, color=color_dict[name], ec='w',
                 bins=np.arange(0, 3.5 * 10**7, 2.5 * 10**6));
        plt.title(f'Salary histogram for {name} sample')

    #plt.show()
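Returning to the cards-in-a-hat description of simple random sampling, the procedure can be sketched directly with NumPy: shuffle the cards, then keep the first few. The `cards` array, the names in it, and `sample_size` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical "cards": one name per player.
cards = np.array(["Ann", "Bo", "Cy", "Di", "Ed", "Flo", "Gus"])
sample_size = 3

# Shuffle the cards (permutation returns a shuffled copy),
# then pull the first `sample_size` cards off the top.
# Each card can be drawn at most once, so this is sampling
# without replacement.
shuffled = rng.permutation(cards)
drawn = shuffled[:sample_size]

print(drawn)
```

This is the same idea that `sample()` applies to the rows of a DataFrame; the sketch only makes the shuffle-and-draw step explicit.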