Please help me with the following python pandas practice! Will rate good answers! Q1. Suppose you live in New Jersey and

Post by **answerhappygod** » Wed Mar 30, 2022 9:18 am

Please help me with the following python pandas practice! Will
rate good answers!

Q1.
Suppose you live in New Jersey and you only survey players from
the three closest teams:
- New York Knicks (`'NYK'`)
- Brooklyn Nets (`'BRK'`)
- Philadelphia 76ers (`'PHI'`)
Assign `convenience_sample` to a subset of `full_data` that
contains only the rows for players on one of these three teams.
code: convenience_sample =
...
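One possible way to approach Q1 is a boolean mask built with `isin`. The sketch below uses a tiny made-up table (`toy_full_data`, with placeholder player names and values) in place of the real `full_data`, so every name prefixed with `toy_` is an assumption for illustration only.

```python
import pandas as pd

# Toy stand-in for full_data (hypothetical rows; the real table has 492 players).
toy_full_data = pd.DataFrame(
    {
        'Team': ['NYK', 'LAL', 'BRK', 'PHI', 'HOU'],
        'Points': [966, 782, 1154, 300, 646],
        'Salary': [22458401, 23500000, 23180790, 1000000, 21436271],
    },
    index=['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
)

nearby_teams = ['NYK', 'BRK', 'PHI']

# Keep only the rows whose 'Team' value is one of the three nearby teams.
toy_convenience_sample = toy_full_data[toy_full_data.get('Team').isin(nearby_teams)]
print(toy_convenience_sample)
```

The same pattern should carry over to the assignment by swapping `toy_full_data` for `full_data`.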
Q2.
Assign `convenience_stats` to an array of the mean `'Points'`
and mean `'Salary'` of your convenience sample. Since they're
computed on a sample, these are called *sample means*.
*Hint*: It's fine to draw two histograms as well as assign the
variable `convenience_stats`.
code: convenience_stats = ...
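As a rough sketch of what Q2 asks for, the two sample means can be packed into a NumPy array directly. The table `toy_sample` below is invented for illustration; in the notebook itself, calling `compute_statistics` on the convenience sample returns an array of the same shape and also draws the two histograms.

```python
import numpy as np
import pandas as pd

# Hypothetical three-player sample, used only to show the shape of the result.
toy_sample = pd.DataFrame({
    'Points': [100, 200, 300],
    'Salary': [1_000_000, 2_000_000, 6_000_000],
})

# Two-element array: [mean 'Points', mean 'Salary'].
toy_stats = np.array([
    toy_sample.get('Points').mean(),
    toy_sample.get('Salary').mean(),
])
print(toy_stats)
```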

Q3.
From what you see in the histogram above, does the convenience
sample give us an accurate picture of points for the full
population of NBA players? Would you expect it to, in
general? Select from the following answers.
1. Yes. The sample is large enough, so it is an accurate
representation of the population.
2. No. The sample is too small, so it won't give us an accurate
representation of the population.
3. No. But this was just an unlucky sample, normally this would
give us an accurate representation of the population.
4. No. This type of sample doesn't give us an accurate
representation of the population.

Producing simple random samples
Often it's useful to take random samples even when we have a
larger dataset available. One reason is that it can help us
understand how inaccurate other samples are.
DataFrames provide the method `sample()` for producing simple
random samples. Note that its default is to sample
**without** replacement.
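A minimal sketch of `sample()` on a made-up table (`toy_population` is an assumption, not part of the homework). Passing `replace=False` just spells out the default, and `random_state` is fixed here only so the sketch is repeatable; omit it to get a fresh sample on every run.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 100 rows.
toy_population = pd.DataFrame({'Points': np.arange(0, 1000, 10)})

# n is the sample size; replace=False means no row can be drawn twice.
toy_srs = toy_population.sample(n=20, replace=False, random_state=0)

# Every sampled row label is distinct because sampling is without replacement.
print(len(toy_srs), toy_srs.index.is_unique)
```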
Q4.
Produce a simple random sample *without replacement* of size 67
from `full_data`. Run your analysis on it again, and store the
resulting array of mean `'Points'` and mean `'Salary'` in
`my_small_stats`.
code: my_small_stats = ...
Run the cell containing `my_small_stats` several times to get
new samples and new sample means.
Are your results similar to those in the small sample we
provided you? Do things change a lot across separate samples?
Select from the following answers.
1. The results are very different from the small sample, and
don't change at all across separate samples.
2. The results are not at all different from the small sample,
and change a bit across separate samples.
3. The results are somewhat different from the small sample, and
change a bit across separate samples.
4. The results are not at all different from the small sample,
and don't change at all across separate samples.
Q5.
Similarly, create a simple random sample *without replacement*
of size 175 from `full_data` and store an array of the sample's
mean `'Points'` and mean `'Salary'` in `my_large_stats`.
code: my_large_stats = ...
Run the cell containing `my_large_stats` many times.
Do the histograms and mean statistics seem to change more
or less across samples of this size than across samples of size 67?
And for which variable are the sample means and histograms
closer to their true values – `'Points'` or `'Salary'`?
Assign either 1, 2, 3, 4, or 5 to the variable `sampling_q5`
below.
Is this what you expected to see?
1. The statistics change *less* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Points'` than they are for `'Salary'`.
2. The statistics change *less* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Salary'` than they are for `'Points'`.
3. The statistics change *more* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Points'` than they are for `'Salary'`.
4. The statistics change *more* across samples of this size than
across smaller samples. The statistics are closer to their true
values for `'Salary'` than they are for `'Points'`.
5. The statistics change an *equal amount* across samples of
this size as across smaller samples. The statistics for `'Points'`
and `'Salary'` are *equally close* to their true values.
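One way to explore the first half of this question without rerunning a cell by hand is to repeat the sampling many times and measure how much the sample mean moves. The sketch below uses a synthetic, right-skewed population (an assumption, not the NBA data) and a hypothetical helper `spread_of_sample_means`; it says nothing about which of `'Points'` or `'Salary'` is closer to its true value in the real data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in population (NOT the NBA data).
rng = np.random.default_rng(42)
synthetic_population = pd.DataFrame({'Points': rng.exponential(scale=500, size=492)})

def spread_of_sample_means(population, sample_size, repetitions=500):
    """Standard deviation of the mean 'Points' over repeated simple random samples."""
    means = [
        population.sample(n=sample_size, replace=False, random_state=seed).get('Points').mean()
        for seed in range(repetitions)
    ]
    return np.std(means)

small_spread = spread_of_sample_means(synthetic_population, 67)
large_spread = spread_of_sample_means(synthetic_population, 175)
print(small_spread, large_spread)
```

On this synthetic population the spread for size 175 comes out smaller than for size 67, which is the usual pattern: larger simple random samples give more stable sample means.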
Homework 4: Simulation, Sampling, and Hypothesis Testing

```python
# please don't change this cell, but do make sure to run it
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
```

1. Sampling with NBA Data
In this question, we'll use our familiar player and salary data from the 2015-16 NBA season to get some practice with sampling.
Run the cells below to load the player and salary data, which come from different DataFrames, and to merge them into a single DataFrame, indexed by player.

```python
player_data = pd.read_csv("data/player_data.csv").set_index('Name')
salary_data = pd.read_csv("data/salary_data.csv").set_index('PlayerName')
full_data = salary_data.merge(player_data, left_index=True, right_index=True)
full_data
```

| | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points |
|---|---|---|---|---|---|---|---|---|---|---|
| Kobe Bryant | 23500000 | 36 | LAL | 35 | 199 | 197 | 47 | 7 | 128 | 782 |
| Amare Stoudemire | 23410988 | 32 | TOT | 59 | 329 | 45 | 29 | 38 | 78 | 680 |
| Joe Johnson | 23180790 | 33 | BRK | 80 | 384 | 292 | 59 | 14 | 137 | 1154 |
| Carmelo Anthony | 22458401 | 30 | NYK | 40 | 264 | 122 | 40 | 17 | 89 | 966 |
| Dwight Howard | 21436271 | 29 | HOU | 41 | 431 | 50 | 28 | 53 | 115 | 646 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sim Bhullar | 29843 | 22 | SAC | 3 | 1 | 1 | 0 | 1 | 0 | 2 |
| David Stockton | 29843 | 23 | SAC | 3 | 2 | 9 | 2 | 0 | 0 | 4 |
| David Wear | 29843 | 24 | SAC | ? | ? | ? | ? | ? | ? | ? |
| Andre Dawkins | 29843 | 23 | MIA | ? | ? | ? | ? | ? | ? | ? |
| Vander Blue | 14409 | 22 | LAL | 2 | 9 | 8 | 3 | 0 | 6 | 22 |

492 rows × 10 columns
(Entries marked ? were not legible in the screenshot.)
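To illustrate what an index-on-index merge like the one above does, here is a small sketch with two invented tables (`toy_player_data` and `toy_salary_data`, with placeholder names and numbers). With `left_index=True, right_index=True`, pandas matches rows by their index labels, and the default inner join keeps only labels present in both tables.

```python
import pandas as pd

# Hypothetical tables, each indexed by player name.
toy_player_data = pd.DataFrame(
    {'Name': ['Player A', 'Player B'], 'Points': [10, 20]}
).set_index('Name')
toy_salary_data = pd.DataFrame(
    {'PlayerName': ['Player B', 'Player A', 'Player C'], 'Salary': [200, 100, 300]}
).set_index('PlayerName')

# Match rows on the index of both tables; 'Player C' has no stats row, so it is dropped.
toy_merged = toy_salary_data.merge(toy_player_data, left_index=True, right_index=True)
print(toy_merged)
```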
We'll start by creating a function called `compute_statistics` that takes as input a DataFrame with two columns, `'Points'` and `'Salary'`, and then:
- draws a histogram of `'Points'`,
- draws a histogram of `'Salary'`, and
- returns a two-element array containing the mean `'Points'` and mean `'Salary'`.
Run the cell below to define the `compute_statistics` function, and a helper function called `histograms`. Don't worry about how this code works, and please don't change anything.

```python
# Don't change this cell, just run it.
def histograms(df):
    points = df.get('Points').values
    salaries = df.get('Salary').values

    a = plt.figure(1)
    plt.hist(points, density=True, alpha=0.5, color='blue', ec='w',
             bins=np.arange(0, 2500, 50))
    plt.title('Distribution of Points')

    b = plt.figure(2)
    plt.hist(salaries, density=True, alpha=0.5, color='blue', ec='w',
             bins=np.arange(0, 3.5 * 10**7, 2.5 * 10**6))
    plt.title('Distribution of Salaries')

def compute_statistics(points_and_salary_data, draw=True):
    if draw:
        histograms(points_and_salary_data)
    points = np.average(points_and_salary_data.get('Points').values)
    salary = np.average(points_and_salary_data.get('Salary').values)
    avg_points_salary_array = np.array([points, salary])
    return avg_points_salary_array
```

We can use this `compute_statistics` function to show the distribution of `'Points'` and `'Salary'` and compute their means, for any collection of players. Run the next cell to show these distributions and compute the means for all NBA players. Notice that the array containing the mean `'Points'` and mean `'Salary'` values is displayed before the histograms, and the numbers are given in scientific notation.

```python
full_stats = compute_statistics(full_data)
full_stats
```
Output: `array([5.00071138e+02, 4.26977577e+06])`
[Figures: two density histograms, "Distribution of Points" (x-axis from 0 to 2500) and "Distribution of Salaries" (x-axis from 0 to 3.0e7)]
Now, imagine that instead of having access to the full population of NBA players, we had only gotten data on a smaller subset of the players, or a sample. For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky. Instead, we often make statistical inferences about a large underlying population using a smaller sample.
A statistical inference is a statement about some characteristic of the underlying population, such as "the average salary of NBA players in 2014 was $3 million". You may have heard the word "inference" used in other contexts. It's important to keep in mind that statistical inferences can be wrong.
A common strategy for inference using samples is to estimate parameters of the population by computing the same statistics on a sample. This strategy sometimes works well and sometimes doesn't. The degree to which it gives us useful answers depends on several factors.
One very important factor in the utility of samples is how they were gathered. Let's look at some different sampling strategies.
Convenience sampling
One sampling methodology, which is generally a bad idea, is to choose players who are somehow convenient to sample. For example, you might choose players from a team that's near your house, since it's easier to survey them. This is called, somewhat pejoratively, convenience sampling.
Next, we'll compare the distribution of points in our convenience sample with the distribution of points for all players in our dataset.

```python
# just run this cell, don't change it
def compare_points(first, second, first_title, second_title):
    """Compare the points in two DataFrames."""
    bins = np.arange(0, 2500, 50)
    first.plot(kind='hist', y='Points', bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title('Points Distribution for ' + first_title)
    second.plot(kind='hist', y='Points', bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title('Points Distribution for ' + second_title)

compare_points(full_data, convenience_sample, 'All Players', 'Convenience Sample')
```
[Figure: density histogram "Points Distribution for All Players" (x-axis: Points, 0 to 2500; y-axis: Frequency)]
Simple random sampling
A more principled approach is to sample uniformly at random from the players. If we ensure that each player is selected at most once, this is a simple random sample without replacement, sometimes abbreviated to "simple random sample" or "SRS". Imagine writing down each player's name on a card, putting the cards in a hat, and shuffling the hat. To sample, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
We've produced two samples of `salary_data` in this way: `small_srs_salary.csv` and `large_srs_salary.csv` contain, respectively, a sample of size 67 and a larger sample of size 150.
Now we'll run the same analyses on the small simple random sample, the large simple random sample, and the convenience sample. The `load_data` function below loads a salary table and merges it with `player_data`. The subsequent code draws the histograms and computes the means for `'Points'` and `'Salary'`.
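The cards-in-a-hat picture can be sketched with NumPy: shuffle the names once, then take the first few. The card labels below are placeholders and the seed is fixed only for repeatability; this is an illustration of the idea, not the code used to produce the provided CSV files.

```python
import numpy as np

# Hypothetical "cards", one per player.
cards = np.array(['Player A', 'Player B', 'Player C', 'Player D', 'Player E', 'Player F'])

rng = np.random.default_rng(0)

# Shuffle the hat, then pull cards off the top until the sample size is reached.
shuffled_cards = rng.permutation(cards)
sample_size = 3
drawn_cards = shuffled_cards[:sample_size]
print(drawn_cards)
```

Because each card appears exactly once in the shuffled order, no player can be drawn twice, which is what "without replacement" means.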
```python
# Don't change this cell, but do run it.
def load_data(salary_file):
    return player_data.merge(pd.read_csv(salary_file), left_index=True, right_on='PlayerName')

small_srs_data = load_data('data/small_srs_salary.csv')
large_srs_data = load_data('data/large_srs_salary.csv')

small_stats = compute_statistics(small_srs_data, draw=False)
large_stats = compute_statistics(large_srs_data, draw=False)
convenience_stats = compute_statistics(convenience_sample, draw=False)

print('Full data stats:', full_stats)
print('Small SRS stats:', small_stats)
print('Large SRS stats:', large_stats)
print('Convenience sample stats: ', convenience_stats)

color_dict = {
    'small simple random': 'blue',
    'large simple random': 'green',
    'convenience': 'orange'
}

plt.subplots(3, 2, figsize=(15, 15), dpi=100)

# Left column (subplots 1, 3, 5): 'Points' histograms.
i = 1
for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('Points'), density=True, alpha=0.5, color=color_dict[name], ec='w',
             bins=np.arange(0, 2500, 50))
    plt.title(f'Points histogram for {name} sample')

# Right column (subplots 2, 4, 6): 'Salary' histograms.
i = 2
for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('Salary'), density=True, alpha=0.5, color=color_dict[name], ec='w',
             bins=np.arange(0, 3.5 * 10**7, 2.5 * 10**6))
    plt.title(f'Salary histogram for {name} sample')

# plt.show()
```