# Reading Assignment #17 - Due 4/6/12

Here's your next reading assignment. Read Section 5.4 in your textbook and "Ending the Infographic Plague" by Megan McArdle and answer the following questions by 8 a.m., Friday, April 6th. Be sure to log in to the blog (using the link near the bottom of the sidebar) before leaving your answers in the comment section below.

1. The authors suggest that if a sample size is less than 10% of the population size, we can treat the observations in the sample as independent. Why is that reasonable?
2. When H0: p1 = p2, why does it make sense to use the pooled estimate of the sample proportion instead of the two individual sample means?
3. Given McArdle's concerns about infographics, what are one or two things you can do when designing your application project infographic to make it an effective one?

## 47 thoughts on “Reading Assignment #17 - Due 4/6/12”

1. 1. I think the idea is sampling without replacement. The first time you draw a spade out of a deck of cards p=13/52=0.25. When drawing the second card, there is a lower probability you will get a spade (0.23) and a higher chance you will get any other suit (0.255). If you sample too large of a population without replacement, the decrease in the overall pool to choose from starts affecting the probability of success.

2. p1 could be higher than u1 or p2 could be lower than u2. By pooling, we are finding a more likely probability, because it doubles the number of samples.

3. I think the most important factor is to check your sources and make sure their data is valid. Second, make sure you are making reasonable assumptions, and state them.
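The card arithmetic in the comment above can be checked with a short Python sketch. The function name and the 52,000-card "deck" are illustrative assumptions, not from the textbook; the point is only to show how little the probability shifts when one draw is a tiny fraction of the population.

```python
from fractions import Fraction


def success_prob_after_one_success(population, successes):
    """P(next draw is a success) after one success has been removed
    from the population without replacement."""
    return Fraction(successes - 1, population - 1)


# Standard deck: 13 spades out of 52 cards, first card drawn was a spade.
deck = success_prob_after_one_success(52, 13)          # 12/51, about 0.2353

# Hypothetical population 1000 times larger with the same 25% success rate.
large = success_prob_after_one_success(52_000, 13_000)  # 12999/51999

print(float(deck), float(large))
```

With the real deck, one draw moves the probability of a spade from 0.25 to roughly 0.235. In the much larger hypothetical population the same draw barely moves it at all, which is the intuition behind treating a small sample as nearly independent.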

2. 1) This comes from Section 2.4 where it was said that when a sample size is only a small fraction of the population (under 10%), observations are nearly independent even when sampling without replacement.
2) It allows us to get a more precise estimate of the standard deviation of a shared proportion by combining the estimates of our independent samples. Additionally, it is easy to compute.
3) Make sure you get valid data from reliable sources, and make sure your infographic accurately and truthfully shows this data.
4) I would like to see more examples of hypothesis tests dealing with difference of two proportions. I am also not sure that my answers to #1 and #2 completely answer those questions.

3. 1. If you are only taking a small sample out of the population, when you take people out it doesn't affect the overall probability very much. There will still be at least 90% of the population keeping the probability of each outcome reasonably the same regardless of what the sample has shown.
2. We don't know p1 and p2 and since the hypothesis is the same, it's easier to use p hat and combine both sets of data.
3. We will make sure that all of our data comes from legitimate, easily identifiable sources and we won't lie.
4. I would like to know what the infographic author would consider to be the qualifying factor for a good infographic.

4. 1. If the population size is low, choosing 1 observation might affect the probability of choosing another. For example, if there are only 10 students at Vanderbilt, and we're choosing 5 to make a sample, after we choose 1 student, the probability of choosing another is now 1/9 instead of 1/10, so the observations are not independently chosen.

2. If we pool the results of both samples, we'll get a p hat value that is closer to the true p because having more values means we average out the extremes.

3. Use reliable sources that relate clearly to the data you're displaying. Don't make your infographic seem like it's talking about some kind of disaster. Don't stretch data just to make your infographic grab someone's attention.

4. None

5. 1. It reduces the potential interaction between the members of the sample.
2. Since p1 = p2, it makes sense to use an equation that takes advantage of that equality.
3. We should cite our sources and make sure that they agree with each other (not picking and choosing our data weirdly).
4. none

6. 1. Because if the sample size is small enough, you are most likely not going to choose the same members of the population over again for each sample. Therefore, the observations are independent.

2. Since using the pooled estimate allows you to conduct a test that includes both individual sample means combined into one proportion, it makes sense to use the pooled estimate to compute one p-value. Otherwise, you couldn't compare the two sample means with one hypothesis test.

3. We can make sure that we are not misrepresenting the data and make sure that the data comes from a reliable source.

7. 1) This is reasonable because a sample larger than 10 percent could start showing trends other than the one being tested for. The sampling of data could have been done at multiple times, and by taking more than 10 percent, you increase the chances that those discrepancies are due to lurking variables from the experiment.
2) That was actually my question but I'll give it a shot anyways. Since the test is designed assuming the null hypothesis (H0: pn = pc), we can assume that they are equal for our SE.
3) Find consistent data from a single source, not a set of data combined from multiple sources. Ensure that the experiments run to collect that data were unbiased.
4) I'm still not quite sure if my answer to 2 is correct, and I admittedly am not too sure about my answer to 1 either. So both of those would be good questions. 🙂

8. 1) Because as the sample size becomes larger, the outcome of the observations will be easily predicted based on the previous outcomes.

2)Since the null hypothesis is that there is no difference, using two individual sample means will contradict the hypothesis itself and be more prone to error.

3)Do not lie and exaggerate the data.

4)No question.

9. 1. This is reasonable because 10% is a small enough percentage to conclude that the sample in question will not be connected (dependent) in any way. The greater the percentage, the more of a chance the sample elements have some connection.

2. We may not always know the two sample means so we can get a good estimate by pooling the results.

3. Make sure the data and the sources of the data are legit. Some sites just create bogus info-graphics to raise their query status on google.

4. A little confused on why we pool. I don't think I understand #2.

10. 1. Since it is a random sample, and less than 10% of a population is a relatively small proportion, the randomness of the sample will keep the different observations independent.
2. Since p1 = p2, we can say that all the successes and failures from each sample could be considered to be in the same sample. Therefore, we use the pooled estimate for greater accuracy.
3. For one, we need to make sure our data is accurate and from reputable sources. Also, we must not try to use our data to make it seem as though there is an increasing danger with car accidents (or something of the sort) if there is not.
4. I kind of expected more examples of visually bad infographics from the article. Where could I see some of those?

11. 1) Because the sample size is less than 10% of the population, we can treat the population using a normal model and treat the sample as independent.

2) Because p1 = p2 in the null hypothesis, the standard error should reflect this equality, which is why we use a pooled estimate.

3) Infographics should explicitly state what the data represents ("amenable deaths" vs. "deaths"). Infographics also should not have unnecessary graphics on them that distract you from the actual data.

12. 1. Because less than ten percent makes it so that knowing one observation does not give any idea of what another observation would be. This makes the observations in the sample independent.

2. Because the null hypothesis is that there is no difference between the two sample groups, it means that the pooled estimate should be the same as each individually and it is useful to obtain a better estimate for the p.

3. Do not insinuate things are awful or mislead the reader to believe something without the data to completely back it up and do make the sources for the data easily accessible by not hiding it at the bottom and not just putting the url.

13. 1. I am really not sure. The book just brushes over it without really explaining. My only guess is that a small sample size means that there is less of a chance of a part of the sample having an effect on the other.

2. It makes sense because if both of the p's are equal to one another, then you might as well combine the estimates of p to get one estimate that is ultimately better because of the larger number of observations

3. Keep the sources consistent and easy to check. Also, avoid misleading and blunt conclusions.

4. I can't find out why, when less than 10% of the population is sampled, the observations are considered independent.

14. 1. According to the footnote on page 147, “The choice of 10% is based on the findings of Section 2.4; observations look like they were independently drawn so long as the sample consists of less than 10% of the population”. With that being said, I’m still really not sure why 10% is reasonable.
2. When there is a certain statistic you don’t know, it’s relatively efficient to obtain a good estimate of it by pooling the results. This makes sense because you are essentially adding in more data, and the more data you have, the more accurate your estimate will be.
3. (1) don’t make up data – do the research and don’t misrepresent the facts; (2) don’t make it too long or complicated because the viewer will get tired of reading
4. Pretty straightforward section, I don’t have a question.

15. 1) We assume independence when a sample size is less than 10% of the population because we can assume little to no interaction between datapoints in the sample that would bias the dependent variables at low proportions of the total population

2) Because we are assuming that the proportions are equal in the two cases that we are comparing, we can assume that the proportion in either block of the sample equals the proportion of the entire sample.

3) Many of the points she presented seemed to be about proper representation of data and reliability of source data. In constructing our infographic, we can combat these by making sure that we use statistically viable methods to produce the infographic, and cite relevant, knowledgeable sources for our data that we perform statistical operations on

4) What are some ways not mentioned in the article that you would suggest to make our infographics more effective?

16. 1. A sample size less than 10% of the population size can be treated as if the observations were independent because this lowers the chance that subgroups are included. If a sampling included people who wore sunglasses and people who did not, then if a sampling were greater than 10%, there is a greater chance that one of these two subgroups are wholly included while the other one is not. Lowering the percentage of the population ensures that the observations are truly random, while increasing the amount of observations allows us to assume a normal distribution. Less than 10% allows us to fit the sample into both these categories.
2. It makes sense to use pooled estimates of the sample proportion instead of the two individual sample means when H0: p1 = p2 because we are assuming that the proportions are equal to one another. Since hypothesis tests are conducted assuming the null hypothesis is true, then the proportions should be equal to one another. Thus pooling the estimates should not affect the proportion, in this case, but rather strengthen it as increasing the number of observations brings us closer to the actual value of p.
3. In order to make the infographic an effective one, we can make sure that all of our sources are cited correctly. This means that the data is presented in a way which represents the data in a truthful manner as well as putting sources that are easy to check. We will also make sure these sources are relevant to our projects. Another thing we can do is not to create a sensationalist title that makes it seem that a panic must be incited. We will try to present our facts truthfully and our sources in an easy to check manner without causing the reader to be in a panic.
4. Are the results of such a hypothesis test affected when the results can be paired?

17. 1. If you take multiple samples the individual items in each sample have a low chance of being picked for two different samples.
2. Since H0: p1 = p2, and since in our hypothesis test we assume that the null hypothesis is true, we need to combine information from both p1 and p2 to find a common value p=p1=p2.
3. Don't distort statistics: i.e. be sure we're comparing comparable data, don't make wild claims (or imply them), and be sure of the accuracy of our sources.
4. Is the rest of the processes of hypothesis testing when p1=p2 the same as it is normally (since the book doesn't finish out the example)?

18. 1. I would assume this is reasonable due to the small size of the population compared to that of the total population. It is not likely that the individuals in this small sample affect each other.
2. In order to get an estimate of the rate of both groups you can "pool" the two sample groups when they are equal to each other. This new proportion can then be used to better compute the standard error.
3. I would say two things that will help make our infographics better after reading this article would be to first make sure all the information being put in the infographic is accurate and can be linked to an appropriate source. I also think a main point being made in this article is that information in infographics needs to be clear and percentages put in need to be understood in the context of what is being represented in the graphic.
4. I was a little confused on how the standard error was computed for the differences in sample proportions.

19. 1. This is reasonable because the sample size is small enough relative to the population to not affect the individual observations so that these individual observations will still be independent from the rest.
2. It makes sense to use the pooled estimate of the sample proportion instead of the two individual sample means because the hypothesis is that the two are equal, giving them the same mean, effectively if the hypothesis isn't rejected.
3. We can be sure to cite our sources properly, and not make huge connections that are very loosely affiliated to base a majority of the premise on those false assumptions. Additionally, we are definitely related to the subject matter, since our project is about students and students' academic performances, and we won't be posting it randomly to the website, which was the main concern of McArdle's article.
4. In example 5.24, does the large difference between the number of Republicans vs. Democrats polled not give skewed results? Why wouldn't you just poll the same number of people who identify with each political viewpoint?

20. 1) Because n is large enough and it almost acts as a normal distribution.
2) We use it to compute the standard error.
3) Check whether the source you are using is reliable: does the data make sense? Make sure the infographic has legitimate content rather than just a funky presentation with zero input.
4) When are we supposed to use the pooled estimate? I know it is when p1=p2, but what does that really mean?

21. 1) The sample is independent because the selection of the sample is random.
2) We use the pooled estimate of a proportion to verify the success-failure condition and also to estimate the standard error. You don't know the exposure rate, so you find a good estimate of it by pooling the results of both samples.
3) Make sure that the data is accurate and up to date. Also make sure to include all the samples or population's data instead of excluding data to prove a point.
4) The null hypotheses and pooling is confusing.

22. 1) Because if the sample sizes are that small, then taking a single observation without replacement will not significantly affect the proportions of the remainder population in question.

2) If H0 is false, then p1 != p2. Pooling the proportions will then result in an average proportion between p1 and p2, and will be unequal to either. Else, the pooled proportion will equal p1 and p2 and H0 will be true.

3) I can put clear sources next to the relevant data, not intentionally misinterpret data to make a point, and try to be as thorough as possible when reporting data.

4) If we take a one sided test where H0: p1 = p2, can we still pool the proportions?

23. 1. If the sample size is relatively small compared to the population size, then selecting one observation does not significantly affect the outcome of the other observations, hence each observation can be considered independent. 10% is the magic number that has been somehow determined by statisticians so that a sample meets this condition.

2. In a hypothesis test, the sample is always viewed as if the null hypothesis were true. In the case where H0: p1=p2, it is assumed that the two population proportions are equal therefore the calculation of the standard error must also follow this assumption. To make this the case, a pooled estimate of the population proportion is used instead of the two individual sample means.

3. Make the source information clear and legible.

4. In the prison/Princeton infographic, do the numbers for higher education spending include what the government spends plus what the public spends? All prison funding comes from the government whereas not all education is government funded.

24. 1) If the sample size is less than 10% of the population then the odds of those chosen for p1 affecting those chosen for p2 are very slim

2) We assume that the null hypothesis is true so p1 and p2 are the same

3) You should make sure to clearly list your sources

25. 1. The authors suggest that if a sample size is less than 10% of the population size, we can treat the observations in the sample as independent. Why is that reasonable?
Because the change in the probability for each observation would be so small that it’s negligible
2. When H0: p1 = p2, why does it make sense to use the pooled estimate of the sample proportion instead of the two individual sample means?
This allows you to estimate the exposure rate p.
3. Given McArdle's concerns about infographics, what are one or two things you can do when designing your application project infographic to make it an effective one?
The main thing would be to make sure the data is accurate and comes from an accurate source. Then, we should cite it properly so that no one doubts our infographic’s information.
How do you test other null hypotheses?

26. 1. I have no idea...
2. Under the condition that P1 = P2, P1 hat tends to equal P2 hat. Therefore the pooled estimate tends to be closer to the true mean because the sample size is larger than either individual group.
3. Using various color and shape to show the relationship between different data set.
4. I don't understand the Exercise 5.27. Why it is just an observation instead of experiment? Because there are two groups being divided according to the need of the research, why it is not an experiment?

27. 1. By having each sample be less than 10% of the population size, the observations are less likely to overlap and thus, the same observation has a smaller probability to be counted multiple times. Thus, the observations are not dependent on each other.

2. When creating hypothesis tests for proportions, we should use the null parameter. However, since the null parameter is not numeric in this case, we must instead approximate the overall success rate by taking the total number of successes over the total number of cases (or the "pooled estimate.")

3. Make sure our data is accurate and is not pulled from unreliable sources all across the Internet. Don't draw false conclusions from the data, correlation is not causation.

4. What are more examples of when to use the difference between two proportions method?

28. 1) The authors suggest that if a sample size is less than 10% of the population size, we can treat the observations in the sample as independent. Why is that reasonable?
> Because a sample size that small ensures that no two samples will be related.

2) When H0: p1 = p2, why does it make sense to use the pooled estimate of the sample proportion instead of the two individual sample means?
> Because we are hypothesizing that probability 1 is equal to probability 2, so we need to look at them together.

3) Given McArdle's concerns about infographics, what are one or two things you can do when designing your application project infographic to make it an effective one?
> Make sure to use correct information and document our sources well.

> Pooled estimates of sample proportion.

29. 1. If the sample is less than 10% of the population, we can reasonably assume that all trials are independent and our sample does not have any overlap.

2. If the null hypothesis is that p1=p2 then the proportion of the two samples should be similar, and the average of the two samples is a better estimate of the population.

3. In our info-graphic we will make sure that the data is authentic and that the source is readily available.

30. 1) 10% is large enough for using the normal model, but small enough that the inclusion of 1 observation in the sample doesn't affect the chances of inclusion of another observation.
2) If the null is that p1=p2, it makes sense to use the pooled estimate of the sample proportion since we are assuming the null and would need to see strong evidence to prove it false.
3) Ensure that the data is true (confirm by checking with other trusted sources)
4) No question.

31. 1) Because then the observations are independent both within the samples and between the samples

2) Because we do not know the exposure rate of "p"

3) All research should be verified as reasonable as to prevent false, ludicrous claims that would destroy the integrity of the infographic

4) None

32. 1) because it is a sufficiently small proportion of the population to not change the probability of an event.
2) Because it gives us a more accurate point estimate since there are more samples.
3) Give good, concise sources that actually have correct information easily available.
4) none

33. 1. From Section 2.4, we saw that observations were nearly independent even when sampling without replacement, when sample size is a small fraction of the population.
2. Because we don't know the exact proportion, but by pooling we get a good estimate of it.
3. Check the sources! Examine my emotional reactions!
4. ?

34. 1.) This is reasonable because samples smaller than 10% of the total population are thought to be small enough such that if there are any two elements of the population that somehow affect each other, we are unlikely to have captured them both.
2.) Using the pooled estimate of the sample proportion is justified because it represents the number of "successes" divided by the total number of cases, so we can use it in the framework of hypothesis testing when we assume that the null hypothesis is true. That is, when we do not know the common proportion p = p_c = p_n.
3.) Based on the article, when designing an infographic, one should first make sure the data represented in the graphic are accurate or defensible, at least. Creating an infographic with incorrect data is useless. Secondly, the infographic should not be overly dramatic. It should not focus on the flashiness of the graphics themselves at the cost of the quality of the data presentation. This causes confusion and, depending on the actual data, undue panic.
4.) When did infographics start being popular?

35. 1. It is reasonable because at this point you can assume that taking one sample has a negligible effect on the probability of what the next sample will be. Say you have a bag of a certain number of red and green marbles. We are claiming that if we take 10% or less of the marbles out, the proportion of red to green marbles left in the bag will still be about the same because the loss of the 10% won't have been big enough to affect it.

2. We are testing based on the results of both. By pooling the results, we are assuming that the two test are really one, and so able to treat them as one sample.

Given McArdle's concerns about infographics, what are one or two things you can do when designing your application project infographic to make it an effective one?
3. Pick sources that are well documented and have accurate information as well as to, as much as possible, not skew your data.

Not very technical, but what are the laws surrounding accuracy of infographics?

36. 1 – Small sample sizes would assume that a single observation will not affect the probabilities of other observations because of not replacing.

2- If you pool the proportions it will be the average proportion between p1 and p2, and won’t be equal to either

3- Presenting data in a fair way so it is not just to make a biased point, and using sources so people will be able to check my data. Also, no pie charts. Those are for rookies.
4- none

37. Observations look like they are independently drawn for cases when we sample from less than 10% of the population. (Section 2.5)

The pooled method gives a better estimate of the standard error than by testing each of the standard errors for the different groups separately.

When creating the infographic, we should probably draw upon statistics that are readily available and not highly disputed. Also, we probably shouldn't link to data that has little to do with our project. Finally, don't choose only a piece of the data set, leaving out the rest of the information listed in the set.

38. 1) Because if the sample sizes are very small, the observation will not significantly affect the rest in the population.

2) If H0 is false, then p1 is equal to p2. Pooling the proportions will let p1 and p2 be unequal to either. The pooled proportion will equal p1 and p2 and H0 will be true.

3)Make the resource of the data very clear and be objective.

39. 1.) It is reasonable because there's a low chance that the 10% of the population have had some form of interaction, and even if they have, it would make a very minimal difference.

2.) It is better to use the pooled estimate because the sample estimates might differ greatly from each other and the larger the population sample the closer we are to the actual population proportion.

3.) Use the sources that will be cited in the info-graphics, and make our data easily accessible to readers.

4.)None

40. 1. When sample sizes are this small, they can be seen as independent since they don’t affect the entire population in a large fashion.
2. Using the pooled estimate will give you a much more accurate average between the p1 and p2. Using the sample mean of the two will not be a true representation of the data.
3. Try to make sure that the information being represented is relevant and also that it is inferred correctly so the data can be demonstrated in the most detailed and accurate way.
4. Are we going to get some help constructing these infographics for our assignment?

41. 1. if sample size is less than 10 percent, then taking samples without replacement won't affect the independence of the rest of the data as much.

2. a pooled estimate of the sample proportion will be in between the two individual values. it's more accurate than using the two individual values because the true value will be between the two individual values.

3. be as fair as possible and not skew the data in order to favor one side or the other.

42. 1) because we want the sample size to be large and anything less than 10% of the population would be too small of a sample size

2) because you're comparing the number of successes or failures in two different groups

3) make sure we display our statistics correctly and in a way that seems meaningful, don't just use silly pictures and put the statistics next to them

43. 1. When a sample size is only a small fraction of the population, observations are nearly independent even when sampling without replacement.
2. It makes sense to use the pooled estimate because the sample means are likely to be equal, thus not allowing for any further analysis.
3. One important thing to do that McArdle points out is not to misrepresent your sources; make sure the data is presented in a clear, concise manner.
4. I have seen this material before and thus do not have any questions on the reading.

44. 1. When the sample size is less than 10%, the changes in the probability due to changing probability are so minute they don't affect the calculation.

2. A larger sample means the SE is lower. H0 is that the mean is the same between the two samples, so pooling the two samples provides larger sample size without altering the mean, which is what we are observing.

3. In the infographic, make sure the data is connected and cite the sources legibly.

4. What is the need to treat proportions differently? The book launches into examples without giving much motivating basis.

45. 1)The observations in the samples are independent because the samples sizes are very small.
2)Pooling the proportions results in an average proportion between the two p values
3)By interpreting data honestly without intentional misinterpretation, and by reporting data sources
4)Why do we pool the proportions?

46. 1)
The reason for that is because the sample is too small. So that, using an observation (without doing any replacement) will not affect the proportions of the remainder population.

2)
We use the pooled proportion estimate (p ) to verify the success-failure condition and also to estimate the standard error. That means we don’t know the exposure rate (p) in the beginning. That's the reason we use "pooled".

3)
I would attach a link to the raw data to my points. Also, I will try to be fair (try to include the other side's ideas and viewpoint) (by saying the other side, I mean the opposite opinion).

47. The authors suggest that if a sample size is less than 10% of the population size, we can treat the observations in the sample as independent. Why is that reasonable?

The assumption is that the samples are taken randomly; in which case, if it is less than 10% of the population size, it is highly unlikely the random group has dependencies unless the entire group has dependencies. Also, removing small amounts means that the conditional probability will only be slightly altered, thus making P(a and b) basically equal p(a)p(b)

When H0: p1 = p2, why does it make sense to use the pooled estimate of the sample proportion instead of the two individual sample means?

We are making the assumption that H0 is true. We do not know what p really is, but we are assuming p = p1 = p2 since we assume H0 true. Since p1 and p2 are in fact different from our actual sample, we find their pooled proportion which averages their difference such that the pooled proportion - pooled proportion is in fact zero and takes into account both p1 and p2 since we assume H0 to be true.
When we are carrying out a test, we don't know the value of p -- in fact, we are asking if there is any such single value -- so we don't claim to know the value. We calculate our best estimate of the standard error from our best estimate of p, which is "total number of successes/total number of trials".
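As a rough illustration of the pooling described above, here is a minimal Python sketch of a pooled two-proportion calculation. The function name and the counts (30 of 100 versus 20 of 100) are made up for illustration and do not come from the textbook; the formula is the usual pooled standard error, used on the assumption that H0: p1 = p2 is true.

```python
import math


def pooled_two_proportion_z(x1, n1, x2, n2):
    """Return (pooled proportion, standard error, z statistic) for
    H0: p1 = p2, given success counts x1, x2 and sample sizes n1, n2."""
    # Pooled estimate: total successes over total trials.
    p_pool = (x1 + x2) / (n1 + n2)
    # Under H0 both groups share one proportion, so the same p_pool
    # goes into both terms of the standard error.
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    # Observed difference in sample proportions, in standard-error units.
    z = (x1 / n1 - x2 / n2) / se
    return p_pool, se, z


# Hypothetical data: 30 successes in 100 trials vs. 20 successes in 100 trials.
p_pool, se, z = pooled_two_proportion_z(30, 100, 20, 100)
print(p_pool, se, z)
```

The design choice to notice is that the standard error uses the single pooled value rather than the two separate sample proportions, because the test is carried out as if the null hypothesis were true.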

Given McArdle's concerns about infographics, what are one or two things you can do when designing your application project infographic to make it an effective one?

Design all things with a purpose. Do not have sizes vary unless it means something. If you have an x and/or y axis, it should represent something. Nothing should be depicted for a random reason.

Make sure to have clear borders between different graphics. The viewer should instantly distinguish what different messages you are trying to convey without mixing them together.