# Preview of Chapters 5 and 6

At the end of class last Friday, I pointed you to a set of problems meant to preview Chapters 5 and 6. Since you'll be proposing your application projects before we've had a chance to cover all of Chapters 5 and 6, I wanted to give you a sense of the kinds of statistical questions (that we haven't already seen) that you can tackle in your projects. Here's that list of preview problems again. Below I've classified them by type of problem.

1. Hypothesis test for a population mean, large sample size - These are the hypothesis tests we've talked about in class.
2. Hypothesis test for the difference in two population means, with paired data, large sample size
3. Hypothesis test for the difference in two population means, with unpaired data, large sample size - Can you tell the difference between "paired data" and "unpaired data" based on these two problems?
4. Hypothesis test for a population proportion (not a population mean), large sample size
5. Hypothesis test for the difference in two population proportions, large sample size
6. Hypothesis test for a population mean, small sample size - Right now, we can handle small sample sizes only if we know the underlying population is normally distributed. Later, we'll learn how to handle small sample sizes under other conditions.
7. Hypothesis test for the difference in two population means, small sample size

Image: "Binoculars," vestman, Flickr (CC)

# More on the Choice of Alternate Hypotheses

Here's a longer version of my explanation of that first clicker question today, for those interested:

Clicker Question: Water samples are taken from water used for cooling as it is being discharged from a power plant into a river. It has been determined that as long as the mean temperature of the discharged water is at most 150 degrees Fahrenheit, there will be no negative effects on the river's ecosystem. To investigate whether the plant is in compliance with the regulations that prohibit a mean discharge water temperature above 150 degrees, 50 water samples will be taken at randomly selected times, and the temperature of each sample recorded. Which of the following hypothesis tests should be used?

1. H0: μ = 150 vs. HA: μ < 150
2. H0: μ = 150 vs. HA: μ > 150

If we were to get a low p-value (say, p = 0.02) with option 1, then that would be strong evidence that the mean water temperature is less than 150 degrees. (There would be only a 2% chance that we would get a sample mean as low as, say, 145 degrees if the mean temperature were really 150, and an even smaller chance if it were greater than 150. Since this probability is so low, we would conclude that the mean water temperature must be lower than 150.)

If we were to get a low p-value (say, p = 0.02) with option 2, then that would be strong evidence that the mean water temperature is greater than 150 degrees. (There would be only a 2% chance that we would get a sample mean as big as, say, 155 degrees if the mean temperature were really 150, and an even smaller chance if it were less than 150. Since this probability is so low, we would conclude that the mean water temperature must be greater than 150.)
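To make the two options concrete, here is a minimal sketch of how each p-value could be computed with a large-sample z-test. The sample size of 50 comes from the clicker question; the sample mean (151 degrees) and sample standard deviation (3.5 degrees) are made-up numbers for illustration, and the function name `one_sided_p_values` is my own.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_values(sample_mean, sample_sd, n, mu0=150.0):
    """Return (p_lower, p_upper) for a large-sample z-test of H0: mu = mu0.

    p_lower goes with option 1 (HA: mu < mu0),
    p_upper goes with option 2 (HA: mu > mu0).
    """
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_lower = NormalDist().cdf(z)        # chance of a sample mean this low or lower
    p_upper = 1.0 - NormalDist().cdf(z)  # chance of a sample mean this high or higher
    return p_lower, p_upper


# Hypothetical data: 50 samples, mean 151 degrees, standard deviation 3.5 degrees.
p_lower, p_upper = one_sided_p_values(sample_mean=151.0, sample_sd=3.5, n=50)
print(f"Option 1 (HA: mu < 150): p = {p_lower:.3f}")
print(f"Option 2 (HA: mu > 150): p = {p_upper:.3f}")
```

Notice that the two p-values always add up to 1, so the same data can never give strong evidence for both alternatives at once.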

Now suppose you're the power plant owner, and you want to avoid an unnecessary fine by the EPA for high discharge water temperatures. You would want to use option 2 since you'll want to see strong evidence that your water temperature is too high. If you got a low p-value, then that would be strong evidence that your water temperature is too high, so you would go along with the EPA fine. If you got a high p-value, then that wouldn't be strong evidence that your water temperature is too high, so the EPA presumably wouldn't fine you. Requiring a low p-value before a fine is issued minimizes the chance of an unnecessary fine.

Now suppose you're Greenpeace and you really don't want high water temperatures to kill fish. You would want to use option 1 since you'll want to see strong evidence that the water temperature is low enough. If you got a low p-value, then that would be strong evidence that the water temperature is low enough, so you would be okay with the EPA not fining the power plant. If you got a high p-value, then there would not be strong evidence that the water temperature is low enough, so you would want the EPA to go ahead and fine the power plant. Requiring a low p-value before the plant is let off minimizes the chance that the power plant will "get away" with too-high water temperatures.

In practice, you would typically collect your water samples and compute the sample mean. If the sample mean were bigger than 150 and you're the power plant, then you would want to conduct an option-2 hypothesis test to see if there's really sufficient evidence that your water is too hot. If the sample mean were bigger than 150 and you're Greenpeace, then you would just say, "Fine the power plant."

If the sample mean were lower than 150 and you're the power plant, then you would tell the EPA not to fine you since your water is cool enough. If the sample mean were lower than 150 and you're Greenpeace, then you would want to conduct an option-1 hypothesis test to see if there's really sufficient evidence that the water is cool enough.

Image: "Backwash 5," Pulpolux, Flickr (CC)

# More on the Monty Hall Problem

Once again, here's the Monty Hall problem:

Monty Hall offers you a choice of three doors.  Behind two are goats, and behind the other door is a brand new car.  After you make your choice, Monty opens one of the doors you didn’t choose to reveal a goat, and offers you the chance to switch your choice to the one remaining.  What should you do?  (Assume Monty knows where the car is and always opens a door with a goat after the contestant chooses.)

1. Switch
2. Stay
3. It doesn't matter.

Many people think that once Monty opens a door to reveal a goat, with only two doors remaining, their odds of winning the car are 50-50.  That explains why many people think it doesn't matter if you stay or switch.  Some people think that perhaps Monty is trying to talk them out of sticking with their original door because it contains the car.  These people are likely to think that staying with their first choice gives them better than 50-50 odds.

However, the correct way to think about this problem is that you had a 1/3 chance of guessing right on your first try. After Monty opens a door and reveals a goat, there's still a 1/3 chance that you were right, which means there's still a 2/3 chance that you were wrong. Before Monty opened that door, that 2/3 chance that the car was not behind your door was split between two doors. Now, however, there's a 2/3 chance that the car is behind the remaining door that you didn't choose. So you should switch. On average, you would expect to win a car 2/3 of the time if you switch.
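If the argument above still feels slippery, a quick simulation can help. The sketch below plays the game many times under the rules as stated (Monty knows where the car is and always opens a goat door). The function names, the number of trials, and the choice to have Monty pick at random when he has two goat doors available are my own assumptions.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the original pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning with a fixed stay/switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print(f"Win rate when switching: {win_rate(switch=True):.3f}")
print(f"Win rate when staying:   {win_rate(switch=False):.3f}")
```

With this many trials, the switching win rate should land close to 2/3 and the staying win rate close to 1/3.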

Here's the tree diagram I showed briefly in class. It helps to model this random process in something of a chronological order. First, the car is placed behind one of the three doors randomly, with each door being equally likely.  Then, you select your door, again with each door being equally likely. Then, Monty opens one of the other doors to reveal a goat. (It's important that Monty knows where the goats are here! Otherwise you get a different tree diagram.) When you look at the outcomes, not all are equally likely because Monty's choices are limited in some cases. Of all the outcomes, weighted by their respective probabilities, your choice of door was correct only 1/3 of the time. Thus, switching doors is the better strategy.
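The branches of that tree can also be listed out by a short program. This is a sketch of the same chronological process, using exact fractions instead of a picture. One assumption that isn't stated in the problem: when you've picked the car and Monty has two goat doors to choose from, I assume he opens each with probability 1/2. The variable names are mine.

```python
from fractions import Fraction

doors = (1, 2, 3)
outcomes = []  # each entry: (car door, your pick, door Monty opens, probability)
stay_wins = Fraction(0)
switch_wins = Fraction(0)

for car in doors:            # first branch: car placed at random, 1/3 each
    for pick in doors:       # second branch: you pick at random, 1/3 each
        # Third branch: Monty must open a goat door that you didn't pick.
        goat_options = [d for d in doors if d != pick and d != car]
        for opened in goat_options:
            prob = Fraction(1, 3) * Fraction(1, 3) * Fraction(1, len(goat_options))
            outcomes.append((car, pick, opened, prob))
            if pick == car:
                stay_wins += prob    # your first choice was right
            else:
                switch_wins += prob  # the remaining closed door hides the car

for car, pick, opened, prob in outcomes:
    print(f"car={car} pick={pick} Monty opens={opened} probability={prob}")
print(f"P(win by staying)   = {stay_wins}")
print(f"P(win by switching) = {switch_wins}")
```

The outcomes where your first pick was right each carry half the probability of the others, because Monty's choice splits those branches in two. That's why the outcomes aren't all equally likely.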

To hear what Monty Hall himself had to say about this problem, check out this New York Times interview. One thing made clear in that interview is that Monty did indeed know where the goats were. But, in contrast to the problem as I stated it above, he wasn't required to always open a second door and offer contestants the chance to switch. If the contestant chose a door with a goat, Monty didn't have to make any deals!

Image: "Goats Only," by me

# Conceptual Populations and Population Means

The clicker question today about the relationship between a sample mean (x̄) and a population mean (μ) was a challenging one. Here's a bit more context for that question.

Suppose you weigh a rock on a scale several times, each time getting a slightly different reading. Assuming that the physical characteristics of the rock aren't changing, then these readings can be considered a random sample, taken from a conceptual population consisting of all the readings that the scale could in principle produce.

Finding the sample mean (x̄) for this sample is straightforward: you add up the readings and divide by the number of readings. However, the population mean (μ) isn't so straightforward. It's not possible to add up all the readings that the scale could in principle produce, much less divide by the number of such readings.  That's why I objected to the statement in the clicker question that said that sample means and population means are calculated in the same way. You can't always "calculate" a population mean.

So what do we mean by "population mean" for conceptual populations? One way to think of it is to define the population mean as the mean of a sample that (somehow, miraculously) follows the population distribution perfectly. However, that requires defining a "population distribution," which we haven't done yet. For now, perhaps it's better to think of the population mean as the expected value of the population.

That rock has, in a sense, a "true" weight. That's the expected value of the population. We can't calculate that or even really know what it is. But we can calculate the mean of our sample of readings, and that sample mean is likely to be close to the "true" weight of the rock. Thus the population mean for this conceptual population is this "true" weight, since it's the expected value of any reading we might take, and so of the mean of any sample we might take.
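Here's a small simulation of the rock-weighing idea, as a sketch only. In real life we never get to see the "true" weight, but in a simulation we can pick one and watch what the sample means do. The true weight of 250 grams, the normally distributed measurement error with standard deviation 0.5 grams, and the assumption that the scale doesn't read systematically high or low are all made up for this example.

```python
import random
from statistics import mean

TRUE_WEIGHT = 250.0  # hypothetical "true" weight in grams (unknown in practice)
NOISE_SD = 0.5       # hypothetical spread of the scale's measurement error


def take_readings(n, rng):
    """Simulate n readings drawn from the conceptual population of all readings."""
    return [rng.gauss(TRUE_WEIGHT, NOISE_SD) for _ in range(n)]


rng = random.Random(42)
small_sample = take_readings(5, rng)
large_sample = take_readings(50_000, rng)

print(f"Mean of 5 readings:      {mean(small_sample):.3f}")
print(f"Mean of 50,000 readings: {mean(large_sample):.3f}")
print(f"'True' weight:           {TRUE_WEIGHT:.3f}")
```

Neither sample mean is calculated from the whole population, since there's no finite list of all possible readings to add up. Still, the more readings we take, the closer the sample mean tends to sit to the population mean.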

Hope that helps, at least a little.

Image: "Dravite," Craig Elliott, Flickr (CC)