Vehicle Miles Traveled vs. Crime Incidence

Tony Heath & Kyle Liming

Math 216: Statistics & Probability for Engineers

Final Project Proposal

March 26, 2012

Each year in the United States, drivers log almost three trillion miles on the national roadway system, a total that has grown steadily over the last half century at a rate of almost 49 billion vehicle miles traveled (VMT) per year.[1] In 2006 there were 1.4 million violent crime offenses and over 10 million property crime offenses, as tabulated by the FBI.[2] Using the statistical technique of linear regression, we will establish the correlation and relationship between these two figures and question the causes of that correlation.

The first relationship to be explored between these two data sets is how they correlate through time. The Federal Highway Administration (FHWA) has been collecting and aggregating vehicular travel data nationally, by state, and by region for every year since its founding in 1967. Likewise, the Federal Bureau of Investigation (FBI) has reported violent and property crime data in its “Uniform Crime Report” since 1960. The second relationship to be explored is between VMT and crime incidence by location. Every urban area in the United States with a population greater than 500,000 is required to have a Metropolitan Planning Organization (MPO). The statistical technique of linear regression will allow us to examine the correlation between VMT and crime incidence both over time and by location.

The primary statistical technique used to answer the questions above is linear regression. For the relationship by location, we will plot VMT as the independent variable and crime incidence as the dependent variable, with each data point representing a city. The cities chosen will range from megalopolises such as New York and Chicago to smaller cities such as Huntsville, Alabama. We will examine the relationship between total VMT and crime incidence, as well as the relationship between VMT and crime incidence per capita. Working on a per capita basis removes the correlation that would likely arise from the hidden variable of city population. Although population size is likely to have less of an impact on the correlation over time, the same process of examining both total and per capita figures will be used for the relationship between VMT and crime incidence over time.
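The by-location fit can be sketched in a few lines. The city figures below are invented placeholders, not real FHWA or FBI numbers; only the least-squares method itself is the point.

```python
# Minimal sketch of the per-city regression described above. The (VMT,
# crime) pairs are invented placeholders, not real FHWA or FBI figures.

def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical (annual VMT in billions, crime incidents in thousands) per city.
cities = [(10.0, 50.0), (20.0, 95.0), (30.0, 160.0), (40.0, 200.0)]
slope, intercept = least_squares([v for v, _ in cities], [c for _, c in cities])
print(round(slope, 2), round(intercept, 2))
```

A slope significantly different from zero would indicate a linear association; the per-capita variant simply divides both coordinates by each city's population before fitting.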

It may seem obvious that as people drive more there is likely to be an increase in vehicle-related crime; however, using linear regression and other statistical techniques, we will explore the relationship between overall crime and vehicle usage. As a society becomes more mobile, does the corresponding rise in affluence lead to a safer and more orderly civilization, or does the privacy that cars afford lead people to cut themselves off from society at large and strike out in a chaotic and violent manner?

[1] Bowman, M. (2008, June 4). Vehicle travel and emissions. Retrieved from

[2] Federal Bureau of Investigation. (2010). Crime in the United States. Retrieved from

Who are the worst drivers?

Joe Cassiere

Kevin Johnson

Michael Thomas


Statistical analysis is important for civil engineers developing our nation’s transportation systems.  Speed limits, turning radii, traffic signaling, and more are all designed to reduce the number of accidents on the roadways. Car insurance companies likewise rely on statistical evidence to set the rates they charge their customers. For the average motorist, it is common to think that everyone else on the road is a bad driver, and stereotypes abound about certain people being worse on the road than others.  Which demographic, though, has the worst drivers?  Are men or women more likely to get into accidents?  Does one specific state or area of the country have more accident-prone drivers?  Are younger drivers more likely to crash than older ones, or vice versa?  Using statistical techniques, we will attempt to answer these questions.

To start, we will use data from the US Census Bureau’s website, which supplies information on accidents by age, state, gender, and several other categories.  For the question of age, our null hypothesis will be that all age groups get into accidents at roughly the same rate, while the alternative hypothesis is that at least one age group’s accident rate differs from the others.  By setting up confidence intervals around the overall mean of the accident rates across age groups, we will be able to tell which age groups contain worse (or better) drivers.

To address the question of which state has the best drivers, we can again use hypothesis testing.  Our null hypothesis is that states have similar numbers of driving fatalities, which would mean the fatality counts across states have a low variance. Once we determine a variance that is suitably low, we can establish an alternative hypothesis that the variance is higher. Addressing the question with hypothesis testing allows us to determine whether all states have roughly the same number of driving fatalities per driver.  Also, to compare the skill of drivers by state, we can first find the overall mean and standard deviation of car accidents per capita in the United States by averaging all the states’ accidents per capita.  Then, we can construct confidence intervals at the 90%, 95%, and 99% levels.  By comparing each state’s accident rate to these intervals, we will be able to tell which states have significantly better or worse drivers.
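That screening step can be sketched as follows. The state abbreviations are real, but every rate below is made up for illustration; the real values would come from the Census Bureau tables.

```python
import statistics

# Made-up accidents-per-capita rates keyed by state; only the method
# (flagging states far from the overall mean) is the point here.
rates = {"AL": 0.010, "AK": 0.010, "AZ": 0.010, "AR": 0.011, "CA": 0.011,
         "CO": 0.011, "CT": 0.012, "DE": 0.012, "FL": 0.012, "MT": 0.030}

mean = statistics.mean(rates.values())
sd = statistics.stdev(rates.values())

# Flag states more than 1.96 standard deviations from the overall mean
# (the 95% z-multiplier; 1.645 and 2.576 would give the 90% and 99% screens).
outliers = [state for state, r in rates.items() if abs(r - mean) > 1.96 * sd]
print(outliers)
```

States falling outside the interval would be the candidates for "significantly better or worse" drivers.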

To test whether men or women are better drivers, we can use hypothesis testing again.  We let the null hypothesis be that neither men nor women are better drivers than the other; numerically, the mean percentage of accidents that are men’s fault is 50%.  The alternative hypothesis is that one gender is worse than the other; numerically, the mean percentage of accidents that are men’s fault is greater than or less than 50%.
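That comparison reduces to a one-sample z-test for a proportion. The crash counts below are invented for illustration, not taken from any accident data.

```python
import math

# Hypothetical sample: n crashes with a determined at-fault driver,
# m of them attributed to men. These counts are made up.
n, m = 1000, 540
p_hat = m / n
p0 = 0.5  # null hypothesis: men are at fault in half of all accidents

# Standard z statistic for a proportion, and a two-sided p-value
# from the normal CDF (via the error function).
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 4))
```

A small p-value would mean the 50/50 null is implausible for the sampled crashes.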

For the finished product, we plan to use at least two informative infographics. The comparison of drivers by state could be effectively depicted in a heat map of the United States, with the shade of each state reflecting its number of crashes per capita. To address several factors simultaneously, we hope to put together a meaningful mosaic plot. This type of data visualization will be useful for making predictions about the entire United States population from our available data sets; for example, the mosaic could show that 22-year-old men from the Northeast are the “most likely” to be in a fatal car accident. Our preliminary research has found a significant amount of data regarding the blood alcohol content (BAC) of drivers in fatal accidents. We will keep this data in mind as we move forward with our analysis in case our findings lack the appropriate complexity; BAC data could provide enough material for a strong linear regression analysis. In conclusion, we feel confident that both the quantity and quality of data regarding car accidents will appropriately fit the scope of the project, and we look forward to the results of our analysis.

Project proposal

Ejebagom John Ojogbo
Omotoyosi Taiwo
Tara Welytok



The success of a movie is usually predicted based on its sales from the opening weekend. Our application project aims to test the claim that the opening weekend is the highest-grossing weekend for a movie and (in general) a strong indicator of the movie’s future success. Thus, the key question this project will try to answer is: “Is the first weekend the most successful in a movie’s runtime?”

To test this claim, we shall use the data from the file Weekend Movie Box Office Receipts (movieweekend.dat), acquired from the Journal of Statistics Education’s data archive. The dataset contains 49 movies that opened in theaters from 1977 to 2007. The movies were taken from a variety of sources and include Academy Award Best Picture winners, movies in series such as the Harry Potter collection, highest-grossing movies, and pictures from the Sundance Festival. For each movie in the set, the data records the following properties: the name of the movie; the week observed, e.g. the first week or the sixth; the weekend gross per theater (in dollars); and the weekend date. The weekend date used in the data refers to the Friday at the start of the weekend. Weekdays are not included in the dataset, but are not needed to test our claim.

To find answers to the question posed above, we shall apply hypothesis testing. Our null and alternative hypotheses are defined as follows:

H0: the first weekend is the most successful

Ha: the first weekend is not the most successful

We define the most successful weekend as the weekend with the highest percentage of the gross box office earnings in dollars. Therefore we shall analyse each movie’s first weekend as a proportion of its entire box office gross. From the calculated values we expect to obtain a variety of confidence intervals, with special focus on 90%, 95%, and 99%. These will give a clear picture of how strong the results of our hypothesis tests are. It has been observed that after a certain number of weeks, viewership (and by extension, box office receipts) drops considerably. Adequately determining this point for the various movies would simplify the analysis by reducing the number of weeks to be parsed.
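The per-movie check is simple to express in code. The weekend series below are invented stand-ins for rows of movieweekend.dat, used only to show the mechanics.

```python
# Invented weekend grosses per theater (dollars), weeks 1 onward; the real
# analysis would read these series from movieweekend.dat.
movies = {
    "Movie A": [9000, 6000, 3500, 2000],   # opens strong, then decays
    "Movie B": [4000, 6500, 5000, 3000],   # word-of-mouth peak in week 2
    "Movie C": [8000, 5000, 2500, 1000],
}

# For each movie, was the opening weekend also its highest-grossing weekend?
opened_highest = {name: g[0] == max(g) for name, g in movies.items()}
share = sum(opened_highest.values()) / len(opened_highest)
print(opened_highest, share)
```

The observed share of movies whose opening weekend is their best would then feed the hypothesis test and confidence intervals described above.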

We shall observe how many movies in the set have their opening weekend as their highest grossing, and, using proper weighting and analysis tools, determine from this whether or not to reject our null hypothesis. In the course of testing and analysis we expect to discover whether there are any correlations between a certain weekend’s gross and total box office receipts, and which weekend (if not the opening) is in fact the most successful in the theatrical life of a movie. We also expect to see trends concerning the week in a movie’s lifetime at which viewership declines considerably, and whether this applies to all or most movies in the dataset.

1. McLaren, C., & DePaolo, C. (2009). Movie data. Journal of Statistics Education, 17. Retrieved from

Analysis of Tolerance Levels of Resistors

Rob Jackson, Gayashan Ediriweera, and Rocky Gray

Recently, we purchased a set of 860 resistors in 86 different values from Joe Knows Electronics for a senior design project. Joe Knows claims that these particular resistors have a 1% tolerance rating. We have tested a few resistors, and they seemed to be within the tolerance rating. However, we purchased these resistors at an extremely low price and were actually expecting a 5% tolerance rating. We suspect that a relatively high percentage of the resistors (about 15%) will fall outside the claimed 1% tolerance. So, we would like to test whether or not Joe really does know electronics.
Resistors are a key component used when constructing any type of circuit, whether it is simple or complex. The current that flows through a resistor is in direct proportion to the voltage across the resistor’s two terminals. Thus, the ratio of the voltage applied across a resistor’s terminals to the intensity of current through the circuit is called resistance.
From our experience with resistors as computer engineers, an extremely important specification of a resistor is the tolerance level it maintains. Many resistors are built with a tolerance level that varies from 0.1% to 10%. For example, a pack of 12 resistors could be rated at 100 ohms with a 1% tolerance level, meaning that any resistor measuring between 99 and 101 ohms would be acceptable. This understanding of tolerance levels leads to the basis of our experiment: whether our 860 resistors are within the tolerance level of 1%.
Our first job will be to collect data about the resistors. There are 86 different values among the 860 resistors, so there are 10 resistors per value. We will measure the resistance of the resistors within their respective groups and calculate the variance with respect to the expected values. This will help us determine whether or not the set of resistors is within the 1% tolerance level of its respective value. We will then find, as a whole, how many of the 860 resistors are actually within the 1% tolerance level.
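The counting step reduces to a simple filter. The measurements below are invented values for one nominal 100-ohm group; the real data will come from our multimeter readings.

```python
# Invented measurements (ohms) for one nominal 100-ohm group of ten
# resistors; the real data will come from bench measurements.
nominal = 100.0
measured = [99.2, 100.4, 99.8, 101.5, 100.1, 98.7, 100.0, 99.5, 100.9, 100.3]

# A resistor is in spec if it lies within 1% of nominal (99 to 101 ohms here).
out_of_spec = [r for r in measured if abs(r - nominal) / nominal > 0.01]
fraction_bad = len(out_of_spec) / len(measured)
print(out_of_spec, fraction_bad)
```

Repeating this over all 86 groups gives the overall fraction of out-of-tolerance resistors, which is what our 15% suspicion is about.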
Our hypothesis is that the tolerance of the set of resistors is greater than 1%. If we set H0: µ = 1% and Ha: µ > 1%, we shift the burden of proof onto ourselves. But we want to shift the burden of proof onto Joe Knows, so we would set H0: µ = 1% and Ha: µ < 1%; however, this would not confirm our hypothesis. It would only enable us to reject the claim that the tolerance is less than 1%, and if the tolerance is exactly 1%, their claim would still be valid. We are still researching how best to set up our hypothesis test.
Using our confidence intervals and hypothesis tests, we should be able to figure out whether Joe truly knows electronics. This will also help us figure out how large the range of resistor values actually is. Since resistors are integral pieces of all circuits, they affect all electronic devices that people use. As computer engineers, we also want to know whether the tolerance level actually affects our circuits or whether it’s just used as a means of covering up discrepancies in manufacturing.

Project Proposal: Electric vs. Gas Cars

Elizabeth Hill, Connor Baizan, and Tyler Cooksey
Math 216 Application Project
Professor Bruff

Electric cars have long been advertised as the most cost-efficient and ‘green’ cars on the market, but there are few statistics that compare electric car costs and environmental impacts while taking all of the data into consideration. Current statistics display the average cost per mile but do not account for one-time costs, like replacing the battery. Companies like Nissan, which produces the Nissan Leaf, estimate that the battery lasts between 5 and 10 years. Since the Nissan Leaf has only been on the market since 2010, there is no data confirming that statement, so for this project we will have to assume that Nissan’s estimate is correct. We will test whether electric cars are truly more cost efficient by comparing these popular models: the Nissan Leaf, 2012 Chevy Volt, 2012 Toyota Camry Hybrid LE, Ford Fusion, and Kia Optima.

Using the base model cost of each car found on each manufacturer’s website and data found on the EPA’s website, each data set can be compared. We will also have to make a few assumptions about the cost of replacing a battery. Since these cars are so new there is no set price for a battery replacement, so again we will have to use the estimates the car companies have stated. To test whether the Nissan Leaf is in fact the most cost-efficient car of the group, we will use hypothesis testing: the null hypothesis is that the other cars are at least as cost efficient as the Nissan Leaf, and the alternative hypothesis is that they are less cost efficient. Another thing current statistics do not take into account is the initial price of the car. Electric cars cost around $25,000 while a similar gasoline car costs around $18,000, so for electric cars to be truly more cost efficient they would have to save enough money to make up that difference. To collect statistics about initial costs, similarly sized cars will be researched and their base prices used to compute averages for electric and gasoline cars. On top of the base cost of the car, the cost of gas or electricity for each car will also have to be factored in.
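A break-even calculation ties these pieces together. Every figure below is an assumed placeholder, not a quoted price from Nissan, the EPA, or any other source; only the arithmetic is the point.

```python
# Break-even sketch for the cost comparison above. Every figure is an
# assumed placeholder, not a quoted price from any manufacturer or agency.
electric_price = 25000.0   # assumed electric base price ($)
gas_price = 18000.0        # assumed comparable gasoline base price ($)
battery_cost = 5000.0      # assumed one-time battery replacement ($)
electric_per_mile = 0.04   # assumed electricity cost per mile ($)
gas_per_mile = 0.12        # assumed fuel cost per mile ($)

# Miles needed for per-mile savings to cover the price premium plus
# one battery replacement.
premium = (electric_price - gas_price) + battery_cost
break_even_miles = premium / (gas_per_mile - electric_per_mile)
print(round(break_even_miles))
```

Under these invented numbers the electric car only pays off after a large mileage, which is exactly the kind of sensitivity the real comparison needs to quantify.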

Another aspect electric cars are known for is their ‘green’ factor. Companies that produce electric cars claim the cars have no emissions, but simply because the car itself emits nothing does not mean the emissions do not exist. The power required to charge the battery is produced in a power plant, which produces emissions. There is no data that takes the power plant into account, so we want to find the rate of emissions per kWh that power plants produce and use that to find how many emissions are produced per mile. That figure will then be compared to the emissions rate of a gasoline-powered car. A hypothesis test will then be used to determine which kind of car is really more ‘green’: the null hypothesis will be that the electric car’s emissions rate equals the gasoline car’s rate, and the alternative hypothesis is that the electric car’s rate is lower.

In conclusion, we are trying to figure out, by way of hypothesis testing, whether an electric car is a good value for the average consumer or merely hype. We should be able to tell whether it is possible to save money by purchasing an electric car, and whether it is better for the environment as well.

EPA. (2012, March 7). Test car list data files. Retrieved March 25, 2012, from United States Environmental Protection Agency:

Calorie Cost Project

Math 216 Research Project
March 25, 2012
How Much Does a Calorie Cost?
A Case Study on Vanderbilt Munchie Marts

Project Proposal:

Vanderbilt students’ dining options center around the Meal Plan program. For first-years, food is plentiful and easily accessible, allowing for three meals a day, seven days a week. Sophomores and juniors, however, are limited by ever-decreasing plans that restrict them to 14 or even 8 meals a week. Ensuring students can get enough calories on a limited food budget can require careful planning and research. How much food does the Vanderbilt Meal Plan really provide? This report explores the relationship between Munchie Mart food costs and the number of calories that food contains.

The FDA suggests that college-age students consume on average 1,800 to 2,500 calories per day, depending on gender, body type, and activity level (Wayne 2011). As average food costs for the nation continue to rise, it is important to make economical decisions when buying food. How much would it cost to supply a student the requisite caloric intake by shopping at the on-campus Munchie Marts? What are the most economical options on the meal plan? We will gather data from in-person visits to the store, recording price, calories, weight, fat, and carbohydrate content for each item available, and noting whether each item counts as a Meal Plan entree, side, or neither.

The primary question we will answer is whether or not a week’s worth of trips to the Munchie Mart can provide enough calories to meet FDA recommendations. We will take the mathematical average of all the entree and side options and run a hypothesis test against the FDA suggestions. Given that a sophomore has a default meal plan of two meals a day, each meal purchase should average at least 1000 calories. We will test whether an average meal contains the requisite 1000 calories or whether Vanderbilt Dining does not provide students with enough food for their basic calorie needs, using a null hypothesis of H0: µ ≥ 1000 calories and an alternative hypothesis of HA: µ < 1000. We structure the hypotheses this way because the costly mistake is concluding that meals are adequate when they are not, an error that results in undernutrition. If it turns out that the average meal is not sufficient to provide a student’s daily needs, it is important to know that careful planning may be required. Alternatively, students may be encouraged to re-evaluate their meal plan selections for the coming year, especially if they do not plan on grocery shopping on the side. Other questions we will attempt to answer include: What is the most efficient use of a single meal? Is getting three sides instead of one entree and two sides really a poor economic choice? If you must purchase with meal money, what is the cheapest dollar-per-calorie option?
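The core test can be sketched as a one-sample t statistic. The per-meal calorie counts below are invented stand-ins for our Munchie Mart survey data.

```python
import math
import statistics

# Invented per-meal calorie counts standing in for the store survey.
meals = [950, 880, 1020, 900, 870, 990, 940, 860, 1010, 920]
n = len(meals)
mean = statistics.mean(meals)
sd = statistics.stdev(meals)

# One-sample t statistic against H0: mu >= 1000 calories; a large negative
# t favors the alternative that the average meal falls short.
t = (mean - 1000) / (sd / math.sqrt(n))
print(round(mean, 1), round(t, 2))
```

With the real data, this t value would be compared to the appropriate t critical value (n - 1 degrees of freedom) at each confidence level.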

We hope this information will allow Vanderbilt students to make the most of their food options as well as provide an educational study on making smart economic decisions. It is important to understand how our consumer dollars are being spent, especially when we graduate and start buying our own groceries.


Abdul Kamaruzaman
Ben Draffin


Wayne, Jake. (2011). FDA recommended calories. Retrieved from

The Contract Year Effect: Fact or Fiction?

Anurag Bose, Nicholas Gould, Lucas Kunsman

In sports, the conventional wisdom is that a player’s performance in the last year of their contract is inflated because they are trying to increase their value as a free agent.  This is commonly referred to as the “contract year effect.”  But does this supposed effect hold merit, or is it based solely on the eye test and not backed up by numbers?

Major League Baseball provides an ideal platform to answer this question.  Since Bill James began publishing his Baseball Abstracts in the late 1970s, the sport has evolved to judge a player’s offensive and defensive performance with more advanced statistics than batting average and errors.  We propose using one of these measures, FanGraphs WAR (Wins Above Replacement), to test the contract year effect.  The idea behind WAR is fairly similar to its name: it values the wins gained by a team if a player played a full season for them as opposed to if a “replacement level” player played the season instead.  It is a plus/minus statistic, where a replacement-level player has a value of zero, and the further a value is above zero, the more value a player provided his team in that year.  This statistic was chosen because it looks at all aspects of a player’s game in determining their value, as opposed to only their ability to hit for power or only their ability to steal bases, and because it is a measure of worth that can be applied to both pitchers and batters. (1)

We propose gathering WAR data (2) for 10 random free agents from each of the free agent signing periods of 2009, 2010, and 2011 (3). We will use a random number generator and the list of free agents from ESPN to generate a sample of 30 players.  We will compare each player’s established average WAR over his career up to the contract year against his WAR from the contract year. We will use a hypothesis test with the null hypothesis that there is no contract year effect, i.e. that contract-year WAR matches the player’s established career average WAR.  The alternative hypothesis is that contract-year WAR does not equal the established career average, so that we may investigate whether any contract year effect, positive or negative, exists.
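One natural way to run this comparison is as a paired t-test on the per-player differences. The (career-average WAR, contract-year WAR) pairs below are invented stand-ins for the FanGraphs numbers we will gather.

```python
import math
import statistics

# Invented (career-average WAR, contract-year WAR) pairs standing in for
# the FanGraphs data on the 30 sampled free agents.
players = [(3.1, 3.8), (2.0, 2.4), (4.5, 4.1), (1.2, 2.0), (2.8, 3.3)]
diffs = [contract - career for career, contract in players]

mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)
# Paired t statistic against H0: mean difference = 0 (no contract year effect).
t = mean_d / (sd_d / math.sqrt(len(diffs)))
print(round(mean_d, 2), round(t, 2))
```

A significantly positive mean difference would point to a contract-year boost; a negative one to the opposite.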

Potential limitations of this project include the varying performance of a player over his career; WAR can change dramatically from season to season. To counteract this, we will treat all seasons of the player’s career as our baseline and the contract year as the sample against which we test the null and alternative hypotheses.  Once these tests have been completed, we can construct confidence intervals at different confidence levels. If a player’s WAR in his contract year falls outside the confidence interval, the null hypothesis can be rejected in favor of the alternative. We can then look at individual players who had a statistically significant contract year and determine whether any cases especially interest us.

Works Cited

(1) Slowinski, S. (2010, February 15). What is WAR? Retrieved from

(2) Fangraphs. (2011). Baseball Player Search. Retrieved from

(3) ESPN. (2012). 2011 MLB Free Agents. Retrieved from

Exploring Relationships in Body Dimensions

Irene Hukkelhoven, Zachary Sanicola, Natalie Thoni

For years, the Body Mass Index (BMI) has been used to quantify obesity. Recently, questions have been raised concerning the appropriateness of using this index to define someone’s healthy weight range. A particularly daunting concern stems from the medical insurance practice of using an applicant’s BMI as one of the terms to compute and justify that person’s insurance premium. Moreover, some researchers insist that the “BMI is bogus” [1]. Our goal is to investigate the validity of this concern by redefining a healthy weight range and determining whether the BMI falsely labels a person as obese. As secondary topics, we will also discuss the applications of skeletal measurements in gender determination and define size ranges for manufacturers of retail and ergonomic goods.

The dataset [2] comprises 507 individuals. Specifically, our sample contains 247 men and 260 women, all of whom complete several hours of exercise a week. The majority of the subjects were in their twenties and early thirties at the time of sampling, though the dataset also contains a handful of older men and women. To concentrate on a more specific demographic, we will eliminate those men and women over a certain age. As such, we define the population as all physically active young adults.

Several questions we will attempt to address are as follows:
1. Is height a good indicator of weight? How accurately does the BMI assign under- or overweight statuses?
2. What skeletal bone is the best indicator of gender?
3. How many units of each shirt size should a clothing retail store order to ensure they will not be overstocked? How big should an airline make their seats to accommodate the smaller 95% of the population?

Studies have already shown that height is a poor indicator of weight, so we predict that our linear regression correlation will not be near an absolute value of 1. We will show that there are other body measurements with improved correlation values that can be used to better predict a subject’s “scale weight”. To justify the higher accuracy of using other body measurements, we will use hypothesis testing on a randomly selected group within our sample (two tests for each individual; H0: µ = x, HA: µ ≠ x, where µ is the scale weight and x is the weight predicted by (1) height alone and (2) other body measurements) and calculate and compare the respective p-values. Furthermore, we will attempt to define a more rigorous equation than BMI for determining whether an individual is within a healthy weight range, thus answering questions about the value of the obesity index for individuals whose body build is atypical for their height.

Contrary to popular belief, pelvic measurements are not the most reliable data for determining the gender of skeletal remains. Using histograms, we will uncover which skeletal measurement best evidences a male or female body. Specifically, we will seek to identify those body parts whose normal distribution curves for male versus female subjects overlap the least. Such information can be useful in forensic science and anthropological studies (e.g. identifying the remains of a missing person, or shedding light on ancient cultural burial rituals).

To figure out what quantity of each shirt size a buyer for a clothing store should order, we will calculate five two-sided confidence intervals to relate weight to shirt size (XS, S, M, L, XL). Then, we can use a probability density function to determine how many units of each shirt size a buyer should order without fear of being overstocked. In a similar manner, confidence intervals can be employed to calculate the size of an airplane seat suitable for the smallest 95% of the population (this will be a one-sided interval, since a seat big enough to fit the largest people will automatically also fit the smallest people).
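The one-sided seat-sizing computation is a direct normal-quantile lookup. The mean and standard deviation below are invented; the real parameters would be estimated from the body-dimensions dataset, under an assumption of approximate normality.

```python
import statistics

# One-sided sizing sketch for the airline-seat question above, assuming
# hip breadth is roughly normal with an invented mean and sd (cm).
mean_cm, sd_cm = 36.0, 3.0
z_95 = statistics.NormalDist().inv_cdf(0.95)  # one-sided 95% multiplier
seat_width = mean_cm + z_95 * sd_cm
print(round(seat_width, 1))
```

A seat at this width accommodates the smallest 95% of the assumed population; the shirt-size cutoffs work the same way with two-sided quantiles.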

In conclusion, our project will primarily highlight the topic of health, obesity, and BMI. As a secondary focus, we will look into the applications of skeletal measurements for gender determination as used in forensic and anthropological research. Lastly, we will address how our data can be used to provide useful information to clothing and furniture manufacturers.


1. Devlin, Keith. “Top 10 Reasons Why the BMI Is Bogus.” National Public Radio, 4 July 2009. Web. 25 Mar. 2012.

2. Heinz, Grete, Louis J. Peterson, Roger W. Johnson, and Carter J. Kerk. “Exploring Relationships in Body Dimensions.” Journal of Statistics Education 11.2 (2003). Amstat. Web. 25 Mar. 2012.

Bracketology: How does Seeding determine Success?

Lester Primus, Taylor Madison, Julian White

In the spirit of March, the best application of statistics is to analyze the famed NCAA college basketball tournament, also known as “March Madness”. Every year thousands of people of all ages rush to their favorite sports website to fill out a bracket in hopes that they could be the one to correctly predict the outcome of each game throughout the tournament. Though no one has ever correctly predicted 100% of the games played in the tournament, there have been reports of people guessing the entire first round correctly. The current system includes 68 teams, an increase from the 65 teams of 2010, which was itself an increase from the 64-team field established in 1985. This means that today there are 67 total games played, and for a team to be crowned champion it must win at least six games.

The possible combinations of surviving teams each year seem almost endless, especially to those doing their best to win competitions with their friends. Teams are ranked, or “seeded”, based on several factors from their season.  These factors include their win-loss record, their performance in their conference tournament, their strength of schedule, the number of ranked teams they defeated, and the number of unranked teams to which they lost. The teams ranked number one in each region are the top four teams overall, with the number two seeds being teams five through eight overall, and so on. Therefore, upon first glance, it is intuitive to predict that the higher-ranked team will always win its game. History, year after year, has proven otherwise. The victory of a lower seed over a higher seed is known as an “upset”. Lower-seeded teams that pull off multiple upsets to remain in the tournament are known as “Cinderella” teams. We hope to use statistics to answer the many questions of the common sports fan. We will analyze the relationship between seeding and likelihood of victory, as well as the potential of one seed to upset another.

First, we will determine the relationship, if any, between a team’s seed and the number of games it wins in the tournament. A linear relationship would show that the teams more likely to win, that is, the higher seeds, do in fact win. This could easily be set up with confidence intervals and hypothesis tests. If one were to pick the winners of each game purely based on seeding, how likely would that person be to be correct, and at what significance level? This can be answered by finding the average number of games a higher seed tends to win each year and comparing it to the average number a lower seed wins. Because there are sixteen different seeds, several seed matchups can be compared.

A second question is whether lower seeds are motivated by the “upset factor” to win their games. For example, out of the 32 first-round games, would an upset occur in at least half, suggesting that the lower seeds are collectively more likely to win? Here the null hypothesis is that the expected number of first-round upsets equals 16, while the alternative is that it is less than 16.
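Under the implicit null that each first-round game is a coin flip, the chance of k or more upsets is an exact binomial tail. This is only a sketch; the real analysis would replace the 0.5 upset probability with historical frequencies.

```python
from math import comb

# Sketch for the upset question above: if each of the 32 first-round games
# were a coin flip (upset probability 0.5), how likely are k or more upsets?
def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(binom_tail(32, 16, 0.5), 3))
```

Comparing the observed number of upsets per year to this tail probability gives the p-value for the 16-upset hypothesis.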

The number one and two seeds are considered the dominant teams of the tournament and are almost never picked to lose in the first round. In fact, a number one seed has never lost a first-round game. A final question will determine the likelihood of a number one or two seed making the Final Four (the tournament semifinals). This probability can be found by treating each year as a sample and counting the number of one seeds that make the Final Four each year. A confidence interval can then be constructed from these sample means and standard deviations.

Project Proposal: Alessandra, Medhi, Sonja

Household energy consumption is a major element of household expenditure and environmental impact. Understanding the factors that drive energy consumption is important for determining what affects it most and how to improve efficiency. Energy consumption is a crucial issue worldwide because energy resources are limited. Because the United States is a world leader and one of the top two energy-consuming countries worldwide[1], knowing the patterns and distribution of energy consumption in the U.S. could help shape energy use worldwide.

The United States Energy Information Administration conducts periodic surveys of household energy consumption and publishes the results in the document "Home Energy Use and Costs." These data describe the many factors that drive household energy consumption, including geographic location and demographics. It would be interesting to hypothesize why households in different geographic locations use more or less energy, and which demographic factors affect household consumption. From the data provided, energy consumption varies dramatically by geographic location, race, and income.

First, we will analyze the data from 2005[2] to observe geographic variance, determining which location uses the most energy per household and the standard deviation among the data values. The same will be done for the data on race and income. To determine which factor drives household energy consumption the most, we will compare the variability within each factor: the higher the variability, the more dependent consumption is on that factor.
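The variability comparison can be sketched as below. The Northeast and non-Hispanic white figures are from the survey values quoted in this proposal; the other group values are hypothetical placeholders for the published 2005 tables.

```python
# Per-household consumption (million Btu) grouped by two factors.
# 122.2 and 99.9 are from the survey; the rest are hypothetical.
by_region = {"Northeast": 122.2, "Midwest": 109.0, "South": 85.0, "West": 72.0}
by_race = {"Non-Hispanic White": 99.9, "Black": 90.0,
           "Hispanic": 85.0, "Other": 88.0}

def spread(groups):
    """Sample standard deviation of the group means."""
    vals = list(groups.values())
    m = sum(vals) / len(vals)
    var = sum((v - m) ** 2 for v in vals) / (len(vals) - 1)
    return var ** 0.5

for name, groups in [("region", by_region), ("race", by_race)]:
    print(f"spread across {name} groups: {spread(groups):.1f} million Btu")
```

Under these placeholder values, the spread across regions exceeds the spread across races, which would mark location as the stronger driver by this criterion.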

According to the data, the Northeast region shows the highest energy consumption per household, at 122.2 million Btu. To verify this, we will conduct a statistical analysis that yields the percentile of energy consumption associated with the northeastern region, giving us an idea of how much greater Northeast consumption is compared to the other regions of the United States. For consumption by income, households earning $100,000 or more consume the most energy of any income level, at 130.5 million Btu per household. The apparent trend is that the greater the income, the more energy consumed; we can test the linearity of this trend. From the data on consumption by race, non-Hispanic white households consume the most energy, at 99.9 million Btu per household, although consumption appears to vary less by race than by geographic location or income.
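The percentile calculation for the Northeast can be sketched as follows, assuming a normal model for regional consumption. The 122.2 figure is from the survey; the other three regional values are hypothetical placeholders.

```python
import math

# Regional means (million Btu per household); Northeast is from the
# survey, the others are placeholders for the published 2005 values.
regions = {"Northeast": 122.2, "Midwest": 105.0, "South": 88.0, "West": 70.0}

vals = list(regions.values())
mu = sum(vals) / len(vals)
sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / (len(vals) - 1))

# z-score and normal-approximation percentile of the Northeast value.
z = (regions["Northeast"] - mu) / sigma
percentile = 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(f"Northeast z = {z:.2f}, percentile = {percentile:.0%}")
```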

For each factor, we will attempt to determine whether the data fit a normal distribution or are skewed. We will then find the mean and standard deviation of the data and, if the data are normal, calculate z-scores from these values. Finally, for each factor, we will calculate the probability that energy consumption takes at least a given value based on that factor.
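The z-score step can be sketched as below, with a hypothetical mean and standard deviation standing in for the fitted values of one factor level.

```python
import math

# Hypothetical mean and standard deviation of household consumption
# (million Btu) for one factor level, assuming a normal model.
mu, sigma = 95.0, 15.0

def prob_at_least(x):
    """P(consumption >= x) under the normal model, via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

print(f"P(consumption >= 120 million Btu) = {prob_at_least(120.0):.3f}")
```

The same function, refit with each factor's mean and standard deviation, would give the probability of exceeding any consumption threshold under that factor.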

After the locations, races, and incomes with the highest and lowest household energy consumption are determined, we can hypothesize why this is the case. For example, certain locations have higher temperatures that may drive higher energy consumption for cooling. For race, differences may reflect varying levels of concern for sustainability or education about energy use. Higher-income households may use more energy because greater disposable income supports more appliances and other commodities. Lastly, these factors may be interdependent, so a more in-depth study would examine the ways they affect one another.


[1] Swartz, S., & Oster, S. (2010, July 18). China tops U.S. in energy use. Retrieved from

[2] Energy Information Administration. (2005). 2005 residential energy consumption survey: energy consumption and expenditures tables. Retrieved from