Social Bookmarking Assignment #6

For your next social bookmarking assignment, find and bookmark a well-designed infographic. Some of you bookmarked an infographic for your first social bookmarking assignment, the #dataviz one. This time, I want everyone to find a good infographic so that you'll all have examples of infographics to inform your work on your application projects. (If you bookmarked an infographic earlier, you'll need to find a new one for this assignment!)

Not sure what an infographic is? Here are some examples of this genre of data visualization. (The first three were all bookmarked by Math 216 students earlier in the semester. The last one was designed by a senior at Queen's University.)

For Diigo users, tag your bookmark with "infographic." For Pinterest users, include the term "#infographic" in your pin's description.

Also: Those of you using Diigo need to add a picture to your Diigo profile. Any picture will do--it doesn't actually have to be a picture of you. If you're logged into Diigo, you can edit your picture using this link (I think). The Diigo group won't feel so sterile once we get rid of all your default avatars there.

Deadline: To get credit for this assignment, complete it by Friday, April 6th, before class begins.

Optional: Not all infographics are good infographics. "Ending the Infographic Plague" is a recent essay by Megan McArdle, a senior editor at The Atlantic. It's worth a read.

Image: "Interesting Pin," Derek Bruff, Flickr (CC)

Vehicle Miles Traveled vs. Crime Incidence, Tony Heath & Kyle Liming

Tony Heath & Kyle Liming

Math 216: Statistics & Probability for Engineers

Final Project Proposal

March 26, 2012

Each year in the United States we drive almost three trillion miles on our national roadway system. This amount has been steadily growing over the last half century at a rate of almost 49 billion vehicle miles traveled (VMT) per year.[1] In 2006 there were 1.4 million violent crime offenses and over 10 million property crime offenses as tabulated by the FBI.[2] Using the statistical technique of linear regression, the correlation and relationship between these two figures will be established, and the causes of this correlation will be questioned.

The first relationship that will be explored between these two data sets is how they correlate through time. The Federal Highway Administration (FHWA) has been collecting and aggregating vehicular travel data nationally, by state, and by region for every year since its founding in 1967. Likewise the Federal Bureau of Investigation (FBI) has been reporting violent and property crime data in its “Uniform Crime Report” since 1960. The second relationship to be explored is between VMT and crime incidence by location. Every urban area in the United States with a population greater than 500,000 is required to have a Metropolitan Planning Organization (MPO). The statistical technique of linear regression will allow us to examine the correlation between increases in VMT and crime incidence both over time and by location.

The primary statistical technique that will be used to answer the questions above is linear regression. For the relationship by location, we will plot VMT as the independent variable and crime incidence as the dependent variable, with each data point representing a city. The cities chosen will range from megalopolises such as New York and Chicago to smaller cities such as Huntsville, Alabama. The relationship between total VMT and crime incidence will be examined, and the relationship between VMT and crime incidence per capita will also be explored. By exploring the relationship on a per capita basis, the correlation that would likely arise from the hidden variable of city population will be eliminated. Although the hidden variable of population size is likely to have less of an impact when considering how VMT and crime incidence correlate over a period of time, the same process of examining both total and per capita VMT and crime incidence rates will be used to examine the relationship between VMT and crime incidence over time.
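The regression described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the four-city figures are made-up placeholders rather than FHWA or FBI data. It fits a least-squares line of crime incidence on VMT, once on totals and once per capita.

```python
from statistics import mean


def least_squares(x, y):
    """Return (slope, intercept, r) for a simple linear regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5  # sample correlation coefficient
    return slope, intercept, r


def per_capita(values, populations):
    """Divide each city's total by its population."""
    return [v / p for v, p in zip(values, populations)]


# Made-up placeholder figures for four cities (illustration only, not real data)
population = [8_200_000, 2_700_000, 650_000, 180_000]
vmt_millions = [20_000, 9_000, 3_500, 1_500]
crimes = [190_000, 110_000, 30_000, 9_000]

print("totals:    ", least_squares(vmt_millions, crimes))
print("per capita:", least_squares(per_capita(vmt_millions, population),
                                   per_capita(crimes, population)))
```

Comparing the two correlation coefficients gives a rough sense of how much of the relationship in the totals is carried by city size alone.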

It may seem obvious that as people drive more there is likely to be an increase in the amount of vehicle-related crime; however, by using linear regression and other statistical techniques, the relationship between overall crime and vehicle usage will be explored. As a society becomes more mobile, does the corresponding rise in affluence lead to a safer and more orderly civilization, or does the privacy that cars afford lead people to cut themselves off from society at large and strike out in a chaotic and violent manner?


[1] Bowman, M. (2008, June 04). Vehicle travel and emissions. Retrieved from http://www.slideshare.net/marcus.bowman.slides/vmt-and-emissions-19572050

[2] Federal Bureau of Investigation, (2010). Crime in the united states. Retrieved from website: http://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2010/crime-in-the-u.s.-2010

Who are the worst drivers?

Joe Cassiere

Kevin Johnson

Michael Thomas

Who are the Worst Drivers?

Statistical analysis is important for civil engineers in developing our nation’s transportation systems.  Speed limits, turning radii, traffic signaling and more are all designed to reduce the number of accidents on the roadways. Also, car insurance companies rely on statistical evidence to set rates at which they charge their customers. For the average motorist, it’s common to think that everyone else on the road is a bad driver.  Often stereotypes are made about certain people being worse on the road than others.  Which demographic, though, has the worst drivers?  Are men or women more likely to get into accidents?  Does one specific state or area of the country have more accident-prone drivers?  Are younger drivers more likely to crash than older ones, or vice versa?  Using statistical techniques, we will attempt to answer these questions.

To start, we will be using data from the US Census Bureau’s website.  The website supplies information on accidents by age, state, gender and several other categories.  For the question of age, our null hypothesis will be that all age groups get into accidents at roughly the same rate, while the alternate hypothesis will be that at least one age group is worse at driving than the others.  By setting up confidence intervals using the overall mean of the accident rates by the age groups, we will be able to tell which age groups are worse (or better) drivers.

To address the question of which state has the best drivers, we will again use hypothesis testing.  Our null hypothesis would be that most states have similar rates of driving fatalities, which would mean that the number of fatalities per driver has a low variance across states. Once we determine a variance that is suitably low, we would then be able to establish an alternate hypothesis, which says the variance is higher. Addressing this question with hypothesis testing allows us to determine if all states have roughly the same number of driving fatalities per driver.  Also, to test the driving skill of drivers by state, we could first find the overall mean and standard deviation of car accidents per capita in the United States by averaging all the states’ accidents per capita.  Then, we could construct confidence intervals of 90%, 95%, and 99%.  By comparing each state’s accident rate to the intervals, we will be able to tell which states have significantly better or worse drivers.
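One possible reading of the interval comparison above is a band of the form mean ± z·s built from the states' per capita rates, with any state outside the band flagged as unusually low or high. The sketch below assumes that reading and assumes roughly normal rates. The function names and example rates are hypothetical, and the numbers are made up for illustration.

```python
from statistics import NormalDist, mean, stdev


def rate_band(rates, level):
    """Return (low, high) = mean -/+ z * s for a two-sided level such as 0.95."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level is 0.95
    centre, spread = mean(rates), stdev(rates)
    return centre - z * spread, centre + z * spread


def flag_states(rates_by_state, level=0.95):
    """Map each state whose rate lies outside the band to 'lower' or 'higher'."""
    low, high = rate_band(list(rates_by_state.values()), level)
    flagged = {}
    for state, rate in rates_by_state.items():
        if rate < low:
            flagged[state] = "lower"
        elif rate > high:
            flagged[state] = "higher"
    return flagged


# Made-up accident rates per 1,000 residents (illustration only)
example_rates = {f"State {c}": 10 for c in "ABCDEFGHI"}
example_rates["State J"] = 30

for level in (0.90, 0.95, 0.99):
    print(level, flag_states(example_rates, level))
```

Widening the level from 90% to 99% widens the band, so fewer states are flagged; reporting all three levels shows how sensitive the "worst state" conclusion is to that choice.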

To test whether men or women are better drivers, we can use hypothesis testing again.  We can let the null hypothesis be that neither men nor women are better drivers than the other.  Numerically, we can say that the mean percentage of accidents that are men’s fault is 50%.  The alternate hypothesis can be that either men or women are worse drivers than the other gender.  Numerically, the alternate hypothesis is that the mean percentage of accidents that are men’s fault is greater than or less than 50%.
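A two-sided test of a proportion against 50% can be sketched as follows. This uses the large-sample normal approximation, which is an assumption on our part; the function name and the counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_proportion_test(at_fault_men, total_accidents, p0=0.5):
    """Return (z, p_value) for H0: p = p0 versus Ha: p != p0."""
    p_hat = at_fault_men / total_accidents
    standard_error = sqrt(p0 * (1 - p0) / total_accidents)
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 60 of 100 accidents attributed to men
z, p_value = two_sided_proportion_test(60, 100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value would lead us to reject the null hypothesis of a 50/50 split; it would not by itself say which gender drives more miles, so exposure would still need to be accounted for.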

For a finished product, we wish to use at least two important infographics. The comparison of drivers by state could be effectively depicted in a heat map of the United States. The shade of each state on the map will reflect the number of crashes per capita for its population. To address other factors simultaneously, we hope to put together a meaningful mosaic plot. This type of data visualization will be useful for trying to make predictions for the entire United States population based on our available data sets. For example, the mosaic could reflect that 22-year-old men from the Northeast are the “most likely” to be in a fatal car accident. Our preliminary research has found that there is a significant amount of data regarding the blood alcohol content (BAC) of drivers in fatal accidents. We will keep this data in mind as we move forward with our analysis in case our findings lack the appropriate complexity. BAC data could provide us with enough data for a strong linear regression analysis. In conclusion, we feel confident that both the quantity and quality of data regarding car accidents will appropriately fit the scope of the project, and we look forward to the results of our analysis.

Project proposal

Ejebagom John Ojogbo
Omotoyosi Taiwo
Tara Welytok

 

APPLICATION PROJECT PROPOSAL

The success of a movie is usually predicted based on its sales from the opening weekend. Our application project aims to test the claim that the opening weekend is the highest grossing weekend for a movie and (in general) is a strong indicator of the movie’s future success. Thus, the key question this project will be trying to answer is this: “Is the first weekend the most successful in a movie’s theatrical run?”

To test this claim, we shall be using the data from the file Weekend Movie Box Office Receipts (movieweekend.dat), acquired from the Journal of Statistics Education’s data archive. The dataset contains 49 movies that opened in theaters from 1977 up to 2007. The movies were taken from a variety of sources and they include Academy Award Best Picture winners, movies in series such as the Harry Potter collection, highest grossing movies, and pictures from the Sundance Festival. For each movie in the set, the data shows the following properties: the name of the movie; the week observed, e.g. the first week or the sixth; the weekend gross per theater (in dollars); and the weekend date. The weekend date used in the data refers to the Friday at the start of the weekend. Weekdays are not included in the dataset, but are not needed to test our claim.

To find answers to the question posed above, we shall apply hypothesis testing. Our null and alternative hypotheses are defined as follows:

H0: “the first weekend is the most successful”

Ha: “the first weekend is not the most successful”

We define the most successful weekend as the weekend that has the highest percentage of the gross box office earnings in dollars. Therefore we shall be analysing the contributions of the first weekends of each movie in the dataset as proportions of the entire box office gross. From the calculated values we expect to obtain a variety of confidence intervals, with special focus on 90%, 95% and 99%. These would give a very clear picture of how strong the results of our hypothesis tests are. It has been observed that after certain weeks the viewership (and by extension, the box office receipts) drops considerably. Adequately determining this point for the various movies would simplify the analysis process by reducing the number of weeks to be parsed.

We shall observe how many movies in the set have their opening weekends as their highest grossing, and using proper weighting and analysis tools we will use this to determine whether or not to reject our null hypothesis. In the course of testing and analysis we expect to discover if there are any correlations between a certain weekend’s gross and total box office receipts, and which weekend (if not the opening) is in fact the most successful in the theatrical life of a movie. We also expect to see trends concerning the week of a movie’s lifetime in which viewership declines considerably, and whether this can be applied to all/most movies in the dataset.
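The two quantities discussed above, the opening weekend's share of the observed gross and the fraction of movies whose opening weekend is their best, can be computed with a short sketch like the one below. It assumes the weekend figures have already been read into one list per movie in week order; the function names and the example movies are hypothetical and do not reflect the layout or contents of movieweekend.dat.

```python
def opening_share(weekend_grosses):
    """Opening weekend as a proportion of the total over all observed weekends."""
    return weekend_grosses[0] / sum(weekend_grosses)


def opening_is_peak(weekend_grosses):
    """True when no later weekend out-grosses the opening weekend."""
    return weekend_grosses[0] == max(weekend_grosses)


def fraction_opening_peak(movies):
    """Fraction of movies whose opening weekend is their highest grossing."""
    return sum(opening_is_peak(g) for g in movies.values()) / len(movies)


# Made-up per-theater weekend grosses in dollars (illustration only)
example_movies = {
    "Movie A": [9000, 5000, 2500, 1200],
    "Movie B": [1500, 4000, 3000, 1000],  # builds slowly, peaks in week 2
}

for title, grosses in example_movies.items():
    print(title, round(opening_share(grosses), 3), opening_is_peak(grosses))
print("fraction with opening peak:", fraction_opening_peak(example_movies))
```

The resulting fraction is a sample proportion, so it can feed directly into a confidence interval or a test of a proportion.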

References:
1. McLaren, C., DePaolo, C. (2009). Movie Data. Journal of Statistics Education Volume 17. Retrieved from http://www.amstat.org/publications/jse/v17n1/datasets.mclaren.html

Analysis of Tolerance Levels of Resistors

Rob Jackson, Gayashan Ediriweera, and Rocky Gray

Recently, we purchased a set of 860 resistors with 86 different values from Joe Knows Electronics for a senior design project. Joe Knows claims that these particular resistors have a 1% tolerance rating. We’ve tested a few resistors and they seemed to be within the tolerance rating. However, we purchased these resistors at an extremely low price and were actually expecting a 5% tolerance rating. We think that a relatively high percentage of the resistors (about 15%) will fall outside the 1% tolerance. So, we would like to test whether or not Joe really does know electronics.
Resistors are a key component used when constructing any type of circuit, whether it is simple or complex. The current that flows through a resistor is in direct proportion to the voltage across the resistor's two terminals. Thus, the ratio of the voltage applied across a resistor's terminals to the intensity of current through the circuit is called resistance.
In our experience as computer engineers, an extremely important property of a resistor is its tolerance level. Many resistors are built with a tolerance level that varies from 0.1% to 10%. For example, a pack of 12 resistors could be built to have a resistance of 100 ohms with a 1% tolerance level. This means that any resistor with a value of 99 to 101 ohms would be acceptable. This understanding of tolerance levels leads us to the basis of our experiment: whether our 860 resistors are within the tolerance level of 1%.
Our first job will be to collect data about the resistors. There are 86 different values within the 860 resistors, so there will be 10 resistors per value. We will test the resistance of the resistors within their respective groups and calculate the variance with respect to the expected values. This will help us determine whether or not the set of resistors is within the 1% tolerance level of their respective values. We will then find, as a whole, how many of the 860 resistors are actually within the 1% tolerance level.
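The per-group check described above amounts to computing each resistor's percent deviation from its nominal value and counting how many fall within 1%. A minimal sketch follows; the function names and the measurements are hypothetical, not readings from the actual set.

```python
def percent_deviation(measured_ohms, nominal_ohms):
    """Absolute deviation from the nominal value, as a percentage of nominal."""
    return abs(measured_ohms - nominal_ohms) * 100 / nominal_ohms


def fraction_within(measurements, nominal_ohms, tolerance_pct=1.0):
    """Fraction of measurements whose deviation is at most tolerance_pct."""
    in_spec = sum(percent_deviation(m, nominal_ohms) <= tolerance_pct
                  for m in measurements)
    return in_spec / len(measurements)


# Made-up readings for a group of ten nominal 100-ohm resistors
readings = [99.4, 100.2, 100.9, 99.1, 101.3, 100.0, 99.8, 100.6, 98.7, 100.1]
print("within 1%:", fraction_within(readings, 100.0))
print("within 5%:", fraction_within(readings, 100.0, tolerance_pct=5.0))
```

Repeating this for each of the 86 nominal values and pooling the counts gives the overall fraction of the 860 resistors that meet the claimed rating.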
Our hypothesis is that the tolerance level of the set of resistors is greater than 1%. If we set H0: µ = 1% and Ha: µ > 1%, then we’d place the burden of proof on ourselves. We want to shift the burden of proof over to Joe Knows, so we would instead set H0: µ = 1% and Ha: µ < 1%. However, this would not confirm our hypothesis; it would only enable us to reject the claim that the tolerance is less than 1%. But what if it is exactly 1% (which would still mean their claim is valid)? We are still researching how to set up our hypothesis test.
By using our confidence intervals and hypothesis tests we should be able to figure out if Joe truly knows electronics. It will also help us to figure out how large the range of values for resistors actually is. Since resistors are integral pieces of all circuits, they affect all electronic devices that people use. As computer engineers, we also want to know if the tolerance level actually does affect our circuits or if it’s just used as a means of covering up discrepancies in manufacturing.

Project Proposal: Electric vs. Gas Cars

Elizabeth Hill, Connor Baizan, and Tyler Cooksey
Math 216 Application Project
Professor Bruff
3/25/12

Electric cars have always been advertised as being the most cost-efficient and ‘green’ cars on the market, but there are really no statistics that compare electric car costs and environmental impacts while taking all of the data into consideration. All of the current statistics display the average cost per mile, but they don’t take into account one-time costs, like replacing the battery. Companies like Nissan, which produces the Nissan Leaf, estimate that the battery lasts between 5 and 10 years. Since the Nissan Leaf has only been around since 2010, there is no exact data confirming that statement. So for this project we will have to assume that Nissan’s estimate is correct. We will try to see if electric cars are truly more cost-efficient by comparing these popular car models: Nissan Leaf, 2012 Chevy Volt, 2012 Toyota Camry Hybrid LE, Ford Fusion, and KIA Optima.

Using the base model cost of each car found on each car’s respective website and data found on the EPA’s website, each data set can be compared. We will also have to make a few assumptions about the cost of replacing a battery. Since these cars are so new there is no set price for a battery replacement, so again we will have to use estimates that the car companies have stated. To test whether the Nissan Leaf is in fact the most cost-efficient car on the market, we will use hypothesis testing. The null hypothesis is that all other cars have an efficiency less than or equal to the Nissan Leaf’s efficiency, and the alternate hypothesis is that their efficiency is greater than the Nissan Leaf’s. Another thing current statistics don’t take into account is the price of the car to begin with. Electric cars cost around $25,000 while a similar gasoline car costs around $18,000. So in order for electric cars to be truly more cost-efficient they would have to save enough money to make up that difference. In order to collect statistics about initial costs of cars, similar sized cars will be researched and the resulting price of the base car will be used to create an average for electric and gasoline cars. On top of just the base cost of the car, the cost of gas or electricity for each car will also have to be factored in.
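The "make up that difference" argument is a break-even calculation: the extra purchase price, plus any one-time costs such as a battery replacement, divided by the savings per mile. The sketch below uses the $25,000 and $18,000 figures from the text; the per-mile costs, the battery cost and the function name are hypothetical placeholders, not researched values.

```python
def break_even_miles(price_electric, price_gas,
                     cost_per_mile_electric, cost_per_mile_gas,
                     one_time_costs=0.0):
    """Miles needed for per-mile savings to repay the higher up-front cost.

    Returns None when the electric car is not cheaper per mile, since the
    price gap can then never be recovered.
    """
    savings_per_mile = cost_per_mile_gas - cost_per_mile_electric
    if savings_per_mile <= 0:
        return None
    extra_up_front = price_electric - price_gas + one_time_costs
    return extra_up_front / savings_per_mile


# Purchase prices from the text; per-mile costs are made-up placeholders
print(break_even_miles(25_000, 18_000, 0.04, 0.12))
# Same comparison with a hypothetical $5,000 battery replacement added
print(break_even_miles(25_000, 18_000, 0.04, 0.12, one_time_costs=5_000))
```

If the break-even mileage exceeds what a typical owner drives over the life of the car, the electric car does not pay for itself under these assumptions.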

Another aspect that electric cars are known for is their ‘green’ factor. Companies that produce electric cars claim that the car has no emissions, but simply because the car itself doesn’t produce any emissions doesn’t mean they don’t exist. The power required to charge the battery is produced in a power plant, which produces emissions. There is no data that takes the power plant into account, so we want to find the rate of emissions per kWh that power plants produce and then use that to find the emissions produced per mile. That will then be compared to the emissions rate of a gasoline powered car. A hypothesis test will then be used in order to determine which kind of car is really more ‘green’. The null hypothesis will be that the electric car emissions rate is less than the gasoline rate, and the alternate hypothesis is that the electric car rate is greater than or equal to the gasoline rate.
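The conversion from power-plant emissions to emissions per mile is a unit calculation: grams per kWh times kWh per mile for the electric car, and grams per gallon divided by miles per gallon for the gasoline car. In the sketch below every number is a made-up placeholder and the function names are hypothetical.

```python
def electric_grams_per_mile(grid_grams_per_kwh, kwh_per_mile):
    """Power-plant emissions attributed to each mile driven on electricity."""
    return grid_grams_per_kwh * kwh_per_mile


def gasoline_grams_per_mile(grams_per_gallon, miles_per_gallon):
    """Tailpipe emissions for each mile driven on gasoline."""
    return grams_per_gallon / miles_per_gallon


# Made-up placeholder inputs (illustration only)
electric = electric_grams_per_mile(grid_grams_per_kwh=500, kwh_per_mile=0.3)
gasoline = gasoline_grams_per_mile(grams_per_gallon=9000, miles_per_gallon=30)
print(f"electric: {electric:.0f} g/mile, gasoline: {gasoline:.0f} g/mile")
```

Because the grid figure varies by region and fuel mix, the comparison would need to be repeated for each emissions rate considered.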

In conclusion, we are trying to figure out if, for the average consumer, an electric car is of good value, or if it’s merely hype, by way of hypothesis testing. We should be able to tell if it’s possible to save money by purchasing an electric car, and if it’s better for the environment as well.

Bibliography
EPA. (2012, 03 07). Test Car List Data Files. Retrieved 03 25, 2012, from United States Environmental Protection Agency: http://www.epa.gov/otaq/tcldata.htm

Calorie Cost Project

Math 216 Research Project
March 25, 2012
How Much Does a Calorie Cost?
A Case Study on Vanderbilt Munchie Marts

Project Proposal:

Vanderbilt students’ dining options center around the Meal Plan program. For First Years, food is plentiful and easily accessible, allowing for three meals a day, seven days a week. Sophomores and Juniors, however, are limited by ever decreasing plans that restrict them to 14 or even 8 meals a week. Ensuring students can get enough calories while on a limited food budget can require careful planning and research. How much food does the Vanderbilt Meal Plan really provide? This report explores the relationship between Munchie Mart food costs and the number of calories this food contains.

The FDA suggests that college-age students should consume on average 1800 - 2500 calories per day, depending on gender, body type and activity level (Wayne 2011). As average food costs for the nation continue to rise, it is important to make economical decisions when buying food. How much would it cost to supply a student the requisite caloric intake by shopping at the on-campus Munchie Marts? What are the most economical options on the meal plan? We will gather data from in-person visits to the store and record price, calories, weight, fat, and carbohydrate content for each item available in the store. We will also record whether it counts as a Meal Plan Entree, Side, or neither.

The primary question we will answer is whether or not a week’s worth of trips to the Munchie Mart can provide enough calories to meet FDA requirements. We will take a mathematical average of all the entree and side options and run a hypothesis test against the FDA suggestions. Given that a sophomore student has a default meal plan of two meals a day, each meal purchase should average at least 1000 calories. We will test whether an average meal contains the requisite 1000 calories or if Vanderbilt Dining does not provide students with enough food for their basic calorie needs. We will run hypothesis tests around the 1000-calorie mark with a null hypothesis of H0: µ ≥ 1000 calories and an alternative hypothesis of HA: µ < 1000 calories. We are structuring our hypotheses to avoid a Type II error that results in undernutrition. If it turns out that the average meal is not sufficient to provide a student their daily needs, it is important to know that careful planning may be required. Alternatively, students may be encouraged to re-evaluate their meal plan selections for the coming year, especially if they do not plan on grocery shopping on the side. Other questions we will attempt to answer include: what is the most efficient use of a single meal; is getting three sides instead of one entree and two sides that bad of an economic choice; if you must purchase with meal money, what is the cheapest dollar-per-calorie meal option?
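The lower-tailed test around the 1000-calorie mark can be sketched with a one-sample t statistic. The standard library has no t distribution, so this sketch assumes the critical value is looked up in a t table and passed in (2.132 is the one-sided 5% value for 4 degrees of freedom). The function names and the meal data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """One-sample t statistic for the sample mean against mu0."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


def reject_lower_tail(sample, mu0, t_critical):
    """Reject H0: mu >= mu0 in favor of HA: mu < mu0 when t < -t_critical."""
    return t_statistic(sample, mu0) < -t_critical


def dollars_per_calorie(price_dollars, calories):
    """Cost of one calorie for a single item or meal."""
    return price_dollars / calories


# Made-up calorie totals for five meal combinations (illustration only)
meal_calories = [900, 950, 1000, 850, 800]
print("t =", round(t_statistic(meal_calories, 1000), 3))
print("reject H0 at 5%:", reject_lower_tail(meal_calories, 1000, t_critical=2.132))
print("$/calorie:", dollars_per_calorie(7.50, 900))
```

Rejecting H0 here would suggest that the average meal falls short of 1000 calories; failing to reject would not prove that meals are sufficient.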

We hope this information will allow Vanderbilt students to make the most of their food options as well as provide an educational study on making smart economic decisions. It is important to understand how our consumer dollars are being spent, especially when we graduate and start buying our own groceries.

Authors

Abdul Kamaruzaman
Ben Draffin

References

Wayne, Jake. (2011). FDA recommended calories. Retrieved from http://www.livestrong.com/article/298080-fda-recommended-calories/

The Contract Year Effect: Fact or Fiction?

Anurag Bose, Nicholas Gould, Lucas Kunsman

In sports, the conventional wisdom is that a player’s performance in the last year of their contract is inflated because they are trying to increase their value as a free agent.  This is commonly referred to as the “contract year effect.”  However, does this supposed effect hold merit or is it based solely on the eye test and not backed up by numbers?

Major League Baseball provides an ideal platform to answer this question.  Since Bill James published his first Abstract in 1977, the sport has evolved to judge a player’s offensive and defensive performance with more advanced statistics than batting average and errors.  We propose using one of these measures, FanGraphs WAR (Wins Above Replacement), in testing the contract year effect.  The idea behind WAR is fairly similar to its name; it values the wins gained by a team if a player played a full season for them as opposed to if a “replacement level” player played the season for them.  It is a plus/minus statistic, where a replacement level player has a value of zero, and the further a value is from zero on the positive side, the more value a player provided his team in that year.  This statistic was chosen because it looks at all aspects of a player’s game in determining their value, as opposed to only their ability to hit for power or only their ability to steal bases, and is a measure of worth that can be applied to both pitchers and batters. (1)

We propose gathering WAR data from fangraphs.com (2) for 10 random free agents from ESPN.com for each of the free agent signing periods of 2009, 2010, and 2011 (3). We will use a random number generator and the list of free agents from ESPN to generate a sample of 30 players.  We will compare each player’s established average WAR over his career up until the contract year to his WAR from the contract year. We will use a hypothesis test with the null hypothesis that there is no contract year effect, and that the WAR of the contract year will match the players’ established career average WAR.  The alternative hypothesis will be that the WAR does not equal the established career average WAR, so that we may investigate whether any contract year effect, positive or negative, exists.

Potential limitations in this project include the varying performance of a player over their career; WAR from season to season can change dramatically. To counteract this we will consider all seasons of the player’s career as our baseline, and the contract year as the sample that we are testing the null and alternative hypotheses with.  Once these tests have been completed, we can construct confidence intervals at different confidence levels. If the player’s WAR in their contract year falls outside of the confidence interval then the null hypothesis can be rejected in favor of the alternative hypothesis. We can then look at individual players who have a statistically significant contract year and determine if there are any cases that interest us especially.
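One way to frame the comparison described above is as a paired test across players: for each player, take the contract-year WAR minus the average WAR of his earlier seasons, then test whether the mean difference is zero. This is our assumed reading, not necessarily the exact procedure intended. The function names and the WAR values are hypothetical, and the critical value must come from a t table (about 2.045 for a two-sided 5% test with 29 degrees of freedom, matching a sample of 30 players).

```python
from math import sqrt
from statistics import mean, stdev


def contract_year_differences(players):
    """players: list of (prior_season_wars, contract_year_war) pairs."""
    return [contract_war - mean(prior_wars)
            for prior_wars, contract_war in players]


def paired_t(differences):
    """t statistic for H0: the mean difference is zero."""
    n = len(differences)
    return mean(differences) / (stdev(differences) / sqrt(n))


def reject_two_sided(differences, t_critical):
    """Reject H0 when |t| exceeds the table value for n - 1 degrees of freedom."""
    return abs(paired_t(differences)) > t_critical


# Made-up WAR values for three players (illustration only)
sample = [([2.0, 4.0], 4.0), ([1.0, 1.0], 2.0), ([3.0], 6.0)]
diffs = contract_year_differences(sample)
print("differences:", diffs)
print("t =", round(paired_t(diffs), 3))
```

A two-sided test matches the stated alternative hypothesis, which allows for a contract year effect in either direction.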

Works Cited

(1) Slowinski, S. (2010, February 15). What is WAR? Retrieved from http://www.fangraphs.com/library/index.php/misc/war/

(2) Fangraphs. (2011). Baseball Player Search. Retrieved from http://www.fangraphs.com/players.aspx?lastname=

(3) ESPN. (2012). 2011 MLB Free Agents. Retrieved from http://espn.go.com/mlb/freeagents/_/type/ranked.