Here’s your first reading assignment. Read Sections 1.1 through 1.3 (omitting Section 1.3.7) in your textbook and answer the following questions by 8 a.m., Friday, January 13th. To answer the questions, log in to the blog and leave your answers in a comment. Your comment will only be visible to you, me, and the TAs.

- What is the relationship between $$\overline{x}$$ and $$\mu$$?
- Is the given variable discrete or continuous? (a) The number of heads in 100 tosses of a coin. (b) The length of a rod randomly chosen from a day’s production. (c) The age of a randomly chosen Vanderbilt student.
- When constructing a histogram, why is the choice of bin size (that is, the size of the range of values that are placed in a single bin) important?
- What’s one question you have about the reading?

**Update:** I just (Thursday at midnight) changed a setting in the “Semi-Private Comments” plugin. The comments section of this post should now work as follows: If you’re logged into the blog and leave a comment, that comment will only be visible to you (anytime you’re logged in) and me (since I’m the admin here). If it doesn’t seem to work that way for you, let me know via email.

**Second Update:** If you have to give your name and email when trying to leave a comment, then you’re not logged in.

1. Xbar (is there an easier way to insert symbols than copying from Word?) represents the mean (average value of a series of numerical data points) of a specific sample, while mu represents the mean for an entire population. Xbar should hypothetically match mu with a small degree of error.

2. a) Discrete

b) Continuous

c) Discrete (Unless age is being taken as a decimal for high degrees of accuracy)

3. If the expected significance of the data is to be found in small variances in numbers, a large bin size will cluster otherwise significantly separated values into the same bar, eliminating the histogram’s effectiveness.

4. When calculating variance, why is the sum of values squared divided by n-1 rather than the full sample size? I could be missing an obvious mathematical fact but it wasn’t immediately apparent.
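
One way to see what the n-1 does is to try both divisors on a case small enough to enumerate. The following is a minimal sketch in Python, assuming a toy population {1, 2, 3} and samples of size 2 drawn with replacement (both choices are illustrative and not from the textbook). It averages each version of the variance formula over every possible sample and compares the result with the population variance.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor


# Assumed toy population and sample size (illustrative only)
population = [1, 2, 3]
n = 2

# Population variance: divide by the population size
pop_var = variance(population, len(population))

# Every possible ordered sample of size n, drawn with replacement
samples = list(product(population, repeat=n))

# Average of each estimator over all possible samples
avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("population variance:    ", pop_var)
print("average when using n:   ", avg_div_n)
print("average when using n-1: ", avg_div_n_minus_1)
```

In this toy case, dividing by n gives 1/3 on average, which underestimates the population variance of 2/3, while dividing by n-1 matches it on average. The sample mean sits closer to the sample's own points than the population mean does, so the squared distances come out a little too small, and the n-1 compensates for that.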

1. x bar and mu both represent means of a set of data, but x bar represents the mean of a specific sample whereas mu denotes the mean for an entire population. In a perfect world, x bar and mu would have the same value.

2. a) discrete

b) continuous

c) discrete

3. a data set can show different things when manipulated in different ways. It is important that the histogram shows the data in a way that makes sense and displays the data in some sort of helpful fashion. If the bins are too big or too small you will not be able to see the trends.

4. I found everything pretty straightforward. I was a bit confused about categorical vs ordinal data but the book said it was lumping them together anyway.

1) x̅ is the mean of just a select number of data points, while µ is the mean of the whole population. So since x̅ is just a percentage of µ, µ is a more accurate representation but also a lot harder to acquire which is why x̅ is used most of the time.

2) a) discrete b) continuous c) it can be continuous or discrete depending on what unit of measurement you are using

3) The choice of bin size is important because it determines how accurate the histogram is. If the bin size is small there are a lot of different bars which might make reading the data less efficient. If the bin size is large though there are a few big bars which might not tell you enough information about the data. So you need to find a bin size that makes the data easy to read while still retaining enough information.

4) When trying to analyze data how do you decide which graph is easiest to use if there is no obvious answer?

1. Xbar represents the average of a certain sample while mu represents the average of a population.

2. a. discrete

b. continuous

c. it depends. It is discrete if you are only counting by years. But it is continuous if you can subdivide it even further.

3. An improper bin size can lead to a misleading representation of the data. Large bins can cluster data and small bins can result in connections not being as apparent as they would be with a correctly sized bin.

4. Can you clarify how to come up with a proper bin size?

1) x̅ is the sample mean. μ is the population mean. x̅ can be used as a rough estimate for μ.

2) (a) discrete (b) continuous (c) discrete (in years)

3) Choosing an appropriate bin size is important because both undersized and oversized bins make it difficult to understand the data that is being presented.

4) Is there a precise definition for what makes a given statistic robust or not?

1. Both mu and x-bar are the statistical means of a data set and are calculated in the same way: sum of the numerical observations divided by the number of observations. The difference is that x-bar is the mean of a sample while mu is the mean of a population.

2.

a) discrete

b) continuous

c) depends on units. Smaller units (e.g. seconds) will result in a continuous variable, while larger (e.g. years) will result in a discrete variable.

3. If bin size is too large, it won’t allow one to see variations in the data that they may be looking for. However, if it is too small, the histogram loses its effectiveness as a grouped representation of data and begins to resemble a dot plot.

4. How to select a proper bin size (is it a “feel” thing or are there guidelines not outlined in the text?)
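
There are some common rules of thumb for a starting bin count or bin width, although none of them replaces looking at the result. The sketch below, in Python, shows three that are widely used: the square-root rule, Sturges' rule, and the Freedman-Diaconis width. The function names and the sample data are made up for illustration, and these rules are not taken from the textbook.

```python
import math
import statistics


def sqrt_rule_bins(n):
    """Square-root rule: number of bins is about sqrt(n)."""
    return math.ceil(math.sqrt(n))


def sturges_bins(n):
    """Sturges' rule: number of bins is ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1


def freedman_diaconis_width(data):
    """Freedman-Diaconis rule: bin width is 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(data) ** (1 / 3)


# Made-up data: the integers 1 through 100
data = list(range(1, 101))

print("square-root rule bins:", sqrt_rule_bins(len(data)))
print("Sturges bins:", sturges_bins(len(data)))
print("Freedman-Diaconis width:", round(freedman_diaconis_width(data), 2))
```

The rules can disagree noticeably on the same data, so they are best treated as a first guess that you then adjust by eye.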

1. X bar is the sample mean, whereas mu is the population mean. They are computed the same way.

2. a. discrete b. continuous c. would be written as discrete

3. improper bin size can obscure the data

4. Why would the data be transformed before changing the bin size?

1. The first refers to the sample mean, that is, the mean of a random sample. It is a sample statistic which estimates a population parameter. In this case, the second symbol (mu) is the population parameter in question, namely the population mean, which is the value of the mean of all observations in the population. Usually, it is tedious and extremely expensive to ascertain the population mean.

2. [a] Discrete – All values are non-negative integers until 100.

[b] Continuous – Length can be any non-negative value and does not have to be an integer value.

[c] Discrete – Age is an integer number in years, unless the question is specific as to the exact time from the moment you were born until now (which can be expressed in nanoseconds), in which case it is continuous.

3. The bin size must not be too large that it encompasses most of the observations or too small that it does not produce any meaningful results. Furthermore, bin size should be consistent across the entire histogram.

4. What are other measures of variability for data sets, since variance/standard deviation can be the same value for very different population distributions (such as Fig 1.17)?

1. X-bar is the average of a sample of the population. Mu is the average of the entire population. X-bar is attempting to approximate mu without collecting data on every member of the population.

2a. Discrete

b. Continuous

c. Discrete

3. The bin size should be determined based on the scale of the data and the number of data points. If you choose a bin size too large or too small, trends may be missed.

4. Why are the formulas for sigma-x (population standard deviation) and Sx (sample standard deviation) different? The difference being dividing by n and n-1, respectively.
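
For reference, the two formulas being compared can be written side by side. The notation here is an assumption rather than the textbook's exact notation: $$N$$ is the population size, $$n$$ is the sample size, and $$\sigma$$ and $$s$$ stand for the population and sample standard deviations.

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})^2}$$

Besides the divisor, the other difference is the center: $$\sigma$$ measures distances from the population mean $$\mu$$, while $$s$$ has to use the sample mean $$\overline{x}$$ in its place.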

1) Both are means, but x-bar is a sample mean while mu is a population mean.

2) a) discrete b) continuous c) discrete (could also be continuous)

3) When constructing a histogram, the choice of a bin size is important because it will help to construct a histogram that is clear to see and understand. If the size is too small but we have a huge data set, then the histogram will be very disorganized. If the size of a bin is too large, then we won’t be able to see the details.

4) When is it a good idea to exclude outliers from the data?

Just a quick note, I can see everyone’s replies to this question. It appears only one of the submitted replies thus far is hidden.

The question answers are as follows:

1. x-bar represents the mean of the given sample, whereas mu represents the mean of the entire population which is represented by the aforementioned given sample.

2.

A) Discrete

B) Continuous (although the actual data taken is limited to a finite range of values since every form of measurement has some resolution which limits its accuracy, thus the actual measurements will be discrete values since the measured lengths can only be the discrete lengths which could possibly be measured with the measurement device).

C) Discrete

3. It’s similar to Goldilocks: if your choice of bin size is too small, you might only have one or two heights for all of your bars, which will not provide you any information about the modes of the distribution. If your choice of bin size is too large, then you may include most of your data in one or two bins, which then does not show the spread, distribution, and skew of the data, and will likely automatically make your distribution appear unimodal. Only by having a good choice of bins somewhere in the middle of these two extremes can Goldilocks view all aspects of the distribution… just right.

4. What is the derivation for standard deviation? Anyone can memorize the procedure to compute it, but why is it meaningful to compute a number in that particular way? What is the reasoning behind its creation via those steps? Also, who came up with multiplying the IQR by 1.5 and why?

Thanks for the heads up on the comments, Curtis. I’ve determined that the plug-in I’m using to keep these comments hidden was making its choices based on IP address. Most of the comments here are coming from the same IP address (129.59.115.1), which is probably some standard VU Wifi address. That meant that anyone using that IP address could see any comments left by other people using that IP address. Not what I intended!

I’ve changed the setting so that it checks your login info, not your IP address, so that should help going forward. Just be sure to log in before leaving your comment.

1. I’m not sure how to get the special characters, so x-bar represents the sample mean and mu represents the population mean. x-bar is a set within the population that should approximate mu. mu is the theoretical mean that would be calculated with 100% of the population.

2. a) discrete

b) continuous

c) discrete

3. The choice in bin size dictates how easily information can be read from a histogram. If the bin size is too big, the information gathered would be too general. If the bin is too small, the graph would be harder to read because histograms are usually used when there is a large data set.

4. None.

1. X is Sample Mean. Mu is population mean.

2. a. Discrete b. Continuous c. Discrete

3. Choosing a bin size too large would decrease the precision of our statistics, since a wide range of values would be lumped into one bin. Choosing a bin size that is too small can cause the natural small variance when taking measurements to factor in too largely to our data.

4. It all made sense, but the formula for variance confused me briefly. Do you really divide by n-1? Why not n?

1) xbar represents the sample mean, while mu represents the average of all observations in the population. xbar is taken from a subset of the population from which mu is taken.

2a) discrete

2b) continuous

2c) discrete

3) The bin size is important in order to find a happy medium between visualizing the data and the accuracy of the data shown.

4) n/a

1. x bar is the mean value of a certain sample. This number is computed by adding together all variables and dividing that value by the number of variables. Mu, the population mean, is computed the same way but it is the average of an entire population.

2. a) discrete. b) continuous. c) discrete.

3. The choice of bin size is important because you want to group the data using a number of bins that shows a proper trend for the histogram. If not, the data may not show an effective shape for the data distribution, which is the purpose of a histogram.

4. How do we know what is the best way to graphically represent a set of given data? Scatter plot, histogram?

Xbar represents the mean, or average, of the sample data. Mu represents the average of the total population. Xbar in theory should be pretty close to mu if enough random data was collected.

Discrete, Continuous, Discrete

If the bin size is too large, patterns and modes in the data could be lumped together and obscured. If the bin size is too small, you might as well use a different type of graph, because a histogram is intended to bring data together to make it easier to read.

I don’t completely understand the book’s explanation of the degrees of standard deviation. Is that the distance of each Xi from the std deviation?

1. x-bar is the sample mean, whereas mu is the population mean. X-bar is used to estimate the population mean, which is difficult to determine because it involves gathering information about every member of the population.

2. a) discrete b) continuous c) discrete if counting only years, continuous if months, days, hours, etc. are included

3. Bin size in a histogram is important because it determines how easily trends are determined from the data. If the size of each bin is too small, there will not be enough data in each bin, meaning there will be a lot of short bars. If the bin size is too large, there will be too much data in each bin, so the histogram will have a few really tall bars. Both of these histograms would make recognizing trends difficult because – depending on the bin size – each bar could end up the same height or there would be too few bars to illustrate a trend.

4. Is there any standard guideline for how to determine the bin size in a histogram?

1) x¯ (ugh, so close) represents the mean of observations in a specific sample, while μ represents the mean of observations of an entire population.

2) Discrete, continuous, discrete

3) large bin sizes could hide important discrepancies in data- small bin sizes could create a convoluted data presentation

4) Better explanation of standard deviation in general- why is it calculated the way it is? Why is n-1 used?

1. x-bar represents the mean of entire sample of data, mu represents the mean of the entire population.

2. a. discrete. b. continuous. c. discrete

3. Choosing an appropriate bin size is important because if the bin is too large or too small the data may be misrepresented. The histogram is used as a means of showing the density of data and an inappropriate bin size may skew this representation and alter the trends seen.

4. When it comes to constructing a histogram, I understand that the more data present in a bin the larger it is. What I am confused on is exactly how you know how to cap the largest bin. Is there a certain value or is there a certain way of going about this?

1. What is the relationship between x̅ and mu?

x̅ is the sample mean. This is the average of the data collected on the sample size. This is usually a fraction of the entire population.

Mu is the population mean. This is the average of all of the data which has been found for a given population. It is calculated in the same way as x̅, except that the sample size (n) will be the size of the entire population.

2. Is the given variable discrete or continuous?

a. This is discrete because it can only take numbers with jumps (integers).

b. This is continuous since the length does not have jumps in its possible values.

c. Assuming that the age is to the exact date and time of their birth, the student’s age would be continuous. However, if the age is only the number of whole years they have been alive, then it would be discrete.

3. When constructing a histogram, why is the choice of bin size important?

The bin size of a histogram determines its effectiveness. A histogram with too large of a bin size will not allow for enough differentiation between data and will hide most of the data inside the bins. A histogram with too small of a bin size will eliminate the usefulness of a histogram by making all of the bins have a very similar, and small, density.

4. In determining bin size, do you just guess and check to see which most successfully portrays your data?
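
Trying a few widths and comparing is easy to do by hand or in code. Here is a minimal Python sketch that bins the same data at three widths and prints a text histogram for each. The ages are made up (two clusters, one near 19 and one near 30), and `bin_counts` and `text_histogram` are hypothetical helper names, not from the textbook.

```python
def bin_counts(data, width, start=0):
    """Count values in each bin [start + k*width, start + (k+1)*width)."""
    counts = {}
    for value in data:
        k = int((value - start) // width)
        counts[k] = counts.get(k, 0) + 1
    return counts


def text_histogram(data, width, start=0):
    """Return one line per bin, with one '*' per observation."""
    counts = bin_counts(data, width, start)
    lines = []
    for k in range(min(counts), max(counts) + 1):
        lo = start + k * width
        lines.append(f"{lo:>4} to {lo + width:>4}: " + "*" * counts.get(k, 0))
    return "\n".join(lines)


# Made-up ages with two clusters (around 19 and around 30)
ages = [18, 18, 19, 19, 19, 20, 20, 21, 22, 29, 30, 30, 31, 31, 32]

for width in (1, 5, 20):
    print(f"bin width = {width}")
    print(text_histogram(ages, width))
    print()
```

With a width of 1 the bars are short and scattered, with a width of 20 everything collapses into two bars, and a width of 5 shows the two clusters with a dip between them.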

1. The sample mean, x bar, helps provide a point estimate of the population mean, μ.

2. a) discrete b) continuous c) discrete

3. Choosing bin sizes too small or too large may make it harder to recognize relevant trends in the data.

4. I do not understand why we have to divide by (n-1) for the sample variance instead of just n. I think there is a concept of bias involved but I am not sure. Help?

1) x is the mean of a sample and mu is the mean of the population. Because sample is a subset of population, mu can be estimated by the value of x.

2) a) discrete b)continuous c)discrete

3) So that enough bars can be constructed to reveal the shape of the histogram.

4) Does mode only apply to histogram?

It appears I answered twice because I forgot to change my display name the first time.

1) What is the relationship between xbar and mu?

>xbar is the sample mean and mu is the population mean. The sample mean is an approximation of the population mean.

2) Is the given variable discrete or continuous?

(a) The number of heads in 100 tosses of a coin.

>discrete

(b) The length of a rod randomly chosen from a day’s production.

>continuous

(c) The age of a randomly chosen Vanderbilt student.

>discrete if measured in years but in theory could be continuous depending on measurement resolution.

3) When constructing a histogram, why is the choice of bin size (that is, the size of the range of values that are placed in a single bin) important?

>As the bin sizes get bigger the resolution you can see gets smaller.

4) What’s one question you have about the reading?

>“The variance is roughly the average squared distance from the mean.”

Why is the word roughly used?

1. Xbar is the sample mean. A sample is a subset of data used to represent an entire population. It is calculated by taking the sum of the numerical value measured and dividing it by the sample size, or n.

Mu is the population mean. It is the mean of the data taken from an entire population. It is calculated the same as Xbar, but contains the entire population to be modeled. A subscript is added to indicate which variable the population mean is referring to.

2.

a. Discrete

b. Continuous

c. Discrete, but continuous if exact age is taken to include months and days.

3. The size of the bin in a histogram is important for interpreting and analyzing the data. If one bin has a smaller size but has the same shape as the others, this can be misconstrued as all the bins having the same size. If the bins are too small, then there is too much data on one graph with no clear relationships. If the bins are too big, then there is not enough data to draw conclusions from. If the bin size is correctly chosen, the data will be easier to analyze and draw conclusions from.

4. What if there is more than one outlier in the data? Would that affect how the data is interpreted?

The sample mean, xBar, is the average of a distribution of data. The population mean, μ, is the average of the whole population. If you get a distribution of the whole population, its xBar is the same as μ.

(a) discrete (b) continuous (c) Depends on the resolution and honesty of the measurement. Given you round it (recommended), it is discrete.

If the size of the bin is too small, many of the bins are not full. This can lead to difficulty interpreting the distribution of the data. If the bins are too large, there is no resolution and the distribution will have large, flat dimensions that give no information.

In what situation do we throw out a point because it is an outlier?

1.) x-bar is the mean of a subset of a population, while mu is the mean of the whole population. The values of x-bar and mu are very close because x-bar is a rough estimate of mu.

2

a)discrete

b)continuous

c)continuous (because there’s always a smaller fraction of time)

3. The bin size is important because it can change the shape of a histogram. If a bin size is too small, it can lead to a misleadingly large number of modes, whereas if it is too large, each bin holds too much information and the histogram is hard to interpret.

4. Is there a formula to determine the bin size to use?

1. xbar is the mean of a sample while mu is the mean of the population.

2. a. discrete

b.continuous

c. discrete

3. The bin size is important in showing the data distribution. A small bin size can make it hard to determine the pattern of the distribution. A large bin size will cluster the data.

4. Does the data have to be normal for 70% of the data to be within one standard deviation of the mean?
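
A quick simulation can show how much the "about 70% within one standard deviation" figure depends on the shape of the distribution. This Python sketch compares simulated normal data with simulated uniform data. The seed, the sample size of 20,000, and the choice of the uniform distribution as a contrast are all arbitrary assumptions for illustration.

```python
import math
import random


def fraction_within_one_sd(data):
    """Fraction of observations within one sample standard deviation of the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return sum(1 for x in data if abs(x - mean) <= sd) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
normal_data = [rng.gauss(0, 1) for _ in range(20000)]
uniform_data = [rng.uniform(0, 1) for _ in range(20000)]

frac_normal = fraction_within_one_sd(normal_data)
frac_uniform = fraction_within_one_sd(uniform_data)

print("normal: ", round(frac_normal, 3))
print("uniform:", round(frac_uniform, 3))
```

The normal data should land near 0.68 and the uniform data near 0.58, so the familiar percentage is a property of bell-shaped data and not a guarantee for every data set.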

1. Both represent the mean, however mu represents the mean of every single member of a population while xbar represents the mean of only a sample within the same population. Xbar is more widely used due to convenience.

2. Discrete: Only receive a whole number of heads

Continuous: A rod can have any measurement because of decimals

Discrete or Continuous: If age were in years, discrete but if age were in a smaller unit such as hours or seconds, anything is possible

3. A bin size shows how efficient it is to read the data. If a bin size is too small for a set of data, many bars will show and data reading will not be efficient. If a bin size is too large, there might be too many data points in a given bin, hiding variations and leading to inefficient data interpretation.

4. What leads to the formula that variance is standard deviation squared?

1. Xbar represents the mean of the data set you are given, while mu represents the actual mean of the population. For example, if you are trying to find the mean height of everyone in Tennessee by picking a random sample of say 1000 people to measure, Xbar is the calculated mean of those 1000 people where mu is the real mean height of everyone in Tennessee.

2. a) Discrete; b) Continuous; c) Continuous. Time in general is measured continuously. Age in years, however, would be discrete.

3) It is important because sizes of bins represent accuracy of data. Not only that, a wrong size bin, be it too big or too small for an application, can represent the data in an inapplicable way. For example, if I were comparing sizes of TVs bought in the U.S. on a yearly basis like we did in class, I wouldn’t make a bin for every TV size down to the millimeter; that’s just ridiculous. They only really make TVs on the inch, or close to it. They’re advertised that way anyway, so 999/1000 of your bins would be empty. On the other hand, you wouldn’t make bins that represented five year increments because then you just wouldn’t be able to compare annual statistics. You couldn’t tell if a TV bought within a certain bin was from say 1990 or 1995.

4) When computing the variance, why do we divide by n-1 as opposed to just n like with the mean?

1) x̅ represents the mean of a sample of a population, whereas μ represents the true mean of the population, which should ideally be closely approximated by x̅.

2) a) Discrete

b) Continuous

c) It depends on how you measure age. Technically age is a continuous variable, but it is typically marked as a discrete integer value.

3) Histogram bin size determines the resolution of the data visualization. A smaller bin size (more bins) will give a higher resolution for the histogram and more clearly displays the data density.

4) Where do the formulas for variance and standard deviation come from? How are they derived?

1. X bar and mu are both calculated in the same way (which is by taking the average) but X bar represents the sample mean while mu the population mean.

2. a) Discrete. b) Continuous. c) Discrete if measured in years.

3. The size of bins is important because if it is too large, it will be difficult to effectively analyze the data. If too small, we will lose the purpose of creating a histogram, which is to simplify the analysis of larger samples of data.

4. What does it imply when it says that median and IQR are robust while mean and std deviation are not?
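
In practice, "robust" means the statistic barely moves when a few extreme values are added or changed. A small Python sketch can make this concrete by replacing one value in a data set with an extreme one and recomputing all four summaries. The data are made up, `iqr` is a hypothetical helper, and the "inclusive" quartile method is one of several conventions, so the book's quartiles may differ slightly.

```python
import statistics


def iqr(data):
    """Interquartile range, using the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


# Made-up data; the second list replaces the largest value with an extreme one
original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

for name, data in (("original", original), ("with outlier", with_outlier)):
    print(
        name,
        "| mean:", statistics.mean(data),
        "| median:", statistics.median(data),
        "| std dev:", round(statistics.stdev(data), 2),
        "| IQR:", iqr(data),
    )
```

The median and IQR are identical for both lists, while the mean jumps from 5 to 104 and the standard deviation grows enormously.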

1. x-bar is the sample mean (the average) while µ is the population mean

2a. Discrete

2b. Continuous

2c. Discrete

3. Histograms are used because it is easier to see certain characteristics in the data (such as skewness, modes, etc.). If the bin sizes are too small or too large the data won’t be able to be read as easily.

4. Which do we care about more: sample standard deviation or population standard deviation?

1. x and µ are related in that they are both means, but x is the sample mean, while µ is the population mean. The difference is that a sample mean is the mean (or average) of a specific sample of a population, while the population mean would be the mean of the entire population.

2. a) Discrete

b) Continuous

c) Discrete

3. Histograms show data density and variation in the amount of data across different data ranges. If too big of a bin size is chosen, there will not be much variation observed; there may simply be one bin with lots of data and the rest may not have much data at all. If too small a bin size is chosen, the data may fluctuate up and down from one bin to the next, and the histogram may not be showing what it is intended to show (at least not very clearly). So, a bin size somewhere in between should be chosen so that the variation of data can be seen clearly.

4. Is there some kind of rule so you know if the data is well-centered (I guess that’s the best way to put it)? For example could you compare the standard deviation to the mean or the range in some way? Which one would be better to compare to the standard deviation?

xbar represents the average of a specific set of cases in a population, while mu represents the average of all the cases in the population. xbar can cover the whole population, but doesn’t necessarily.

Coin flips are discrete. It takes numerical values in jumps.

The length of a rod is continuous. It can be measured in non-integer values.

Age is tricky, but should be defined as continuous, as time flows continuously, we just like to round.

With histograms, bin size is important, because it separates data into similar sets, without refining it too much. The idea is to get an idea of a trend within large sample sets. Too small of a bin size, and you essentially have a dot plot. Too large of a bin size, and you have no real separation between unrelated data points.

My question refers to the unimodal, bimodal, and multimodal examples. To me, the unimodal diagram could just as easily be described as bimodal. Yes, it only differs from its neighboring bins by a few observations, but it’s comparing an 11 to a 7. Is that not significant?

1) x-bar and mu are similar in that they both represent the mean value, but x-bar specifically represents the mean of a given sample whereas mu represents the mean of an entire population. This means x-bar should approximate mu but with some error involved.

2a) Discrete

2b) Continuous

2c) Discrete assuming it is being measured in integer units of time (ie, 19 years)

3) A large bin size will not show you the skewness of the data (too general) but a small bin size will not give you an accurate representation of the modes of distribution.

4) A lot of statistical analysis seems relative (proper bin size, what is a true outlier and what belongs with the other data, etc). Is there any concrete way to determine these, or is it always open for discussion?

1. X_bar denotes the sample mean while mu denotes the population mean. Both variables signify the sum of all observations divided by the number of observations, but x_bar specifically utilizes only those observations which are in a given sample while mu utilizes every observation in an entire population.

2. (A) Discrete (can’t have 0.5 of a head)

(B) Continuous (length can be any size)

(C) Discrete (assuming we cannot say that one student is 20.2 years old while another is 20.1 years old).

3. Bin size will determine how many bars you get on the histogram. Choosing a bin size that is too “inclusive” of some points will result in a histogram with just a few bars. A lot of information can be lost in this manner. It is important to choose a size that will ultimately allow for the data visualization of the results in the clearest manner and with more bars.

4. My question isn’t so much about the reading as it is about question 2 in this assignment. Question 2, part C, asks if the ages of randomly chosen Vanderbilt students are continuous or if they are discrete. Wouldn’t it be continuous if we allowed one student to say they are 21.1 years old and another student to say they are 21.2 years old? But wouldn’t it be discrete if we simply rounded the ages to the nearest whole number? This question really needs more information to be sure what the right answer is. How do we address problems like this in the future?

1. Xbar is the average of the data in a specific sample size while mu is the average of the data in the entire population.

2. a) Discrete

b) Continuous

c) Discrete (if measured in years)

3. If the bin size is too large, trends and observations could be grouped together and missed. Since the bin size represents the data density, larger bins group many data points together that may need to be separated in order to see overall patterns.

4. How does one assign a bin size for a histogram that will accurately show the data?

1.

x bar refers to the sample mean, and it represents the average of the observations in a given sample of the population. The greek letter mu refers to the population mean, and it differs from x bar in that it represents the average in an entire population, not just a sample.

2.

a) Discrete

b) Continuous

c) Discrete if measured in purely years, Continuous if age is just measured by time

3.

Bin size is important in order to accurately show the distribution of large samples of data. If a bin size is too large then there may be too many data points within a single bin to properly show any kind of distribution, thereby defeating the purpose of making the histogram. A bin size that is too small may also defeat the purpose of a histogram because there would be no clear presentation of the data, since such small ranges result in an overflow of bars that are close to the same height.

4.

The book says box plots are used while also plotting “unusual” observations. What kind of unusual observations are they referring to, and are these box plots necessary for any time the “unusual” criteria is met?

1. xbar is the mean of the sample, while mu is the mean of the population

2. a) discrete

b) continuous

c) discrete if you are measuring age in years; continuous if you are measuring age exactly

3. Bin size is important because if you make the bin size too large, you may not be able to see the distribution of the data very well or see the peak points. If the bin size is too small, the graph may be too wide to be able to easily view the data distribution.

4. Why do the whiskers of a box plot extend to 1.5 * IQR?
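
The 1.5 multiplier is a convention, commonly attributed to John Tukey, and not something derived from a formula. How the rule is applied can still be sketched in a few lines of Python. The data below are made up, `tukey_fences` is a hypothetical helper name, and the "inclusive" quartile method is an assumption, since quartile conventions vary between textbooks and software.

```python
import statistics


def tukey_fences(data, k=1.5):
    """Return (lower, upper) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up data with one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

lower, upper = tukey_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print("lower cutoff:", lower)
print("upper cutoff:", upper)
print("plotted as individual points:", outliers)
```

Each whisker stops at the most extreme observation that is still inside these cutoffs, and anything beyond them (here, only the 40) is drawn as a separate point.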

1. $$\overline{x}$$ is the mean of the sample while $$\mu$$ is the mean of the whole population.

2. a) discrete

b) continuous

c) discrete

3. If the bins are too large, the data becomes meaningless, all the information could be grouped to one side of the bin, and this becomes more significant the wider they become

4. none

1. $$\overline{x}$$ is the mean of the sample whereas $$\mu$$ refers to the mean of the population as a whole. $$\overline{x}$$ is a good approximation of $$\mu$$ if a good (random) sampling is used, but may differ widely if bad sampling techniques are used.

(2a) The number of heads in 100 tosses of a coin. — discrete

(2b) The length of a rod randomly chosen from a day’s production. — continuous

(2c) The age of a randomly chosen Vanderbilt student. — can be continuous, but often taken as a discrete number

3. A bin size too large will place most of the values into a few bins, making it hard to see any trends. A bin size too small will place only a few items in each bin, spreading the data out and making it similarly hard to see trends.

4. What’s one question you have about the reading?

This chapter is pretty basic material, so I didn’t really have many questions. The one item I hadn’t seen was the limit on the whisker length on a box and whisker graph. It makes sense for noting outliers, though I do wonder a little why the threshold of 1.5 was selected.
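On the 1.5 * IQR question raised in a couple of comments above, here is a minimal sketch of how the whisker limits and "unusual" points can be computed. The data are made up, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; the textbook may compute quartiles slightly differently. The whiskers themselves are drawn to the most extreme observations that still fall inside these limits.

```python
import statistics

def whisker_limits(data, k=1.5):
    """Return (lower_limit, upper_limit, outliers) using the k * IQR rule."""
    # Quartiles under the "inclusive" convention (an assumption, see above).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Anything outside [lower, upper] is plotted as an individual point.
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers

# Hypothetical data: nine ordinary values and one extreme value.
prices = [12, 14, 15, 15, 16, 17, 18, 19, 20, 61]
lower, upper, outliers = whisker_limits(prices)
print(f"limits: [{lower}, {upper}], flagged as unusual: {outliers}")
```

The value 1.5 is a convention rather than something derived from the data, which is why `k` is left as a parameter here.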

1. Mu is the average of observations for the total population, while x-bar is the average of observations for a sample of the total population.

2. (a) discrete (b) continuous (c) continuous

3. Bin size is important because the larger the bin, the more general the data is, so it determines the precision of your data analysis.

4. Can you explain Figure 1.17? I would like another step by step of how those graphs are similar and different.

1. x̄ is the sample mean and is calculated as the sum of all the observations divided by the number of observations. µ is the population mean and is calculated the same way but includes all observations in the population as opposed to a sample/part of the population.

2a. discrete. 2b. continuous. 2c. continuous

3. Bin size is important because it must be set such that the values allocated to the various bins are easily seen. If the bin size is too large or too small, few significant differences can be identified in the histogram.

4. while the median is more useful than the mean in extreme observations, which is more useful in non-extreme observations and why?
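A small sketch that may help with the mean-versus-median question above. The numbers are made up. It only illustrates that the two summaries roughly agree when there are no extreme observations and move apart when one is added; it does not settle which is "better" in general.

```python
import statistics

typical = [10, 11, 12, 13, 14]   # hypothetical, roughly symmetric data
with_outlier = typical + [100]   # same data plus one extreme value

mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(f"no extreme value:   mean={mean_typical}, median={median_typical}")
print(f"with extreme value: mean={mean_outlier:.2f}, median={median_outlier}")
```

Without the extreme value the two numbers coincide, so either one describes the center; with it, the mean is pulled toward the extreme value while the median barely moves.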

This is a test comment. I’m leaving it without being logged in.

1. They both represent the mean of a certain population

2. A. discrete, B. discrete, C. discrete

3. Because it determines the precision of the histogram

4. What are histograms most used for?

A)

Both x-bar and mu are average values of a distribution of data. However, mu represents the average of all observations in the population, whereas x-bar represents the average of only a sample of the data (not all the observations).

B)

1) discrete

2) continuous

3) continuous (we are picking randomly; moreover, age is measured in time)

C)

Because the histogram shows the density of the data, it is very important to choose the right bin size. A poorly chosen bin size can mislead viewers. For example, take the car example in the book. If the bins were large, viewers might think cars are cheap, because they would see all the cars under 40K in one bin and the cars above 40K in another, and on comparing the two bins they would assume that cars are cheap. When the bins are too small, the histogram turns into a dot plot, which is not good for large data sets. We would also lose the density of the data, which is supposed to be an advantage of the histogram. Besides that, the plot would be too detailed, which would annoy viewers.

D)

I have some trouble understanding the dependence of variables.

1. x(bar) is the sample mean, whereas ‘mu’ is the population mean. The sample mean will be a good estimate of the population mean, but not necessarily exactly equal to it.

2. (a) discrete – the number of heads/tails must be a positive integer, no fractions

(b) continuous – the length of a rod could be at any length, even if the length is so small it cannot be measured by standard approaches

(c) discrete (probably) – I say that I am 22 years old, not 22.06872 etc. Although, it could be calculated like this.

3. It is important to choose bins that have a ‘natural’ size for the data set. For example, it may be good for car prices to have an ‘under 10k’ range, as that is a category cars are usually sold in. Using these bins will be better for the reader, bringing more meaning and making the histogram easier to read.

4. How much money does something like the US census cost? Is there a good scale for the cost of a sample of X people?

1. The variable x_bar represents the mean or average of the sample. This is from the data we have taken. The variable mu is the population mean, representing the predicted mean outcome of the entire group on which the sample is based.

2. a) Discrete

b) Continuous

c) Discrete

3. The bin is what holds the different values of the data. The greater the bin size, the more values it accounts for and the more general your histogram. If the bin size is smaller, it holds more specific values.

4. How does one find the mu value if for most statistics it is nearly impossible to have an “infinite” amount of data points?
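One way to picture the relationship between x-bar and mu discussed in these answers is a small simulation. This is only a sketch under made-up assumptions: the "population" is 10,000 invented ages drawn uniformly between 17 and 24, so here mu can be computed directly, which is usually not possible in practice.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical finite population: 10,000 made-up ages between 17 and 24.
population = [random.uniform(17, 24) for _ in range(10_000)]
mu = statistics.fmean(population)   # population mean: uses every observation

# x-bar: the mean of a simple random sample of 100 from that population.
sample = random.sample(population, 100)
x_bar = statistics.fmean(sample)

print(f"mu = {mu:.3f}, x_bar = {x_bar:.3f}, difference = {x_bar - mu:+.3f}")
```

In a real study only `x_bar` is available, and it is used as an estimate of `mu`; rerunning with a different seed or sample size shows how much that estimate moves around.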

1. X bar is the mean of a sample, while mu represents the average of all observations in the population. In most cases, mu is difficult to get. Therefore people usually use x bar to estimate the value of mu with some reasonable error.

2. Discrete

Continuous

The third one depends on the unit used. If age is counted in years, then it is discrete. If it is measured in seconds, then it is continuous.

3. The bin size determines how accurately the data are described. If the bin size is too large, the error can be very large. If the bin size is too small, the advantage of histograms may not be obvious and histograms become meaningless.

4. How can we determine the best bin size?
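Several comments ask how to pick a bin size. There is no single correct answer, but two widely used rules of thumb are Sturges' rule (for the number of bins) and the Freedman-Diaconis rule (for the bin width). The sketch below is not from the textbook; the sample is made up, and the quartile convention (`method="inclusive"`) is an assumption.

```python
import math
import statistics

def sturges_bin_count(n):
    """Sturges' rule: number of bins = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_width(data):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(data) ** (1 / 3)

data = list(range(1, 101))  # hypothetical sample: the integers 1..100

print("Sturges bin count:", sturges_bin_count(len(data)))
print("Freedman-Diaconis width:", round(freedman_diaconis_width(data), 2))
```

Both rules are starting points; it is still worth trying a few bin sizes and looking at how the shape of the histogram changes.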

1. x-bar is the sample mean of a numeric variable and mu represents the average of all observations in the population.

2. a.) discrete b.) continuous c.) discrete

3. Choosing an appropriate bin range is important because selecting too large of a bin size might hide variation in the data and bunch it up. Choosing too small of a bin size might have the opposite effect, causing the data to be spread out and become hard to interpret.

4. what is the “dividing line” between associated and independent variables? How can you tell that slight association is not just coincidental?

1. xbar is used to represent the sample mean and mu is used to represent the mean of the entire population.

2. a. discrete

b. continuous

c. discrete if counting by years

3. The bin size determines how accurately the data are represented. If the bin is too large or too small, the histogram becomes difficult to understand.

4. So how exactly do you determine the proper bin size to use?

1. x(bar) represents the average for the sample dataset, whereas mu represents the whole population’s average.

2. a) discrete b) continuous c) discrete (could be continuous, but I think there are about 6000 students in the integer range 17-24)

3. If a bin is too large, it will have all the values in it, and our histogram doesn’t show us anything (or the values in the bin are too varied to be useful), but if it is too small, then we have only one or two items in each bin, and we again get no useful pattern from the histogram.

4. I don’t fully understand standard deviations with tails or skew.

What is the relationship between $$\overline{x}$$ and $$\mu$$?

Is the given variable discrete or continuous? (a) The number of heads in 100 tosses of a coin. (b) The length of a rod randomly chosen from a day’s production. (c) The age of a randomly chosen Vanderbilt student.

When constructing a histogram, why is the choice of bin size (that is, the size of the range of values that are placed in a single bin) important?

What’s one question you have about the reading?

1) The former is the mean (average) of a set of data (all the cases of a sample), while the latter is the population mean, which is the average of all observations of a population.

2) A – discrete

B – continuous

C – Depends on how you look at age: If you mean age in terms of “years old” as most people tend to describe themselves, then it’s discrete, but one could define age as the total amount of time a person has been alive since birth, or in some other way which would break down age into smaller parts than just integers of years (including months or even days in their stated age, for example), in which case it would be continuous. I don’t think the latter was the intent, but I thought I should be specific/clarify just in case.

3) Bin size will affect how closely the histogram represents the distribution of data. If a data set has 1000 cases, all of which fall within the range 5-10 on the horizontal axis, then choosing a bin size of 10 is foolish: all the data will be lumped into just 1 or 2 columns, depending on where one places the bin boundaries. But if the bins were instead chosen to be of size 0.1, the data might end up revealing a bimodal, right-skewed distribution.

4) Could you clarify the idea of population mean? It says it’s the average of “all observations in a population” but the way it’s discussed makes it sound like it MEANS it’s the “true average” of the population (i.e., if you could take a sample with size equal to the total population you’re examining, then the first mean discussed (x-bar) would equal the population mean (mu)). Is this what is meant or is it different?
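The bin-size example in answer 3 of the comment above (1000 cases between 5 and 10, binned with width 10 versus width 0.1) can be tried out directly. This is a sketch with invented data: the two clusters and their sizes are assumptions chosen only to produce a two-peaked shape, and `bin_counts` is a hypothetical helper, not a library function.

```python
import math
import random
from collections import Counter

def bin_counts(data, width, start=0.0):
    """Count values per bin, where bin k covers [start + k*width, start + (k+1)*width)."""
    # Note: values lying exactly on a bin edge may land on either side
    # because of floating-point rounding.
    counts = Counter(math.floor((x - start) / width) for x in data)
    return dict(sorted(counts.items()))

def clamp(x, lo=5.0, hi=9.999):
    """Keep every made-up value inside the range 5-10."""
    return min(max(x, lo), hi)

random.seed(1)
# Hypothetical data: 700 values clustered near 6 and 300 clustered near 8.5.
data = [clamp(random.gauss(6.0, 0.3)) for _ in range(700)]
data += [clamp(random.gauss(8.5, 0.4)) for _ in range(300)]

wide = bin_counts(data, width=10)     # one bin swallows everything
narrow = bin_counts(data, width=0.1)  # many bins, two peaks become visible

print("width 10: ", wide)
print("width 0.1:", narrow)
```

With width 10 the output is a single count of 1000, which says nothing about shape; with width 0.1 the counts rise and fall twice, once around each cluster.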

1. What is the relationship between $$\overline{x}$$ and $$\mu$$?

Xbar is the mean of the sample, and mu is the population mean. Generally a sample mean is not exactly equal to the population mean, as a sample is only a part of the population.

2. Is the given variable discrete or continuous? (a) The number of heads in 100 tosses of a coin. (b) The length of a rod randomly chosen from a day’s production. (c) The age of a randomly chosen Vanderbilt student.

(a) Discrete (b) Continuous (c) Continuous, or Discrete, depending on how age is defined.

3. When constructing a histogram, why is the choice of bin size (that is, the size of the range of values that are placed in a single bin) important?

A correct choice of bin size is important to arrive at a correct representation of the population. A bin size too small could not take into account enough variation, but a bin size too large may have too many factors affecting the results of the analysis.

4. What’s one question you have about the reading?

No real specific question. Maybe a more thorough explanation of standard deviation?

1. x-bar is the mean of the sample data, while mu is the population mean. In certain cases, depending on how representative of the entire population the sample data is, the sample mean can be used as a valid estimate for the population mean.

2. (a) Discrete

(b) Continuous

(c) Discrete (if in years only), Continuous (if months and days are expressed as decimals)

3. The size of the bin is important because it is directly related to how much detail and information is contained in your histogram. With bins that are too large, details such as how skewed the data is may be lost. With bins that are too small, the histogram becomes cluttered and may contain redundant information between bins.

4. Apart from presenting an easy way to view how data is skewed, is there any other use for dot plots?

Xbar is sample mean. Mu is the population average.

Discrete, Continuous, Discrete

If the bin size is too large, trends and patterns in the data would be hard to visualize. If the bin size is too small, a single bin wouldn’t have enough data to visualize.

Xbar is sample mean. Mu is the population average.

Discrete, Continuous, Discrete

If the bin size is too large, trends and patterns in the data would be hard to visualize, since they are joined together in the same bar. If the bin size is too small, some bars would have no data, which makes the graph hard to read.

1) Xbar is the sample mean while mu is the population mean. With large populations it is difficult or impossible to find the true population mean, so a sample is chosen, and xbar is supposed to represent the population mean through that sample. The two values will still, more likely than not, be different.

2) a) discrete

b) continuous

c) continuous

3) Because if you choose an inappropriate bin size, then the histogram could give an inaccurate portrayal of the data, e.g. show extreme skew when it is not necessarily there

4) is there a method to choosing an appropriate bin size?

1. x bar represents the sample mean, which is what we normally think of as the average (it spans all of the data collected). Mu is the population mean; it is the mean over a specific population within the entire sample (i.e. the average price of all red cars in a collection of cars being studied).

2. (a) discrete, (b) continuous, (c) discrete

3. The size of the bins determines the resolution at which you can see the shape of the data distribution. If there is only one bin, then you’ll have one rectangle in your graph and won’t be able to infer anything about the data. With more bins, you should have better resolution. However, if you have too many bins, then your graph might have large empty spaces in it, which would also make it difficult to draw conclusions about the data. You want a bin size large enough that you don’t have too many empty bins, but small enough that you can still make out the shape of your distribution.

4. On page 3 the book states:

Proportion who died in the treatment group: 41/733 = 0.056.

Proportion who died in the control group: 60/742 = 0.081.

“The death rate was 2.5% lower in the treatment group.”

The above sentence confused me because I normally think of percent difference as difference/original * 100, and .081 is much more than 2.5% greater than .056. In this case the book seems to be using % as a unit. I think I understand this, but it was a little bit confusing.
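The two readings of "2.5% lower" in the comment above can be checked with the numbers quoted from the book. A short sketch; the variable names are my own:

```python
treatment_deaths, treatment_total = 41, 733
control_deaths, control_total = 60, 742

p_treatment = treatment_deaths / treatment_total   # about 0.056
p_control = control_deaths / control_total         # about 0.081

# Difference in percentage points: subtract the two rates directly.
point_diff = (p_control - p_treatment) * 100

# Relative difference: the same gap as a percentage of the control rate.
relative_diff = (p_control - p_treatment) / p_control * 100

print(f"difference in percentage points: {point_diff:.1f}")
print(f"relative difference vs control:  {relative_diff:.1f}%")
```

The book's "2.5%" matches the first quantity (percentage points), which is the "% as a unit" reading; the relative difference is a much larger number.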

I really didn’t know that we had homework due last Friday. I just checked this blog and realized that I missed a homework. My bad, and I promise it won’t happen again.

1) X is the sample mean and u is the population mean

2a) Discrete

b) Continuous

c) Discrete

3) To make the data easy to visualize

1. Bar x is the mean selling price and mu is the population mean. They are computed in the same way.

2. a) discrete b) continuous c) discrete

3. If the range of the bin is too large, then it would be hard to distinguish and gather data from the relative bin sizes. If the bin ranges are too small, then you pretty much have a bar graph. The ranges need to be proportional to the total so that the data density can be seen clearly.

4. The quartile system confuses me.

1. mu generally represents the population mean while xbar generally represents the sample mean.

2. [a] discrete [b] continuous [c] discrete as we perceive age (but time is continuous)

3. It is important because it will be a determining factor in data density. Too small of a bin and you will have too many data bars, which will make interpreting the histogram difficult. Too large of a bin and you risk skewing the data to represent too few values. Changing the bin size can skew the data, so it’s important to choose it wisely.

4. Is there a standard that is used for bin size?