statistics Archives - rweber.net

Math(ish) quotations

Rebecca — Mon, 02 Jun 2014 12:00:30 +0000

Far from carefully vetted for accuracy.

We must be careful not to confuse data with the abstractions we use to analyze them.
William James

A judicious man looks on statistics not to get knowledge, but to save himself from having ignorance foisted on him.
Thomas Carlyle

After all, facts are facts, and although we may quote one to another with a chuckle the words of the Wise Statesman, “Lies – damned lies – and statistics,” still there are some easy figures the simplest must understand, and the astutest cannot wriggle out of.
Leonard Courtney

The general who wins the battle makes many calculations in his temple before the battle is fought. The general who loses makes but few calculations beforehand.
Sun Tzu

Like the ski resort of girls looking for husbands and husbands looking for girls, the situation is not as symmetrical as it might seem.
Alan McKay (via the unix program fortune, as I recall)

The final mystery is oneself… Who can calculate the orbit of his own soul?
Oscar Wilde

He uses statistics as a drunken man uses lamp-posts… for support rather than illumination.
Andrew Lang

How dare we speak of the laws of chance? Is not chance the antithesis of all law?
Joseph Bertrand

What would life be without arithmetic, but a scene of horrors?
Rev. Sydney Smith

The post Math(ish) quotations appeared first on rweber.net.

Excel for basic statistics

Rebecca — Mon, 28 Apr 2014 12:00:13 +0000

This is a page I made for my statistics students in Spring of 2010. I can’t guarantee it’s perfectly accurate with a current version of Excel because I don’t have Excel on my current computer, but I assume it should still be very close.

In all of the below, “data” (“x-data”, “y-data”) stands in for giving an explicit cell range as reference, such as A2:A41. These commands are not case-sensitive; =min(A1:A10) will give the same result as =Min(a1:a10) and =MIN(A1:a10).

Average: the arithmetic mean may be obtained by the command =average(data).
–
Bins: see frequency.
–
Cell referencing: see standard units.
–
Chi-squared test: The Excel function for the chi-squared test actually returns the P-value, as a decimal. If you are testing a distribution, your actual and expected values must each be in a single row or column. If you are testing for independence of a variable with m values from a variable with n values, your actual and expected values must be in m by n tables where corresponding positions in each table contain values for the same pair of values (i.e., if you labeled the rows and columns, the labels would be identical between the two tables). Excel needs this formatting to determine the correct number of degrees of freedom. The command itself is =CHITEST(observed frequencies, expected frequencies). Example: if you had observed data in the first 5 cells of the first column, and expected frequencies next to them, the command would be =chitest(A1:A5,B1:B5). If you had a 3×3 table where the observed values were in the second, third, and fourth rows of columns B-D and the expected values were in the same rows of F-H, the command would be =chitest(B2:D4,F2:H4).
–
Continuing series of values: See also standard units. If you would like a column of evenly-spaced values, such as for frequency bins, Excel can fill them in for you. Put the first two values in and highlight those two cells. Click on the lower right-hand corner and drag downward; Excel will continue the pattern you began. There should be a little yellow tag that tells you the value that will be in the current bottommost highlighted cell so you know where to stop. This is very convenient but be careful because it will also overwrite data – if you highlight one cell in a column of values and drag the lower right corner, it will replace the cells you highlight with that original value.
–
Correlation coefficient: To compute the correlation coefficient in Excel, your data must be in columns (or rows) and the points must be in the correct corresponding order for variables x and y. The command is =correl(x-data, y-data). The data need not be in standard units; it will not change the outcome if it is.
–
Frequency: See also continuing series of values. If you originally have just a list of outcomes and you need to count the appearences of each value or range of values, you want the frequency command. For example, if your data is heights of plants that range from 1 to 5 inches, you might want to count inch-wide ranges: 1-2, 2-3, 3-4, and 4-5. The first step is to make a list of range delimiters called bins. These will be the top end of each range; here you would want 2, 3, 4, and 5. Excel automatically includes the top value, so these bins will give you the half-open intervals (-infinity, 2], (2, 3], (3, 4], (4, 5]. (Note that if your data only takes on certain specific values this means the counting will be appropriate if you use those values as your bins: if instead of measurements ranging from 1 to 5 you only got the integer values 1 through 5, you would use bins 1, 2, 3, 4, 5 to count them.) Highlight a column of blank cells as tall as your column of bins, type
=frequency(data range, bins range)

and do the special array-function version of “enter”. On a Mac this is command-enter (⌘-enter) and on Windows it is ctrl-shift-enter. For example, if you have 5 bins in cells B2 to B6, and your data is in A1 to A50, it makes the most sense (for later histogram creation) to highlight C2 to C6. The command is then
=frequency(A1:A50,B2:B6) (⌘-enter / ctrl-shift-enter).
–
Highlight-and-drag: see continuing series of values and standard units.
–
Histograms: See also frequency. To make a histogram, create one column of x-axis values (events or outcomes) and make the next column to the right y-axis values (frequencies or percentages). Leave the cell above the left column blank and put the title of the histogram in the cell above the right column. Highlight both columns including the top row described in the previous sentence, and click the “charts” button or option in the “insert” drop-down menu (depending on your version of Excel); follow the instructions. If you do not have the blank upper left cell, Excel will interpret your columns as two paired variables and give you a chart with pairs of columns. If you want to change the color or separation of the columns, double-click on one of them once the chart has been made.
–
Maximum: the largest value of a data set may be obtained by the command =max(data).
–
Mean: see average.
–
Median: the median may be obtained by the command =median(data) or =quartile(data, 2).
–
Minimum: the smallest value of a data set may be obtained by the command =min(data).
–
Mode: the mode may be obtained by the command =mode(data). Note that in Excel only the value which reaches maximum count first is reported as the mode, so the data set {2,2,2,3,3,3} will have mode 2 and {3,3,2,3,2,2} will have mode 3 when in fact it is the same data set, bimodal with modes 2 and 3.
–
Quartile: the quartiles of a data set, the values marking the 25th percentile, 50th percentile (i.e., median), and 75th percentile, may be obtained respectively with the commands =quartile(data, 1), =quartile(data, 2), and =quartile(data, 3).
–
r: see correlation coefficient.
–
Range: see minimum and maximum.
–
Regression line: Once you have created a scatter plot, click the dots to highlight them, and go to the “charts” drop-down menu. There will be an option reading “Add trendline”, and you want a linear trendline/regression.
–
Scatter plot: Have your data in two columns, with the x-axis on the left and the y-axis on the right, corresponding values in side-by-side cells. Highlight the columns (the top row can be the first data point, or it can be labels), click the “charts” button or option in the “insert” menu (depending on how your Excel looks), and choose the option labeled “XY (Scatter)”.
I do not know any easier way to swap which variable is on the x-axis and which on the y-axis than to swap your data columns. If you have variables A and B and want to be able to plot with each one playing the role of x, it might be easiest to make three columns, A, B, and A again, and then highlight either the first two (putting A on the x-axis) or the second two (putting B on the x-axis) to make the scatter plot.
–
Standard deviation: Currently we are only using what is called the population standard deviation, =stdevp(data).
–
Standard units: See also continuing series of values. If you want to put all the data into standard units, there are two ways to take advantage of Excel’s highlight-and-drag feature (see continuing series of values). If your mean is in B1, SD in B2, data in column A, and you simply put “=(A1-B1)/B2” when you drag Excel will change not only A1 but B1 and B2. Instead, you can either replace B1 and B2 with the numerical values of the mean and SD, or, if those values are complicated, you can put “=(A1-$B$1)/$B$2”. The dollar signs tell Excel not to change those reference cells. In either case when you highlight that cell and drag the bottom right corner downward, A1 will change but the rest will not.
–

The post Excel for basic statistics appeared first on rweber.net.

Hazards of correlation and regression

Rebecca — Thu, 24 Apr 2014 12:00:17 +0000

1. Drawing causation conclusions
Ski and snowboard sales tend to rise and fall together, but sales of one don’t lead to sales of the other – they are both tied to an outside factor. Throughout grade school, mathematical skills correlate positively to height. Height doesn’t make you good at math, but older children are on average both taller and better at math than younger children.

2. Ecological correlation
I.e., averaging subsets of data and then correlating the averages (such as comparing population traits after averaging by nation). This can overstate correlation because outliers have been averaged away before the correlation happens.

3. Regression fallacy
If you take data on a population twice and the correlation between the two data points is not 1, individuals with exceptionally high measurements in round 1 will tend to decrease in round 2 and those with exceptionally low measurements in round 1 will tend to increase (“regression toward the mean”). The fallacy is in attributing this to anything other than plain mathematics. One way to think about it is that if you are in the top ranks in the first measurement and you don’t have the same result on the second, there’s a lot more room to move down than up.

4. Application problems
Applying a technique to a data set that doesn’t meet the criteria for that technique to be applicable will at worst completely lie to you about the data. Choosing the wrong variables to compare might not give you any mathematical errors, but it could mask what’s “really going on.” For example, perimeter and area of rectangles are strongly correlated, but the real story is about each of those relating to length and width.

The post Hazards of correlation and regression appeared first on rweber.net.

Pieces of Posts

Rebecca — Thu, 20 Mar 2014 12:00:13 +0000

From the Editor in Notices of the AMS Vol 43, Issue 10 (October 1996) was an essay on the need for mathematical literacy in the general public, beginning with Nixon’s use of the third derivative.

A beginner’s guide to quadric surfaces (pdf), by me, for my calculus students and intended to supplement the textbook rather than be used as a replacement.

I found the first paragraph of this article on seat belt use to be a great example of awkward statistics writing.

Chapter 23 of The Art of Public Speaking by Carnegie and Esenwein is a thorough exploration of logic in debate.

The post Pieces of Posts appeared first on rweber.net.

Risk and Statistics

Rebecca — Thu, 06 Mar 2014 13:00:26 +0000

The risks associated with medicines and medical procedures are rarely conveyed clearly. I remember learning about intra-uterine devices in high school health class and the fact that uterine puncture was a risk with them, which sounds awful and terrifying. However, there are two vital pieces of information omitted in that statement: first, how high is the risk? 1 in 1,000 is very different from 1 in 1,000,000. Second, what are the consequences? “Antibiotics and possible hospitalization” is very different from “hysterectomy and possible death.”

You can turn this lack of clarity to your advantage if you are an evil marketer, pitching either in favor of or against the product. Here’s an exercise:

Suppose a medication for condition X has the side effect of raising a person’s risk for disease Y. On average, the risk increases from a 1% chance to a 3% chance. Represent this data in a way that
a) makes it seem like a terrible side effect,
b) makes it look like a completely dismissible side effect, and
c) gives full disclosure on the risk.

The post Risk and Statistics appeared first on rweber.net.

Mathlinks: Statistics Edition

Rebecca — Mon, 09 Dec 2013 13:00:06 +0000

I had enough links from teaching statistics to make their own list. Actually several lists.

Textbooks and other comprehensive approaches

Online Statistics Education, an online openly licensed statistics text with labs; outgrowth of the Rice Virtual Lab in Statistics
StatLib, a statistics community database
The Little Handout of Statistical Practice, online textbook

Sources for statistics

Guide to sources of statistics from the US Census Bureau
UNdata, international statistical data
Facebook demographics vs. US demographics (April 2010)

Presentation of statistical information

Graphic Sociology: evaluating visual presentation of data
Information is Beautiful: data journalism in the form of visualization
ChanceWiki for Chance News, reviewing news stories involving probability or statistics
Junk Charts and Numbers Rule Your World, discussion of visual and written communication of statistical topics, respectively
Unclear statistics explanation in a news article

Et cetera

Why n-1 in the formula for sample standard deviation?, by Stephen A. Book
Correlation is not causation from xkcd
Statistics and dating from xkcd
Stochasticity episode from Radiolab

The post Mathlinks: Statistics Edition appeared first on rweber.net.

Regression Question

Rebecca — Mon, 25 Nov 2013 13:00:09 +0000

For a statistics exam, I wrote a question about the regression effect in application that I quite like. Here it is, with the answer this time.

Q. A lab assistant in charge of measuring how long it takes for rats to run a maze predicts that usually a rat will take less time on its second time through. A colleague of his, however, notes that the rats ought to regress to the mean maze running time, on average. Are these predictions compatible? Why or why not?

A. They are compatible. The lab assistant is predicting overall improvement, and his colleague is predicting the regression effect (if somewhat loosely stated). The key to their agreement is that we would expect the rats to (on average) have less extreme second maze running times relative to the second run’s mean and standard deviation than their first run time was, relative to the first run’s mean and standard deviation. If the second maze run has a lower mean time we could see general improvement of times without losing the regression effect (which of course is a statistical fact unless |r|=1, which seems unlikely). If the first run is assigned to the x-axis and the second run to the y-axis, visualize the graph of individual rat runtimes lying completely below the line x=y.

Note there is no regression fallacy here. Neither person is proposing an explanation for anything, just a guess at what the numbers will look like.

The post Regression Question appeared first on rweber.net.