Sunday, May 19, 2019

The Statistician's Error?

I just attended (and gave a talk at) the United States Conference on Teaching Statistics (USCOTS). Big thanks to Allan Rossman, who brilliantly MC-ed the conference.

• One keynote was about "moving beyond p < .05" in a talk by Ron Wasserstein and Allen Schirm. In their recent editorial in The American Statistician (with Nicole Lazar), a primary recommendation was Don't Say "Statistically Significant". Decisions with p values are about controlling error rates, but dichotomous decisions let people slip into "bright line" thinking wherein p < .05 means real and important and p > .05 means absent and unimportant.

• Another keynote, in a talk by Kari Lock Morgan, was about three possible explanations of an apparent effect of a manipulation, namely (i) genuine cause, (ii) random difference at baseline before manipulation, and (iii) random difference after manipulation.

I returned home from the conference this morning. To relax, after the intensive pre-conference preparation and during-conference insomnia, I opened a book of poetry and came across a poem by Aaron Fogel that (inadvertently) reflects upon both talks. It's a poem about how editors of printing make decisions regarding errors, about three sources of error, and about distinguishing those sources. And about the role of editors (and perhaps of statisticians?).

The Printer's Error
Fellow compositors
and pressworkers!
I, Chief Printer
Frank Steinman,
having worked fifty-
seven years at my trade,
and served five years
as president
of the Holliston
Printer's Council,
being of sound mind
though near death,
leave this testimonial
concerning the nature
of printers' errors.
First: I hold that all books
and all printed
matter have
errors, obvious or no,
and that these are their
most significant moments,
not to be tampered with
by the vanity and folly
of ignorant, academic
textual editors.
Second: I hold that there are
three types of errors, in ascending
order of importance:
One: chance errors
of the printer's trembling hand
not to be corrected incautiously
by foolish professors
and other such rabble
because trembling is part
of divine creation itself.
Two: silent, cool sabotage
by the printer,
the manual laborer
whose protests
have at times taken this
historical form,
covert interferences
not to be corrected
censoriously by the hand
of the second and far
more ignorant saboteur,
the textual editor.
Three: errors
from the touch of God,
divine and often
obscure corrections
of whole books by
nearly unnoticed changes
of single letters
sometimes meaningful but
about which the less said
by preemptive commentary
the better.
Third: I hold that all three
sorts of error,
errors by chance,
errors by workers' protest,
and errors by
God's touch,
are in practice the
same and indistinguishable.
Therefore I,
Frank Steinman,
for thirty-seven years,
and cooperative Master
of the Holliston Guild
eight years,
being of sound mind and body
though near death
urge the abolition
of all editorial work
and manumission
from all textual editing
to leave what was
as it was, and
as it became,
except insofar as editing
is itself an error, and
therefore also divine.

Wednesday, March 13, 2019

Bayesian statistics at Princeton University, with a visit to the grave of John Von Neumann

I was very pleased to have had the opportunity to present a talk regarding Bayesian models of ordinal data at Princeton University, on Tuesday, March 5, 2019. An abstract of the talk is here, and the published article on which the talk was based is here.

A big thanks to Dr. Ting Qian who orchestrated the visit wonderfully. The lecture hall was packed, with people standing outside the door in the hall, in no small part due to Ting's organization of a popular statistics series. Big thanks also to the staff who made arrangements. And, of course, thanks to the many people who took time to meet with me while I was there.

I took a few hours the next morning to visit some special places in Princeton. In particular, I visited the graves of John Von Neumann and Kurt Godel. Von Neumann made contributions to Monte Carlo methods at the foundation of MCMC methods in Bayesian analysis.

Here's a snapshot of Von Neumann's grave stone:

And only a few feet away, Godel's grave stone:

After visiting the cemetery, I went to the Princeton University Art Museum and came across this painting by Georgia O'Keeffe. It's titled, "From a New Jersey Weekend II," painted in 1941.
It was an interesting coincidence that both O'Keeffe's and my visits featured two notable grave stones.

Here's a map of the Princeton Cemetery, with a marker at the approximate location of the graves of Von Neumann and Godel.

For posts about the book, DBDA2E, visiting other famous grave sites, see this post and its links to previous posts.

Thursday, February 21, 2019

A Stendhal moment on the way to Bayesian stats class

On the way to my Bayesian stats class this morning I had a few moments of Stendhal syndrome and thought I'd share it with y'all. (Aside from being on the way to Bayesian stats class, it has nothing directly relevant to Bayesian statistics.) The sun was shining through some construction fences lining a sidewalk and was unintentionally beautifying an otherwise routine Thursday morning. Here are a couple of snapshots from my office window:

It reminds me of Christo and Jeanne-Claude's Running Fence:

Sunday, October 21, 2018

Sinusoidal trend and global warming UPDATED

In a previous post from six years ago, I fit a sinusoidal trend, with auto-regressive component, to daily temperature data. (Spoiler alert: It's still getting warmer.) Recently I've received inquiries about the script for that analysis. I disinterred the ancient script, updated it, and grabbed more recent temperature data. The script and data file are linked below.

The result of the new analysis:
As you can see from the plot (above), the slope of the linear spine of the sinusoidal variation is 0.068 degrees Fahrenheit per year. The 95% HDI on the estimate spans zero, just as it did with the smaller data set in the previous post from six years ago. But I'm pretty sure that if this city were put into a big hierarchical model with lots of other cities across the globe, the high-level estimate of slope on the linear spine would be clearly greater than zero.

But evidence for global warming is not the point of this post. The point is to link the full script and data file. Here they are: R script; data. Hope this is helpful.

Wednesday, September 19, 2018

Which movie (treatment, group) is better? Opposite conclusions from different models.

Which movie is better? One way to answer is by considering the star ratings given to those movies. Just treat those 1-to-5 star ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

The analogous structure arises in many situations. Suppose, for example, we ask which group is happier, a group of poor people or a group of rich people? One way to answer is by considering subjective happiness ratings from an ordinal scale: 1 = very unhappy, 2 = mildly unhappy, 3 = neither unhappy nor happy, 4 = mildly happy, 5 = very happy. Just treat those 1-to-5 ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

Or, consider ratings of symptom intensity in different treatment groups. How bad is your headache? How depressed do you feel? Just treat the ratings as numbers and throw them into a t test, and out pops your answer. Right? Not necessarily...

Treating ordinal values as if they were numeric can lead to misinterpretations. Ordinal values do not indicate equal distances between their levels, nor equal coverage of each level. The conventional t test (and ANOVA and least-squares regression, etc.) assumes the data are metric values normally distributed around the model's predicted values. But obviously ordinal data are not normally distributed metric values.

A much better model of ordinal data is the ordered-probit model, which assumes a continuous latent dimension that is mapped to ordinal levels by slicing the latent dimension at thresholds. (The ordered-probit model is not the only good model of ordinal data, of course, but it's nicely analogous to the t test etc. because it assumes normally distributed noise on the latent dimension.)
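To make the threshold idea concrete, here is a minimal Python sketch (not the R code linked from this post) of how an ordered-probit model turns a latent normal distribution into probabilities of the ordinal levels. The function name `ordinal_probs`, the threshold values, and the values of mu and sigma are all assumptions chosen only for illustration.

```python
from statistics import NormalDist

def ordinal_probs(mu, sigma, thresholds):
    """Probability of each ordinal level under an ordered-probit model.

    K-1 increasing thresholds slice the latent normal(mu, sigma)
    dimension into K ordinal levels.
    """
    latent = NormalDist(mu, sigma)
    # Cumulative probability at each threshold, padded with 0 and 1
    cum = [0.0] + [latent.cdf(t) for t in thresholds] + [1.0]
    # The area between adjacent thresholds is the probability of that level
    return [hi - lo for lo, hi in zip(cum, cum[1:])]

# Hypothetical thresholds for a 1-to-5 star scale
thresholds = [1.5, 2.5, 3.5, 4.5]
probs = ordinal_probs(mu=3.0, sigma=1.0, thresholds=thresholds)
print([round(p, 3) for p in probs])
```

In a real analysis the thresholds, mu, and sigma are estimated from the data rather than fixed in advance as they are here.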

The t test and the ordered probit model can produce opposite conclusions about the means of the groups. Here's an example involving star ratings from two movies:

The figure above shows data from two movies, labelled as Cases 5 and 6 in the first two columns. The pink histograms show the frequency distributions of the star ratings; they are the same in the upper and lower rows. The upper row shows the results from the ordered-probit model. The lower row shows the results from the metric model, that is, the t test. In particular, the right column shows the posterior difference of mu's for the two movies. The differences are strongly in opposite directions for the two analyses. Each posterior distribution is marked with a dotted line at a difference of zero, and the line is annotated with the percentage of the distribution below zero and above zero. Notice the ordered-probit model fits the data much better than the metric model, as shown by the posterior predictions superimposed on the data: blue dots for the ordered probit model, and blue normal distributions for the metric model. (This is Figure 8 of the article linked below.)

Read all about it here:

Published article:

Preprint manuscript:

R code:

Friday, September 7, 2018

Analyzing ordinal data with metric models: What could possibly go wrong? (just published)

Just published:

Analyzing ordinal data with metric models: What could possibly go wrong?
We surveyed all articles in the Journal of Personality and Social Psychology (JPSP), Psychological Science (PS), and the Journal of Experimental Psychology: General (JEP:G) that mentioned the term “Likert,” and found that 100% of the articles that analyzed ordinal data did so using a metric model. We present novel evidence that analyzing ordinal data as if they were metric can systematically lead to errors. We demonstrate false alarms (i.e., detecting an effect where none exists, Type I errors) and failures to detect effects (i.e., loss of power, Type II errors). We demonstrate systematic inversions of effects, for which treating ordinal data as metric indicates the opposite ordering of means than the true ordering of means. We show the same problems — false alarms, misses, and inversions — for interactions in factorial designs and for trend analyses in regression. We demonstrate that averaging across multiple ordinal measurements does not solve or even ameliorate these problems. A central contribution is a graphical explanation of how and when the misrepresentations occur. Moreover, we point out that there is no sure-fire way to detect these problems by treating the ordinal values as metric, and instead we advocate use of ordered-probit models (or similar) because they will better describe the data. Finally, although frequentist approaches to some ordered-probit models are available, we use Bayesian methods because of their flexibility in specifying models and their richness and accuracy in providing parameter estimates. An R script is provided for running an analysis that compares ordered-probit and metric models.

Fig. 4. Ordinal mean as a function of latent mean (mu) and SD (sigma). Groups marked Ⓐ and Ⓑ illustrate a false alarm (Type I error) for which the underlying means are exactly equal but the ordinal means are very different. Groups marked Ⓑ and Ⓓ illustrate a miss (Type II error) for which the underlying means are quite different but the ordinal means are exactly equal. Groups marked Ⓒ and Ⓓ illustrate an inversion for which the underlying means have μD > μC but the ordinal means incorrectly have μC > μD.
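The false-alarm case in the caption can be reproduced numerically. The following Python sketch is only an illustration under assumed values: the thresholds, the latent means, and the latent SDs are hypothetical and are not the ones plotted in the figure. It computes the mean of the ordinal values implied by a latent normal distribution, for two groups that share the same latent mean but differ in latent SD.

```python
from statistics import NormalDist

# Hypothetical thresholds separating ratings 1|2|3|4|5 on the latent scale
THRESHOLDS = [1.5, 2.5, 3.5, 4.5]

def ordinal_mean(mu, sigma, thresholds=THRESHOLDS):
    """Mean of the ordinal values 1..K implied by a latent normal(mu, sigma)."""
    latent = NormalDist(mu, sigma)
    cum = [0.0] + [latent.cdf(t) for t in thresholds] + [1.0]
    probs = [hi - lo for lo, hi in zip(cum, cum[1:])]
    # Treat the ordinal labels as numbers, as a metric analysis would
    return sum(level * p for level, p in enumerate(probs, start=1))

# Two groups with identical latent means but different latent SDs
mean_narrow = ordinal_mean(mu=5.0, sigma=0.5)
mean_wide = ordinal_mean(mu=5.0, sigma=3.0)
print(f"narrow SD: {mean_narrow:.2f}, wide SD: {mean_wide:.2f}")
```

Because the latent mean sits near the top of the scale, widening the latent SD pushes probability mass into the lower ratings, so the ordinal means differ even though the latent means are equal.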

Published article:

Preprint manuscript:

R code:

Tuesday, May 8, 2018

Just published: "Rejecting or Accepting Parameter Values in Bayesian Estimation"

Just published: "Rejecting or Accepting Parameter Values in Bayesian Estimation" in the journal, Advances in Methods and Practices in Psychological Science.

Abstract: This article explains a decision rule that uses Bayesian posterior distributions as the basis for accepting or rejecting null values of parameters. This decision rule focuses on the range of plausible values indicated by the highest density interval of the posterior distribution and the relation between this range and a region of practical equivalence (ROPE) around the null value. The article also discusses considerations for setting the limits of a ROPE and emphasizes that analogous considerations apply to setting the decision thresholds for p values and Bayes factors.

Figure 1 of the article.

From the introduction:
In everyday life and in science, people often gather data to estimate a value precisely enough to take action. We use sensory data to decide that a fruit is ripe enough to be tasty but not overripe—that the ripeness is “just right” (e.g., Kappel, Fisher-Fleming, & Hogue, 1995, 1996). Scientists measured the position of the planet Mercury (among other things) until the estimate of the parameter γ in competing theories of gravity was sufficiently close to 1.0 to accept general relativity for applied purposes (e.g., Will, 2014).

These examples illustrate a method for decision making that I formalize in this article. This method, which is based on Bayesian estimation of parameters, uses two key ingredients. The first ingredient is a summary of certainty about the measurement. Because data are noisy, a larger set of data provides greater certainty about the estimated value of measurement. Certainty is expressed by a confidence interval in frequentist statistics and by a highest density interval (HDI) in Bayesian statistics. The HDI summarizes the range of most credible values of a measurement. The second key ingredient in the decision method is a range of parameter values that is good enough for practical purposes. This range is called the region of practical equivalence (ROPE). The decision rule, which I refer to as the HDI+ROPE decision rule, is intuitively straightforward: If the entire HDI—that is, all the most credible values—falls within the ROPE, then accept the target value for practical purposes. If the entire HDI falls outside the ROPE, then reject the target value. Otherwise, withhold a decision.

In this article, I explain the HDI+ROPE decision rule and provide examples. I then discuss considerations for setting the limits of a ROPE and explain that similar considerations apply to setting the decision thresholds for p values and Bayes factors.
The published article is available here, and a pre-print version, with some differences in details, is available here.
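The HDI+ROPE decision rule quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the article: the function names `hdi` and `hdi_rope_decision`, the ROPE limits, and the simulated "posterior samples" are all assumptions, and the HDI is approximated as the shortest interval containing 95% of the samples.

```python
import math
import random

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the sampled values."""
    xs = sorted(samples)
    n = len(xs)
    k = min(n, max(1, math.ceil(mass * n)))  # points inside the interval
    # Slide a window of k points and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

def hdi_rope_decision(samples, rope, mass=0.95):
    """Apply the HDI+ROPE rule: 'accept', 'reject', or 'undecided'."""
    lo, hi = hdi(samples, mass)
    rope_lo, rope_hi = rope
    if rope_lo <= lo and hi <= rope_hi:   # entire HDI inside the ROPE
        return "accept"
    if hi < rope_lo or lo > rope_hi:      # entire HDI outside the ROPE
        return "reject"
    return "undecided"

# Simulated stand-ins for MCMC posterior samples (hypothetical values)
rng = random.Random(1)
rope = (-0.1, 0.1)
near_null = [rng.gauss(0.0, 0.02) for _ in range(5000)]
far_from_null = [rng.gauss(0.5, 0.05) for _ in range(5000)]
straddling = [rng.gauss(0.1, 0.05) for _ in range(5000)]

for name, s in [("near null", near_null),
                ("far from null", far_from_null),
                ("straddling", straddling)]:
    print(name, hdi_rope_decision(s, rope))
```

The sketch treats an HDI limit that lands exactly on a ROPE limit as inside the ROPE; that boundary convention is a choice made here, not something the quoted text specifies.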