Monday, March 23, 2015

The impact of outliers on the arithmetic mean (or, do people like this book?)

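Consider these ratings of a target item (1 to 5 stars):

[Graph: "Ratings", frequency by number of stars: two 1-star, no 2-star, no 3-star, two 4-star, and eight 5-star ratings; mean = 4.17]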
Based on these ratings, what is your impression of the item? Kinda so-so? Maybe look elsewhere? That's the power of outliers on the arithmetic mean: A few outliers can really pull a mean away from the bulk of the responses. It takes a ton of ratings in the mode to counteract only a few outliers.
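
The arithmetic behind that claim can be checked directly. Here is a minimal sketch in R that reuses the rating counts from the graph code further down (y = c(2,0,0,2,8)). The target mean of 4.5 is an arbitrary threshold picked only for illustration, and the variable names are mine.

```r
stars = c(1,2,3,4,5)
freq  = c(2,0,0,2,8)
ratings = rep( stars , times=freq )          # expand counts into individual ratings
meanAll    = mean( ratings )                 # 50/12, about 4.17
medianAll  = median( ratings )               # middle of the sorted ratings
modeAll    = stars[ which.max(freq) ]        # most frequent star level
meanNoOnes = mean( ratings[ ratings > 1 ] )  # mean with the 1-star ratings dropped

# How many extra 5-star ratings are needed before the mean reaches the
# (hypothetical) target of 4.5?
target = 4.5
extra = 0
while ( ( sum(ratings) + 5*extra ) / ( length(ratings) + extra ) < target ) {
  extra = extra + 1
}
```

With these counts, the two 1-star ratings pull the mean from 4.8 down to about 4.17 while the median and mode both stay at 5, and it takes eight additional 5-star ratings to bring the mean back up to 4.5.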

These are real data, of course, namely from DBDA2E on Amazon. The 1-star ratings have comments clearly stating that they are not rating the content of the book, but they are still 1-star ratings, and they have a lot of impact on the mean. If you think the mode needs bulking up, you know what to do! :-) And if you have had issues like the 1-star raters have had, please let me know so we can attempt to rectify any problems. (By the way, go here for a link to a discount on the book.)

In general, how can we analyze data that have outliers? One way is describing the data by using a heavy-tailed distribution, which DBDA2E explains extensively in Chapters 16 and 17 (and ordinal data analysis is treated in Chapter 23).
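
As a rough illustration of why a heavy-tailed description helps, here is a minimal sketch in R. It is not the Bayesian approach that DBDA2E uses. It relies on plain maximum likelihood via optim(), on made-up metric data with two outliers (not the star ratings), and on a t distribution whose normality parameter is fixed at an arbitrary nu = 2. All names and values are assumptions for illustration only.

```r
# Made-up metric data: a tight cluster near 10 plus two low outliers
dat = c( 9.6 , 9.8 , 9.9 , 10.0 , 10.0 , 10.1 , 10.2 , 10.4 , 2.0 , 1.5 )
nu = 2   # fixed, arbitrary small value that gives heavy tails

# Negative log-likelihood of a location-scale t distribution.
# The scale is optimized on the log scale so that it stays positive.
negLogLik = function( par , y , nu ) {
  mu = par[1]
  sigma = exp( par[2] )
  -sum( dt( (y-mu)/sigma , df=nu , log=TRUE ) - log(sigma) )
}

# Start the search at robust summaries of the data
fit = optim( c( median(dat) , log(mad(dat)) ) , negLogLik , y=dat , nu=nu )
muT = fit$par[1]       # location estimate under the heavy-tailed t distribution
muNormal = mean(dat)   # location estimate under a normal distribution
```

Under these assumptions the normal-based estimate is dragged down to 8.35 by the two outliers, whereas the t-based estimate should stay near the main cluster around 10, because the heavy tails treat the outliers as unsurprising rather than moving the center toward them.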

BTW, here's the R code I used for making the graph:

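x = c(1,2,3,4,5)               # star levels
y = c(2,0,0,2,8)               # number of ratings at each star level
meanRating = sum(x*y)/sum(y)   # frequency-weighted mean of the ratings
# Draw one thick vertical bar per star level:
plot( x , y , type="h" , lwd=70 , lend=1 , col="gold" , xlab="Stars" , ylab="Frequency" , main="Ratings" , xlim=c(0.5,5.5) , ylim=c(0,9) , cex.lab=1.5 , cex.main=1.5 )
# Label the mean at its position on the x axis, at the height of the tallest bar:
text( meanRating , max(y) , bquote(mean==.(round(meanRating,2))) , adj=c(1,1) , cex=1.5 )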


  1. I'm the only 1-star rater so far. I get what you mean with outliers, and I think the rating system at Amazon should be revised. Prospective buyers have the right to know the expected quality of the content, as well as its expected condition. But pooling the ratings like Amazon does is unfair to authors. It's not YouTube. Condition and content ratings should be separate because they matter separately.

  2. It's absolutely true that prospective buyers have the right to know about issues like Amazon's Kindle errors and Amazon's print-on-demand quality problems. And it's genuinely admirable that readers (like you!) will take the effort to alert other prospective buyers, to both the great and not-so-great aspects of a product. The issue is only in what star rating to give when pointing out a physical condition issue introduced by Amazon. It is not necessary to flag a problem by giving a 1-star rating. For example, there is a review of the Bayesian book by Gelman et al. that says in the review title "Don't buy the Kindle edition" but also gives the book 5 stars. There is another review of the Bayesian book by Gelman et al. that says in the review title "My print had some legibility problems" but also gives the book 4 stars. Those reviewers very effectively alerted other prospective buyers to Amazon's condition problems without trashing the rating of the book. In any case, thank you for buying the book! I genuinely hope that the book serves you well. I think it's the most accessible introduction to applied Bayesian analysis that's presently available.