Sunday, February 7, 2016

Book Review - How to Lie with Statistics - Darrell Huff

How to Lie with Statistics is an amazing book. Educational, thought-provoking, humorous -- this book is all of these and more. That the first edition was printed about 65 years ago makes it even more awe-inspiring. Almost everything the author explains in the 10 chapters about the devious tricks people use to mislead with statistics remains relevant and applicable today. If you do not have time to read the rest of this post and want one takeaway, it would be to pick up this book right away and read it cover to cover. You will definitely not regret it.

The introduction sets the tone for the rest of the book. It starts with an example where two polls, one by Gallup and another by a newspaper, come up with hugely different estimates of how many people in the US are familiar with the metric system. One said 33% and the other 98%. The author ascribes this anomaly to the massive sampling bias in the newspaper poll. There are other similar examples in the introduction. Clearly, when it comes to statistics, there is usually more to it than meets the eye. A couple of statements about statistics are worth calling out and are reproduced below.
  • "So appealing in a fact minded culture, is employed to sensationalize, inflate, confuse and oversimplify"
  • "Well wrapped statistic is better than Hitler's big lie -- it misleads, yet it cannot be pinned on you"
The first chapter is titled "Sample with built-in bias". Suppose there are two huge barrels full of beans of two different colours, and we need to find the ratio of red to white. Given a sufficiently large sample drawn from these barrels, counting the red and white beans gives the ratio to a reasonable accuracy. But if the sample has a built-in bias, all bets are off. The author explains this with the following example. Suppose the poll question is "Do you like questionnaires?" The results of this poll are likely to show percentages in the high 90s. That is not because a lot of people like questionnaires. It is because the people who dislike them are likely to fling the questionnaire into the nearest dustbin, so their answers never make it to the response pile. Ignoring the people who did not respond all but guarantees a wrong interpretation. The rest of the chapter reinforces this point with several other interesting examples.
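
To make the questionnaire example concrete, here is a minimal Python sketch of non-response bias. It is only an illustration and not something from the book: the function name, the 30% "true" rate and the reply probabilities are all made-up assumptions.

```python
import random


def observed_like_rate(population, true_like_rate, p_reply_if_like,
                       p_reply_if_dislike, seed=0):
    """Share of *returned* questionnaires that say "yes, I like them"."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    yes_replies = 0
    no_replies = 0
    for _ in range(population):
        likes = rng.random() < true_like_rate
        p_reply = p_reply_if_like if likes else p_reply_if_dislike
        if rng.random() < p_reply:  # did this person mail the form back?
            if likes:
                yes_replies += 1
            else:
                no_replies += 1
    return yes_replies / (yes_replies + no_replies)


# Hypothetical numbers: only 30% like questionnaires, but those who do
# reply 90% of the time and those who do not reply just 2% of the time.
biased = observed_like_rate(100_000, 0.30, 0.90, 0.02)

# Same population, but everyone is equally likely to reply.
unbiased = observed_like_rate(100_000, 0.30, 0.50, 0.50)

print(f"biased response pile:   {biased:.1%}")
print(f"unbiased response pile: {unbiased:.1%}")
```

Under these assumed numbers the response pile should show a "yes" share in the mid 90s even though only 30% of the population actually likes questionnaires, while the equal-reply case stays close to 30%.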

"Well chosen average" is the title of the second chapter. This chapter deals with using the word average without revealing whether it is a mean/median/mode and how misleading it can get. The author uses the example of "average income of a neighborhoods" to explain this concept. Assuming that there are a few extremely rich households and a lot of low income households, the average (mean) could be misleading to describe this number. A median or a mode here would be better choice. There is also the urge to assume that distributions are bell shaped. But in reality, quite a few of these are "hockey stick" shaped and so an unqualified average is completely meaningless.

"Little figures that are not there" is the third chapter. The essence of this chapter is this: Well biased samples can be used to produce any result. So can random ones, if some of the underlying information related to the numbers are not published. For example, if the sample size is small enough and you try enough of them, you can manage to get it to produce a distorted figure. For example, on an average, we will get heads half the number of times and tails the other half. But if we just toss the coin only 5 times, is not improbable to get 5H or 4H and 1T. The law of averages will hold good only if we toss the coin sufficiently enough number of times. By masking the sample size, the numbers can be made to confess to anything. There are a bunch of other techniques in this chapter, including playing with words, displaying graphs without labeled axes etc that can be used quite effectively for manipulating the outcome.

Chapter 4 is titled "Much ado about nothing". This chapter explains tricks in which people mislead by relying only on the ordering of items, without actually looking at the absolute numbers. The classic example cited here is a study by an independent agency on the tobacco and nicotine contents of various cigarette brands. The study concluded that there is very little difference between the brands. However, the brand at the bottom of the list still tried to exploit this by citing just the ranking and carefully not revealing the absolute numbers.

"The Gee-Whiz Graph" is the title of Chapter 5. When numbers in tabular form are taboo and words do not work, the author mentions that one can always mislead by drawing a picture. Consider line graphs. Suppose we need to depict the increase of some quantity from 20 to 22. If the vertical scale covers the full range from zero, the 10% increase is depicted in proportion. But if you want to make it dramatic, just blow up the scale so that it runs from 20 to 22 in steps of 0.2. Now the same change will appear to be a steep increase to the naive reader.
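
One rough way to quantify the trick is to ask how much of the chart's height the rise occupies. The helper below is a hypothetical illustration, not anything from the book, and it assumes the honest chart's axis runs from 0 to 22.

```python
def rise_share_of_axis(start, end, axis_min, axis_max):
    """Fraction of the vertical axis that the change from start to end covers."""
    return (end - start) / (axis_max - axis_min)


# Axis starting at zero: the 20 -> 22 rise uses about 9% of the chart height.
honest = rise_share_of_axis(20, 22, axis_min=0, axis_max=22)

# Axis truncated to 20..22: the same rise fills the whole chart.
dramatic = rise_share_of_axis(20, 22, axis_min=20, axis_max=22)

print(f"axis from 0:  {honest:.0%} of the height")
print(f"axis from 20: {dramatic:.0%} of the height")
```

The data is identical in both cases; only the choice of axis changes how steep the line looks.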

"The one dimensional picture" is the title of Chapter 6. Here the devices used are bar charts and pictorial graphs. Say you want to show that workers in country A earn twice as much as workers in country B. A simple bar chart can show that honestly. However, to make it dramatic, depict it with a two dimensional picture and double both the width and the height. Implicitly, the area will appear to be 4 times as large instead of just two times. Make the picture three dimensional and the volume will be 8 times as large.
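
The arithmetic behind this is just a power of the scale factor. The two helper functions below are names I made up for this sketch: the first gives the ratio the eye perceives, the second gives the scale factor an honest picture would need.

```python
def apparent_ratio(linear_scale, dimensions):
    """Size ratio the eye sees when every dimension is scaled by linear_scale."""
    return linear_scale ** dimensions


def honest_linear_scale(true_ratio, dimensions):
    """Scale factor per dimension so that area or volume matches true_ratio."""
    return true_ratio ** (1 / dimensions)


print(apparent_ratio(2, 1))  # bar chart: 2
print(apparent_ratio(2, 2))  # flat picture, doubled both ways: 4
print(apparent_ratio(2, 3))  # solid-looking picture: 8

# To show "twice as much" honestly with a flat picture, each side should
# grow by the square root of 2, roughly 1.41, not by 2.
print(honest_linear_scale(2, 2))
```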

Chapter 7 is titled "The semi-attached figure". Quoting directly from the book, this chapter's trick is this -- "If you can't prove what you want to prove, demonstrate something else and pretend they are the same thing". The first example is the following claim by a drug company: Nostrum cures colds because it kills 30K germs in 11 seconds in a test tube. The details are conveniently left out. An antiseptic that works in a test tube might not work the same way in humans, and it possibly could not be used in the same concentration. There is also no information on the kind of germ that it killed. Another interesting example is an enlistment campaign by the US Navy. The campaign compared the mortality rates inside and outside the Navy: the death rate in the Navy was 9 per 1,000, while over the same period it was 16 per 1,000 for civilians in NY. The campaign then went on to conclude that joining the Navy was probably even safer than remaining a civilian. What is not mentioned is that the Navy is full of super-fit people, while the civilian population includes infants, the old, the ill and so on. The numbers are just not comparable.

Chapter 8 is titled "Post Hoc rides again". This is the well-known maxim "Correlation does not mean causation". The author provides several examples where there may be a correlation, but it is not at all obvious which is the cause and which is the effect. There are several cases where cause and effect are confusingly distorted, reversed and intermingled.

"How to Statisticulate" is the title of Chapter 9. Statisticulate is "Statistics" + "Manipulate". The chapter covers a bunch of other techniques that can be used to manipulate and mislead. The first tool the author discusses here is the map. According to the author, maps offer a fine bag of variables in which facts can be concealed and relationships distorted. Then there are percentages and percentiles. Index numbers are also critical for proper representation of data. This chapter has examples of how each of these can be used to paint a picture completely different from reality.

Chapter 10, the last chapter, is titled "How to talk back to a statistic". The author suggests five questions that one should ask to avoid being misled by these devious mechanisms.

  • Who says so?
  • How does he know?
  • What is missing?
  • Did somebody change the subject?
  • Does it make sense?

In essence, this book is not only a great read, but also something one should get for one's library, to refer to time and again.