=== Kevin's Gemini space ===

What is statistics actually for?

For most of my working life I have been an experimental scientist or an engineer of some sort. In this capacity I got used to dealing with statistical methods and, over time, developed a feel for what you could do with statistics. I always assumed that other people involved in similar work had the same view of statistics that I did: statistical analysis was a tool that could be used to find out interesting and important things. Over the years, I've come to see that I was wrong. A huge number of people either (a) have no idea what statistics is about at all, or (b) assume that it's a set of mechanical procedures that you have to apply to see if your data makes sense. But this is not at all what it's all about.

There are articles on my Web site about statistics and experimental design; sometimes people read these articles, and come to believe -- often wrongly -- that I can help them with their statistics problems. The main reason I can't help these readers with their statistics is that they haven't grasped the idea that statistics is _for_ something, that it has a purpose. I blame the editors of scientific journals for this; they seem to think that nothing is publishable without a ream of incomprehensible Kolmogorov-Smirnov test results, or something equally impressively daunting. Whatever the reason, the idea has grown up that statistics is something you do if you have to, but otherwise is to be avoided. This idea is closely allied with that other trite platitude: 'you can make statistics say anything you like.' Maybe you can make statistics say anything you like to some people, but you can't make it say something it shouldn't to _me_.

Anyway, in this short article I want to explain what statistics is actually for; I want to try to explain the single most important thing about statistics. There's no maths, Greek symbols, or jargon, just -- I think -- common sense. It doesn't matter whether you know and care about statistics, or not. In fact, you can spend years studying t-tests and confidence intervals, and still not grasp the core philosophy of the subject.

The purpose of statistics is not just important to statisticians, or scientists, it's important to _everyone_. Even if you will never use statistics in your life, and have no interest at all in the subject, and can barely count to ten on your fingers, then -- I respectfully submit -- you still need to understand what follows. If I can get this message across, then even if I achieve nothing else in my life -- and, let's face it, that's looking increasingly likely -- I will still have left the world a slightly better place than I found it.

In short, statistics is important because we can use it to _find out whether something we observe can be applied to new and different situations_. Knowing this allows us to plan for the future, and to make decisions about how to allocate our scarce resources of money, energy and, ultimately, life. In statistics we use the term 'generalisable': an observation is generalisable if it can be used to predict what will happen in new and different situations. If it is not generalisable, it can't. So what is statistics for? It's for determining whether an observation is generalisable or not. It's as simple as that.

Doesn't sound all that earth-shattering, does it? Let's try to illustrate it with an example. Some time ago I overheard a conversation in a pub. I can't remember the exact words, or the names of the people involved. The conversation was about smoking and its effect on your health. As near as I can remember, one chap (let's call him Bob) said something like this:

''I think it's all nonsense, that smoking kills you. It's not as bad as they say. I mean, look at my family: my dad had four brothers, and they all used to smoke four packs of gaspers a day. And the youngest of them is now eighty. Now, my friend John, he's just had to have a lung removed, because it was full of cancer. He's never smoked in his life. It just goes to show, doesn't it? All those doctors haven't got a clue.''

Now, what's wrong with Bob's statement? If you have never been exposed to statistics or experimental science, maybe you're thinking: 'there's something in what he says -- I have an uncle/grandmother/sister who smokes like a chimney and is as fit as a fiddle at age ninety.' If you are, or have been, a statistics user, perhaps at university or in your work, perhaps you're thinking: 'He hasn't defined a null hypothesis, and in any event how am I going to work out a p-value from that?' Both of these standpoints, while perhaps correct, miss the point entirely. Now, before you jump to any conclusions, I'm not anti-smoking. I don't care whether you smoke or not; why should I? My point is simply this: if you choose to smoke, you need to make your own mind up about the consequences. And the _worst_ way you can do that is to listen to someone like our friend Bob. Why? Because his observation _does not generalise_.

The sorry fact is that we don't care all that much about Bob and his family and friends, unless we are personally acquainted with them. Sure, we don't wish them actual harm; we wouldn't, I imagine, drive past their house without stopping if it were on fire. We wouldn't steal their last penny to prop up a wobbly table. But, in the final analysis, what we really care about is _our family_, _our friends_, and _ourselves_. So the most important question you need to ask in relation to Bob's statement is this: ''how does this affect me and mine?'' It's as simple as that. The problem is that it is _impossible_ to answer that question: nothing that Bob has observed has any bearing on you and yours. Here is why it doesn't.

The people he has observed, his father and uncles and his friend John, are not _representative_ of you or me. Suppose you are black, suffer from asthma, have a family history of heart disease, and are generally fit. Bob's family are white, have no history of heart disease, and lead sedentary lives. John is Indian, and is a mountaineering instructor. What reason do you have for thinking that Bob's observations apply to you? Maybe the differences between Bob's family and yours are significant, when it comes to the effects of smoking. Perhaps they aren't. We just don't know.

A properly-designed experiment, with sensible statistical treatment, would allow us to tease out these different factors, and determine which are relevant and which are not. Alternatively, we could include people of different ethnicity, gender, age, occupation, etc., so we get a more realistic sample of the population as a whole. _Then_ we would be getting towards a result that generalises. _Then_ we would be able to judge, with some confidence, whether we ought to take the risk of smoking or not. Without doing this, you may as well plan for the future using a ouija board.
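
For readers who like to see the arithmetic, here is a minimal sketch in Python of why the make-up of a sample matters. Every number in it is invented purely for illustration, and the group names and the rate_in_sample function are hypothetical; none of it is real health data. It compares the outcome rate you would see if you looked at only one group of people with the rate you would see in a sample that mirrors the whole population.

```python
# All figures are made up for illustration; they are NOT real health data.
# Each group maps to (share of the whole population, rate of some outcome in that group).
population = {
    "group_a": (0.5, 0.10),
    "group_b": (0.3, 0.30),
    "group_c": (0.2, 0.50),
}


def rate_in_sample(sample_shares, population):
    """Outcome rate seen in a sample built from the given shares of each group."""
    total = sum(sample_shares.values())
    return sum(
        share / total * population[group][1]
        for group, share in sample_shares.items()
    )


# A sample like Bob's: everybody in it comes from a single group.
narrow = rate_in_sample({"group_a": 1.0}, population)

# A sample whose make-up mirrors that of the whole population.
mirror_shares = {group: share for group, (share, _) in population.items()}
representative = rate_in_sample(mirror_shares, population)

print(f"narrow sample:         {narrow:.2f}")
print(f"representative sample: {representative:.2f}")
```

With these made-up figures the narrow sample suggests a rate of 0.10, while the representative one gives 0.24. Neither number means anything in itself; the point is only that what you see in one group need not be what you would see across everybody.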

Right, rant over. Thanks for listening.

[ Last updated Tue 22 Feb 19:27:12 GMT 2022 ]