232 - Causation

Correlation and causation are often confused; a substantial fraction of the population seems to assume that any correlation must have a cause.

Establishing a correlation is relatively easy. Many with an A-level in Statistics or equivalent have known at some time how to do this. Even without knowing all the maths it is not difficult, as the steps below show.

One way to show a correlation:

Put your data into Excel in two columns, preferably adjacent columns.

Select your data and create a scatter graph.

Ask Excel to plot a line of best fit (which it may well do automatically).

Ask Excel for the numbers for that line (its equation and its R² value).

Example below.

Here is an example I made this morning. I found a correlation of 0.888 between the divorce rate and smoking – both are rising. I do not need to know any stats, only how to use Excel, and you can flounder through that once you’ve seen what I produced – choosing the chart type, changing the scales, labelling the axes – none of this is difficult.

The nearer the R² is to one, the greater the correlation (it measures, in a clever way, how close the data points are to the line)¹. Interpretation depends on context and purposes (wiki).

Y-axis data from https://www.infoplease.com/us/marital-status/marriages-and-divorces-1900a2012

X-axis data from https://www.cdc.gov/tobacco/data_statistics/tables/trends/cig_smoking/index.htm
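For anyone who would rather see what Excel is doing behind the button, here is a minimal Python sketch of the same calculation: a least-squares line of best fit plus the correlation coefficient r, whose square is the R² Excel reports for a straight-line fit. The function name fit_line and the six data pairs are invented for illustration; they are not the divorce and smoking figures.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson's correlation coefficient
    return slope, intercept, r

# Invented numbers for illustration only; NOT the divorce/smoking data.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

slope, intercept, r = fit_line(xs, ys)
print(f"y = {slope:.3f}x + {intercept:.3f}, r = {r:.3f}, R^2 = {r * r:.3f}")
```

Nothing in this arithmetic knows or cares whether x has anything to do with y; it only measures how well a straight line fits the pairs.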

For other correlations that might make you grin, or wince, see sites such as this one and this one. Mark Wilson shows a powerful correlation between divorce in Maine and the (Maine implied) consumption of margarine, and then between the age of Miss America and a particular class of steamy murders. Simply because he can. The second site correlates death with unlikely events – the sort of thing I’d have set as a homework if still in daily teaching². Go on, you produce one and share it; I’ll add it in here.

You do not seriously think these things are connected, do you? If you did, which way round would you express the connection? Is it the divorce that causes the smoking, or the smoking that causes the divorce? Is it the margarine that causes divorce in Maine or the other way about? As for Miss America....

The lesson is that Correlation does not imply Causation.

Even something that short is too difficult for some, so, if you need to, sit and say it to yourself a few times. I think I’d get a Stats class to do this primary school activity only after they’d shown they were still confusing the two. It is not true to reduce this to “Correlation is not causation”. Wikipedia quotes Edward Tufte in suggesting that we need to add “but it sure is a hint”.

The causes of the logical error lie in how we view A~B, indicating a connection. Using the symbol => for ‘causes’ and hence <= for ‘is caused by’ the cases include (I hope to have been exhaustive):

A=>B                       Speed on the road causes death. No; excessive speed causes death rather than injury if/when an accident occurs.

A<=B                        As wiki says, economic growth slows when state debt rises over 90% of GDP. Deciding that high debt causes slow growth is wrong. We can argue that slow growth causes debt to increase.

A<=>B                      either at the same time or in turn (A makes B happen, which makes A happen...) E.g. Pythagoras works, so it is a right angled triangle. Since it is a right-angled triangle Pythagoras applies.

C=>A  and C=>B    consequences of a common cause. Spring causes flowers to bloom and storks to fly. That does not mean that the storks affect the flowers. Or the flowers, the storks. The wiki article has good examples that typify the sort of thing ‘we’ think we have read: Sleeping with one's shoes on is strongly correlated with waking up with a headache; Young children who sleep with the light on are much more likely to develop myopia in later life; Since the 1950s, both the atmospheric CO2 level and obesity levels have increased sharply.  I do hope you didn’t read causation into those statements.

A=>C and B=>C      They both cause C to happen. They may not even be correlated. Starvation causes death; war causes death. War and starvation may cause each other sometimes, but not always.

A=>C=>B                  C is an intermediate state. Too often I have seen and heard arguments, where the connection is only a correlation of A to C and of C to B, that A causes B. Unlikely to be true.

And there is coincidence, where the original premise A~B is untrue. See the mountain-shaped graph at the foot of this page.
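The common-cause case above (C=>A and C=>B) is easy to simulate. In this minimal sketch, with invented variables and a fixed random seed, A and B are each built from C plus their own independent noise and never refer to one another. They still correlate at about 0.5, and the correlation all but disappears once the part of each explained by C is subtracted. The helper names pearson_r and residuals are mine, chosen for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def residuals(ys, xs):
    """What is left of ys after removing its least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
c = [rng.gauss(0, 1) for _ in range(n)]   # the common cause ("spring")
a = [ci + rng.gauss(0, 1) for ci in c]    # "flowers": C plus independent noise
b = [ci + rng.gauss(0, 1) for ci in c]    # "storks": C plus independent noise

r_ab = pearson_r(a, b)                                       # A against B, raw
r_ab_given_c = pearson_r(residuals(a, c), residuals(b, c))   # with C's share removed

print(f"r(A,B) = {r_ab:.2f}; after allowing for C = {r_ab_given_c:.2f}")
```

The storks never touch the flowers in this code, yet the raw correlation would look convincing on a scatter graph.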

Particularly, let us be clear that A causing B is not automatically exhaustive (all A cause all B). It may well be true that all A cause B, but that doesn’t mean the only way for B to occur is through A happening.

A tends to produce B is a probabilistic statement such as we find increasingly in medicine, along the lines of ‘obesity does not correlate with operation success’. It is not that the op fails, but it is less likely to succeed. In a climate where measures of success demand value for money, there comes a point at which the marginal success is too low for the resources to be expended. To put that another way, the expected total expenditure (consumption of resource) is reduced by approaching such a problem from a different direction, such as reducing the body size first – in effect saying ‘This other problem needs to be dealt with first’.

What really makes causation? It occurs pretty directly in a lot of science (or at least that is the logic we apply) using mechanisms local and indirect to explain why it is that A causes B. We pay great attention to the apparent exceptions and use those to inform ourselves of how the previous model needs correction. We all (supposedly) know Pythagoras’ Theorem³, but too many of us forget to apply it only to right-angled triangles and, even among those who do, there is a further assumption that the two-dimensional space is flat – not a curved surface.

Which makes me wonder if the root cause of this fault is to do with the way we teach mathematics, the place where we apply cause as a rule and depend upon it for arguing that this means that. And then we spend ages making sure that the proof is necessary and sufficient.

Perhaps this is why we should fight to keep ‘proof’ within the curriculum? Perhaps instead we need cross-curricular use of terminology – I’d like to hear the same language used in a Literature lesson for example, or in History.

DJS 20170629

top pic from Oregon State

1 There’s a test to establish ‘confidence’, based on the number of data points. For this sort of case I have 16 pairs of data so 14 degrees of freedom, so anything above around 0.6 is really significant (search for a Pearson’s correlation coefficient table and read it for yourself). Excellent image from here shows how correlation becomes stronger with more data.
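The ‘around 0.6’ in that footnote can be checked without a correlation table, using the standard link between r and Student’s t: t = r√(n−2)/√(1−r²) with n−2 degrees of freedom. The sketch below is a hedged illustration, not a full significance test. It takes the two-tailed 1% critical value of t for 14 degrees of freedom (2.977) from a printed t-table rather than computing it, and it assumes the 0.888 quoted earlier is r rather than R². The function names are mine.

```python
from math import sqrt

def t_from_r(r, n):
    """t statistic for a correlation r found from n pairs (n - 2 degrees of freedom)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def r_from_t(t, n):
    """Invert the formula above: the r that corresponds to a given t for n pairs."""
    df = n - 2
    return t / sqrt(t * t + df)

n = 16              # pairs of data, so 14 degrees of freedom
r_observed = 0.888  # assumed here to be r, not R^2

# Two-tailed 1% critical value of Student's t with 14 df, read from a standard t-table.
T_CRIT_1PCT_DF14 = 2.977

t_observed = t_from_r(r_observed, n)
r_critical = r_from_t(T_CRIT_1PCT_DF14, n)

print(f"critical r at the 1% level: {r_critical:.3f}")
print(f"observed t: {t_observed:.2f} against critical t: {T_CRIT_1PCT_DF14}")
```

The critical r comes out a little above 0.62, which is where the ‘around 0.6’ comes from, and the observed value clears it comfortably. Significant, of course, still does not mean causal.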

2 Go on, you produce one and share it; I’ll add it in here. Do as I did, give the sources. You don’t have to give me the data. I used a screenshot to capture the image, which is (I have found) the quickest / easiest way to add to my webpages. Making a spreadsheet ‘live’ on a webpage is still beyond me. By which I mean it is more work than I can justify, even with allegedly having nothing to do all day. If you are amused and only want to see more such examples, Google ‘correlation and causation’ and choose images.

3 That no matter how many vestal virgins there are in the temple, they will disagree. Anyway, Pythagoras probably preferred boys. “Archimedes was monogamous, not Pythagoras. And no, I didn’t know either of them personally. I’m not that old. Really.”  Quote from lessons across many years.

© David Scoins 2017