How to Not Misunderstand Correlation

Why a coefficient of 0 does not mean ‘no relationship’

Bex T.
Towards Dev

--

Photo by Thirdman on Pexels

Become One With the Data

Are ice-cream sales related to the weather? Do loud songs tend to be more popular? Do you weigh more as you age? Does the number of storks in your town correspond to the number of newborn babies?

These are the types of questions you may come up with when exploring a new dataset. While it may be tempting to answer them from instinct and experience, you have to turn to the loving arms of statistics to validate your assumptions.

There is a commonly used metric called the correlation coefficient that gives you a clearer picture of how one variable relates to another. You will often use it during bivariate analysis (analysis between two variables) to find out whether there is a linear relationship.

This post is all about interpreting and computing these coefficients and how to avoid common pitfalls that come with them.


Setup
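
Here is a minimal setup sketch covering everything the snippets below rely on:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```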

Scatterplots Masterclass

Before we start learning about correlation coefficients, it is important that you know how to plot the perfect scatterplot. Correlation and scatterplots go hand in hand, as they help confirm each other's results.

One of the most common and most easily understood plots is the scatterplot. It may be simple, but you might be surprised how difficult it is to build one that perfectly captures the relationship between variables. As an example, I will load the historical Olympics dataset, which contains data on athletes from 1896 to 2016:

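A sketch of the loading step; the file name assumes the Kaggle "120 years of Olympic history" CSV stored locally:

```python
# File name is an assumption: the Kaggle "120 years of Olympic history" dataset
olymp = pd.read_csv("athlete_events.csv")
olymp.head()
```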

Scatterplots work very well for showing the relationship between two variables, especially numeric ones. Traditionally, we would use pyplot's scatter or seaborn's scatterplot function to generate them. However, for large datasets such as ours, pyplot's plot function offers a much faster way of drawing the same picture.

We will plot height versus weight using plt.plot, setting the format string to 'o', which draws the data points as dots:

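A sketch of that call:

```python
heights = olymp["Height"]  # centimeters
weights = olymp["Weight"]  # kilograms

# The 'o' format string draws unconnected circular markers instead of a line
plt.plot(heights, weights, "o")
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()
```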

From the plot, we can see that taller athletes tend to weigh more. However, as there are more than 200k data points, the plot is overplotted: with so many points piled on top of each other, it is hard to tell whether a region contains many points or only a few. To improve this, we will add transparency using the alpha parameter of plt.plot:

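Something like the following; the exact alpha value is an assumption, and any small value works:

```python
# A low alpha lets dense regions show up darker than sparse ones
plt.plot(heights, weights, "o", alpha=0.02)
plt.show()
```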

Even though we added transparency, the plot still looks overplotted. To improve it further, we will decrease the marker size using the markersize parameter:

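For example (a markersize of 1 is an assumed value; tune it to your data):

```python
plt.plot(heights, weights, "o", alpha=0.02, markersize=1)
plt.show()
```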

The plot looks much nicer, but now we can see that the heights are grouped into columns. The reason might be that they were reported in inches and then converted to rounded, integer centimeters.

To get around this problem, we can use a technique called jittering: adding small random noise to the plotted values so that the artificial grouping disappears.

If we were using seaborn, we could pass the x_jitter or y_jitter parameters (of functions like regplot), but matplotlib does not have them. Instead, we will use the np.random.normal function, which draws random samples from a normal distribution with a given mean and standard deviation. In effect, we are filling back in the precision that was lost to rounding.

So as not to shift the real values, we will draw the noise from a normal distribution with a mean of 0 and a standard deviation of 2 (an arbitrary choice, but it needs to be small). Then, we will add this noise to the athlete heights:

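A sketch of the jittering step:

```python
# Zero-mean noise with std 2 "un-rounds" the integer heights;
# the noisy copy is used only for plotting, never for analysis
height_jitter = heights + np.random.normal(0, 2, size=len(heights))

plt.plot(height_jitter, weights, "o", alpha=0.02, markersize=1)
plt.show()
```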

Now that we have gotten rid of the columns, we can see that the weights also group into rows. We will perform the same operation on the weights, with a lower standard deviation:

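For instance (a standard deviation of 1 is an assumed "lower" value):

```python
weight_jitter = weights + np.random.normal(0, 1, size=len(weights))

plt.plot(height_jitter, weight_jitter, "o", alpha=0.02, markersize=1)
plt.show()
```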

Finally, we have a plot that is close to perfection. The final step would be to zoom in on the main cluster of points:

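Roughly like this (the exact limits are assumptions, chosen by eye):

```python
plt.plot(height_jitter, weight_jitter, "o", alpha=0.02, markersize=1)
plt.axis([140, 200, 30, 120])  # [xmin, xmax, ymin, ymax]
plt.show()
```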

The plt.axis function takes a list of four values: minimum x, maximum x, minimum y, and maximum y, respectively.

To see the effect of our latest operations, you can compare the final version with the initial plot:

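One way to make the comparison is a side-by-side figure (a sketch):

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(heights, weights, "o")
ax1.set_title("Before")

ax2.plot(height_jitter, weight_jitter, "o", alpha=0.02, markersize=1)
ax2.axis([140, 200, 30, 120])
ax2.set_title("After")

plt.show()
```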

It is very important to get scatterplots right. Before you move on to regression and further steps, you have to do your best to understand the relationship between the variables of interest. In particular, finding out from scatterplots whether the relationship is linear is crucial for understanding the correlation coefficient, which we will discuss next.

Pearson’s Correlation Coefficient

In the previous section, we learned how to visualize the relationship between two variables. In this section, we will learn a metric that quantifies the strength of these relationships. In statistics, this metric is called Pearson’s correlation coefficient. It takes values between -1 and 1.

The higher the absolute value of the coefficient, the stronger the relationship. The sign of the coefficient indicates the direction:

[Scatterplots of various datasets with their correlation coefficients. Image from Wikipedia]

A positive coefficient means that as the x variable increases, the y variable increases too, whereas a negative coefficient represents a pattern where y decreases as x increases.

Visually, Pearson's correlation coefficient tells us how tightly the data points cluster around a straight line. In the first row of the above image, you can see correlations of varying strengths.

However, it is important not to confuse correlation with slope. Consider the tricky examples in the second row, where the correlation is the same but the slopes are different.

Moreover, Pearson's coefficient only captures linear relationships. If the trend does not resemble a line, it does not matter how closely the data points follow it. You can see such examples in the final row of the image.

Generally, there are three categories of correlation:

[The three categories of correlation strength. Image by author]

The correlation coefficient is usually denoted by a lower-case r. For the curious, here is the formula used under the hood by software packages to compute Pearson's correlation:

r = cov(x, y) / (σ_x · σ_y)

If you did your homework on statistics, you will recognize that the formula is the covariance of x and y divided by the product of their standard deviations. (By the way, you can now deduce why the central plot in the earlier image, a horizontal scatterplot, has no defined correlation: a completely horizontal or vertical trend means that one of the variables is constant, so its standard deviation is 0, and you cannot divide by zero.)
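To make the formula concrete, here is a hand-rolled sketch checked against pandas (ddof=0 makes the standard deviations population versions, matching the covariance computed with a plain mean; the normalizations cancel either way):

```python
sub = olymp[["Height", "Weight"]].dropna()
x, y = sub["Height"], sub["Weight"]

# Covariance of x and y over the product of their standard deviations
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
r = cov_xy / (x.std(ddof=0) * y.std(ddof=0))

print(r)          # manual Pearson's r
print(x.corr(y))  # pandas' built-in version gives the same value
```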

Moving on to code, the correlation coefficient can be computed using the corr method of a DataFrame. We will look at the coefficients for just the age, weight, and height columns of the Olympics dataset:

>>> olymp[['Weight', 'Height', 'Age']].corr()
[Correlation matrix: Height–Weight ≈ 0.80, Age–Weight ≈ 0.21, Age–Height ≈ 0.14]

The result is a correlation matrix that shows the coefficient for each pair of the three variables. Interpreting the results, we can see that height and weight are highly correlated, with a coefficient of 0.8. However, the relationships between age and weight, and between age and height, are weak (0.21 and 0.14, respectively).

However, you should never draw conclusions about the relationship between variables just by looking at the correlation coefficient. Make sure your assumptions are correct by checking them visually with scatterplots or, in some cases, boxplots or violin plots.

For example, we can double-check the correlation between age and weight using a scatterplot. The raw plot won't give us much insight, so I will make a few changes using the techniques from the previous section:

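A sketch of those changes; the jitter scale and axis limits are assumptions:

```python
# Ages are recorded as integers, so they benefit from jittering too
age_jitter = olymp["Age"] + np.random.normal(0, 0.5, size=len(olymp))

plt.plot(age_jitter, weight_jitter, "o", alpha=0.02, markersize=1)
plt.axis([10, 50, 30, 120])
plt.show()
```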

The data points are widely spread out, with only a slightly positive trend, which is consistent with the low coefficient (0.21).

A coefficient close to 0 does not mean 'no relationship'

After you compute a correlation matrix, make sure to check your assumptions about the coefficients visually before moving on. One of the most common pitfalls of correlation is that two variables without a linear relationship will produce a coefficient very close to 0.

Instead of jumping to the conclusion that the variables are unrelated, create a scatterplot or a similar diagram that shows the general trend: a correlation close to 0 can conceal a strong non-linear relationship.

If your correlation is close to 0 but the visual reveals a curvy trend, the relationship is non-linear, and Pearson's coefficient simply cannot capture it.
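A quick synthetic demonstration: a parabola is about as strong as a relationship gets, yet Pearson's r comes out near 0:

```python
x = np.linspace(-5, 5, 1000)
y = x ** 2  # perfect, fully deterministic non-linear relationship

print(np.corrcoef(x, y)[0, 1])  # ~0.0, because the trend is not a line

plt.plot(x, y, "o", markersize=2)
plt.show()
```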
