This article aims to distinguish between Variance, Covariance, Correlation, Autocovariance, and Autocorrelation. It also includes a discrete numerical example for computing the Autocorrelation in Python. The formulae are stated here without mathematical derivation.
Variance measures the spread of a r.v. (random variable).
\[\begin{equation} Var(X) = E[( X-\bar{x} )^2] = \sum_{i=1}^{n} { (x_i -\bar{x})^{2} f(x_i)} = E[X^2]- \bar{x}^2 \end{equation} \tag{1}\label{eq:eq1} \]
If we compute the variance of a sample (I will write an article about this soon), the formula is slightly different. For the sample variance, which we refer to as $s^2$, we multiply by $1 / (n-1)$ instead of $f(x_i)$, where $n$ is the length of the sequence.
\[\begin{equation} s^2(X) = \frac{1}{n-1}\sum_{i=1}^{n} { (x_i -\bar{x})^{2} } \end{equation} \tag{2}\label{eq:eq2} \]
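As a small numerical check of equations (1) and (2), here is a minimal sketch in plain Python. The data values are made up for illustration, and equation (1) is evaluated under the assumption that every observation is equally likely, i.e. $f(x_i) = 1/n$.

```python
import statistics

# Hypothetical data, chosen only for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
x_bar = sum(x) / n

# Sum of squared deviations from the mean.
ss = sum((xi - x_bar) ** 2 for xi in x)

# Equation (1), assuming a uniform weight f(x_i) = 1/n.
var_population = ss / n

# Shortcut form of equation (1): E[X^2] - mean^2.
var_shortcut = sum(xi ** 2 for xi in x) / n - x_bar ** 2

# Equation (2): sample variance with the 1/(n-1) factor.
var_sample = ss / (n - 1)

print(var_population)            # 4.0
print(var_shortcut)              # 4.0
print(var_sample)                # 32/7, slightly larger than 4.0
print(statistics.variance(x))    # standard-library sample variance, for comparison
```

The sample variance is always a little larger than the $1/n$ version because it divides the same sum by a smaller number.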
Similarly, if you have two r.v.s, X and Y, of the same length $n$, you can measure how much they tend to change together (co-vary): if one goes up, does the other go up too? Or down? Or neither?
\[\begin{equation} Cov(X,Y) = E[( X-\bar{x} ) (Y-\bar{y})]= \frac{1}{n-1} \sum_{i=1}^{n} { (x_i -\bar{x})(y_i -\bar{y}) }= E[XY] - \bar{x}\bar{y} \end{equation} \tag{3}\label{eq:eq3} \]
In the above formula we multiplied by $\frac{1}{n-1}$ as we did for the sample variance.
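The sample form of equation (3) can be sketched in a few lines of plain Python. The function name `sample_covariance` and the data are hypothetical and only serve as an illustration.

```python
def sample_covariance(x, y):
    """Sample covariance with the 1/(n-1) factor, as in equation (3)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical paired measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(sample_covariance(x, y))                      # 1.5  (they tend to rise together)
print(sample_covariance(x, [-yi for yi in y]))      # -1.5 (one rises, the other falls)
print(sample_covariance(x, x))                      # 2.5  (covariance with itself = sample variance)
```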
We can normalize the Covariance such that its outcome is between -1 and 1, where 1 means that the two variables have a perfect “the more X, the more Y” relationship, and -1 means a perfect “the more X, the less Y” relationship. This normalized version is called the Correlation:
\[\begin{equation} Cor(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} = \frac{E[( X-\bar{x} ) (Y-\bar{y})] }{\sqrt{Var(X)Var(Y)}} \end{equation} \tag{4}\label{eq:eq4} \]
If we apply the function for the expected value ($E[\cdot]$) we get:
\[\begin{equation} Cor(X,Y) = \frac{\frac{1}{n-1} \sum_{i=1}^{n} { (x_i -\bar{x})(y_i -\bar{y}) } }{\sqrt{Var(X)Var(Y)}} \end{equation} \tag{5}\label{eq:eq5} \]
Now let’s expand the variance ($Var[\cdot]$), using the sample variance, and the result is:
\[\begin{equation} Cor(X,Y) = \frac{\frac{1}{n-1} \sum_{i=1}^{n} { (x_i -\bar{x})(y_i -\bar{y}) } }{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n} { (x_i -\bar{x})^{2} }\frac{1}{n-1}\sum_{i=1}^{n} { (y_i -\bar{y})^{2} }}} \end{equation} \tag{6}\label{eq:eq6} \]
Note that we can take the $\frac{1}{n-1}$ out of the square root (the two factors under the root multiply to $\frac{1}{(n-1)^2}$, whose square root is $\frac{1}{n-1}$) and that it cancels with the same term in the numerator:
\[\begin{equation} Cor(X,Y) = \frac{\cancel{\frac{1}{n-1}} \sum_{i=1}^{n} { (x_i -\bar{x})(y_i -\bar{y}) } }{\cancel{\frac{1}{n-1}}\sqrt{\sum_{i=1}^{n} { (x_i -\bar{x})^{2} }\sum_{i=1}^{n} { (y_i -\bar{y})^{2} }}} \end{equation} \tag{7}\label{eq:eq7} \]
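To see that the $\frac{1}{n-1}$ factors really do not matter, the following sketch computes the correlation once via equation (6) and once via equation (7) and compares the two. The function names and data are hypothetical.

```python
import math


def correlation_eq6(x, y):
    """Equation (6): sample covariance divided by the root of the sample variances."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    return cov / math.sqrt(var_x * var_y)


def correlation_eq7(x, y):
    """Equation (7): the same ratio after cancelling the 1/(n-1) factors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x)
                    * sum((yi - y_bar) ** 2 for yi in y))
    return num / den


# Hypothetical paired measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(correlation_eq6(x, y))   # about 0.7746
print(correlation_eq7(x, y))   # same value up to floating-point rounding

# A perfectly linear relationship gives +1 (or -1 with a negative slope).
print(correlation_eq7(x, [2 * xi + 1 for xi in x]))
print(correlation_eq7(x, [-2 * xi + 1 for xi in x]))
```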
If we measure a stochastic process (a process whose outcome is considered random) at equal time intervals, we have one measurement per discrete time step t. For example, we check the Amazon stock every day at 4 pm for 100 days. Then we have a sequence of 100 numbers, which we call X. We denote the first measurement $X_0$, the measurement from the second day $X_1$, etc. If we model this as an array in Python and call this array ‘X’, then X[0] returns our first-day measurement.
Now we can compute the autocorrelation of this sequence. This means we compute the correlation of the sequence with a shifted version of itself. If our sequence is X = [1, 2, 3, 4, 5, 6, 7, 8, 9], we could compute the correlation between A = [1, 2, 3, 4, 5, 6] and B = [4, 5, 6, 7, 8, 9]. So A comprises the first 6 values of X and B all the values starting at index 3 (the first index is 0). In other words, B is shifted by 3. This shift is referred to as the ‘lag’, and in formulas it is often denoted $\tau$; in this article we write it as $t$.
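In Python, the two overlapping pieces of the example sequence can be obtained with slicing. This is just a sketch of the example above; the variable name `lag` is an illustrative choice, and B starting at index 3 corresponds to `lag = 3`.

```python
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lag = 3            # B starts at index 3
n = len(X)

A = X[:n - lag]    # the first n - lag values
B = X[lag:]        # everything from index lag onwards

print(A)  # [1, 2, 3, 4, 5, 6]
print(B)  # [4, 5, 6, 7, 8, 9]

# Element i of A is paired with element i of B, i.e. X[i] with X[i + lag].
pairs = list(zip(A, B))
print(pairs[0])   # (1, 4)
```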
Now we can write the formula for the autocorrelation:
\[\begin{equation} Autocor(X[i],X[i+t]) = \frac{\sum_{i=0}^{n-t-1} { (X[i] -\bar{X})(X[i+t] -\bar{X}) } }{\sqrt{\sum_{i=0}^{n-1} { (X[i] -\bar{X})^{2} }\sum_{i=0}^{n-1} { (X[i] -\bar{X})^{2} }}} \end{equation} \tag{8}\label{eq:eq8} \]
Note that in the above formula we used the mean of the entire sequence X, and that the indices start at 0, as in the Python array. This is how the statsmodels package computes the autocorrelation. It would also be valid to distinguish between the sequence and its shifted version, like this:
\[\begin{equation} Autocor(X[i],X[i+t]) = \frac{\sum_{i=0}^{n-t-1} { (X[i] -\bar{A})(X[i+t] -\bar{B}) } }{\sqrt{\sum_{i=0}^{n-t-1} { (X[i] -\bar{A})^{2} } { \sum_{i=t}^{n-1}(X[i] -\bar{B})^{2} }}} \end{equation} \tag{9}\label{eq:eq9} \]
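Here, A = X[0:n-t] and B = X[t:n] (Python slices, so the end index is excluded), and $\bar{A}$ and $\bar{B}$ are their respective means.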
This is how the pandas.Series.autocorr function computes it.
Here is some code to demonstrate this.
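The following is a minimal sketch in plain Python of both variants, equation (8) with the mean of the whole sequence and equation (9) with separate means for the two segments. The function names are hypothetical, and the code does not call statsmodels or pandas; it only re-implements the two formulas as stated above so the difference can be seen on the example sequence.

```python
import math


def autocorr_global_mean(X, t):
    """Equation (8): deviations are taken from the mean of the whole sequence."""
    n = len(X)
    if not 0 <= t < n:
        raise ValueError("lag t must satisfy 0 <= t < len(X)")
    x_bar = sum(X) / n
    num = sum((X[i] - x_bar) * (X[i + t] - x_bar) for i in range(n - t))
    # The square root of the squared sum in equation (8) is just the sum itself.
    den = sum((xi - x_bar) ** 2 for xi in X)
    return num / den


def autocorr_segment_means(X, t):
    """Equation (9): ordinary correlation between A = X[0:n-t] and B = X[t:n]."""
    n = len(X)
    if not 0 <= t < n:
        raise ValueError("lag t must satisfy 0 <= t < len(X)")
    A, B = X[:n - t], X[t:]
    a_bar = sum(A) / len(A)
    b_bar = sum(B) / len(B)
    num = sum((a - a_bar) * (b - b_bar) for a, b in zip(A, B))
    den = math.sqrt(sum((a - a_bar) ** 2 for a in A)
                    * sum((b - b_bar) ** 2 for b in B))
    return num / den


X = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(autocorr_global_mean(X, 3))     # 4/60, about 0.0667
print(autocorr_segment_means(X, 3))   # 1.0, since B is exactly A + 3
```

The two variants can differ a lot on short sequences with a trend, as this example shows: relative to their own means, A and B move in perfect lockstep, while relative to the overall mean most of A lies below it and most of B above it.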