How Not to Analyze Data, part 3
Anthony Watts and Basil Copeland have collaborated on another post claiming to establish a connection between solar activity (as proxied by the sunspot cycle) and global temperature (as indicated by the HadCRUT3v data set). Let’s take a close look.
First, they smooth the temperature time series using a Hodrick-Prescott filter. It has a tuneable parameter which enables one to choose the amount of smoothing which the filter imposes; the parameter is somewhat analagous to the “cutoff frequency” in a low-pass filter, or the “characteristic time scale” in other smoothing methods. Watts/Copeland set the tuneable parameter to 7, which effectively smooths the data on a roughly 6-year time scale.
Hodrick-Prescott filtering is an outstanding smoothing method. It’s especially good near the edges of the time span, where many other methods show undesirable “edge effects.” Here, for example, is a comparison of using the H-P filter (labelled HP-7), to using a “running polynomial” smoothing (with a well-chosen weighting function) also on a roughly 6-year time scale (labelled GSR):
Clearly the two methods give nearly identical results except at the very edges of the time span, where the H-P filter performs better. Of course H-P filtering has its drawbacks too, including the facts that it requires an even time spacing and it’s not as useful as other methods for interpolating between the observation times, but for this analysis those aren’t anything to worry about.
They then take the first differences of the smoothed time series; that’s just the difference between a given value, and the preceding value. It’s an estimate of rate of change (i.e., the first time derivative) of the smoothed underlying signal. That’s fine too, except that one has to bear in mind that the values estimated by any smoothing method have an inherent uncertainty. Hence the difference between two consecutive values also has an inherent uncertainty. Here’s their plot of these estimated rates of change (click the graph for a larger, clearer view):
And here’s what I got using the exact same method:
So far, we’re on the same page. But now things start to go wrong. They take the peaks of these rate estimates, and assert that they tend to happen at the same time as the peaks of the sunspot cycle. But they don’t take all the peaks from the estimated rate-of-change plot; they include only the ones that fall near a peak in the sunspot cycle, which are labelled in their plot with numbers indicating which sunspot cycle they correspond to. It’s certainly suspect to include only the peaks that match one’s theory! Even so, there are only 3 peaks which they exclude, so we’ll note that there’s a problem with their methodology, but that it might not be too severe, and move on.
They compare the times of peak of the sunspot cycle to the times of peak warming rate in this graph (click the graph for a larger, clearer view):
They further note that the correlation between the times of sunspot peak and the times of warming-rate peak is a whopping 0.99846454. Now that is mighty impressive; it certainly deserves some exclamation points!!
But there are two problems with this graph. One is that one of the data points is wrong. Look at the points for cycle 11: they plot both sunspot cycle peak and warming-rate peak right on top of each other, happening in the same year, 1870. That’s the right year for the sunspot cycle peak, but not for the warming-rate peak; that doesn’t happen until 1877. Here’s a corrected graph with the erroneous (now fixed) point circled:
It’s curious that the single data point for which the difference between the time of the sunspot cycle peak and of the warming-rate peak is greatest, is the data point which is in error on their graph.
But there’s a far more serious problem. They’re correlating two data series which are both sorted into ascending order. That means that without even knowing what the numbers are, we can be sure that they’ll correlate strongly; as one series gets bigger so will the other, in quite strict relation because they’re two series in the same order. Let me illustrate: I generated 14 random numbers (one for each solar cycle peak) between 0 and 1 (following a uniform distribution), then sorted them into ascending order, then plotted solar-cycle peak year against the random numbers and computed the squared correlation. Here’s the result:
That’s a mighty impressive squared correlation; it deserves some exclamation points too!! Of course, it’s not really impressive when one realizes that two sorted series will always correlate, and that in this case the “impressive” correlation is between sunspot cycle peak years and random numbers. Consider also that this is the squared correlation; the simple correlation coefficient is 0.9941.
Can we apply a valid test to whether or not the sunspot cycle is synchronized with the peaks in the warming-rate calculation? Let’s take the times of peak of the sunspot cycle and compute the differences; these are estimates of the lengths of the sunspot cycle. We can also take the differences between the times of the warming-rate peaks as selected by Watts and Copeland, as estimates of the lengths of the warming-rate “cycle.” Then let’s compare the sunspot cycle lengths to the warming rate “cycle” lengths. If there’s really a correlation, then longer sunspot cycles should correspond to longer warming-rate “cycles.” Here’s the result:
The correlation between them is not statistically significant, but there are very few data points. More noteworthy is the fact that if there is a correlation, these data indicate that it’s going the wrong way; there’s evidence (but hardly conclusive) that longer sunspot cycles correspond to shorter warming-rate “cycles.”
How do we evaluate the analysis of Watts and Copeland? First, they omit three of the peaks in the warming-rate calculation, because they don’t correspond to their theory. Then they plot the wrong value for the cycle which has the biggest difference between sunspot cycle peak year and warming-rate peak year. Then they compute an impressive-looking correlation for two sorted data series, apparently not even realizing that it’s not much different from what you can get correlating sunspot cycle peak times with random numbers. Finally, a comparison of sunspot cycle lengths with warming-rate “cycle” lengths indicates no significant correlation, although the possible relation is in the wrong direction. All in all, the analysis of Watts and Copeland provides no evidence whatever of any relationship between HadCRUT3v global temperature data, and sunspot data.
In responding to a reader comment, Watts states, “I had a number of qualified people including a published climatologist, solar scientist, a statistician, and a certified consulting meteorologist (among others) look at this beforehand.” Which makes you wonder…
UPDATE UPDATE UPDATE
Readers have wondered, what’s a good test to determine whether or not there’s significant coincidence between the times of sunspot cycle maxima and the times of warming-rate maxima? There are many tests; one I’d like to mention is to compute Rayleigh’s R (see Upton & Fingleton 1989, Spatial Data Analysis by Example, vol. 2, Wiley, Chichester, UK).
We wish to know whether warming-rate maxima tend to occur at some particular time during the sunspot cycle. Each warming-rate maximum will be at some particular phase of the sunspot cycle. If it happens right at sunspot cycle maximum, it’s at phase 0. If it’s halfway from sunspot cycle maximum to the next sunspot cycle maximum, it’s at phase 0.5 for that cycle. If it’s a quarter of the way from one maximum to the next one, it’s at phase 0.25. Etc., etc., you get the idea.
To compute Rayleigh’s R, we first compute the phase \phi for each warming-rate maximum; all phases will fall between 0 and 1. Then we compute (x,y) coordinates for each warming-rate maximum by
x = \cos(2 \pi \phi),~~~ y = \sin(2 \pi \phi).
We can then plot the (x,y) values for each warming-rate maximum on the x-y plane; all these points will fall on the circle of radius 1 (the unit circle).
If all of the warming-rate maxima coincide exactly with the sunspot cycle maxima, then all of the phases will be 0. In that case, all of the warming-rate maxima will be at coordinates (1,0); that’s the point on the unit circle toward the right. If all the warming-rate maxima occur at the midpoint of the sunspot cycle, they’ll all have phase 0.5 and they’ll all have coordinates (-1,0); in this case they’re all 180^\circ out of phase with the sunspot cycle. The plot of (x,y) positions is like plotting points on a “clock” which marks out the sunspot cycle; toward the right (the “3 o’clock” position) is in phase with the sunspot cycle, toward the left (the “9 o’clock” position) is 180^\circ out of phase with the sunspot cycle, etc. etc. you get the idea. And as the sunspot cycle goes forward, the clock goes counter-clockwise.
If warming-rate maxima tend to occur at about the same time in the sunspot cycle, then the points when plotted will aggregate near a particular clock position. If they’re happening at the same time as sunspot cycle maximum, they’ll tend to aggregate near the 3 o’clock position; if they’re happening near the midpoint between sunspot cycle maxima they’ll tend to aggregate near the 9 o’clock position, etc. etc. you get the idea.
To test whether or not they’re aggregating, we compute the average x value and the average y value; call these \bar x and \bar y. We then compute Rayleigh’s R as:
R = \sqrt{\bar x^2 + \bar y^2}.
Suppose all the warming-rate maxima exactly coincide with sunspot cycle maxima. Then all the phases are 0, all the coordinates are (1,0), and the average coordinates are (1,0). In this case, Rayleigh’s R is R=1. Suppose they all fall at some other (but the same for all) phase in the sunspot cycle; again Rayleigh’s R will be R=1. But if they’re distributed randomly throughout the sunspot cycle, then the average coordinates will be approximately (0,0), and Rayleigh’s R will be small.
This is the basis for testing whether warming-rate maxima tend to fall at a particular time during the sunspot cycle using Rayleigh’s R. We compute R, compare it to a critical value, and if it’s greater than that critical value we have evidence that the distribution is not random; if it’s less than the critical value, then the available data are consistent with the hypothesis that they’re occuring at random times throughout the sunspot cycle.
Let’s do it! We’ll use the data according to the method of Watts and Coleman. There are 14 data points (although Watts and Coleman omit 3 of the warming-rate maxima, so we’ll do the same, which is cheating in their favor). Also, they only time the events to the nearest year, so some of the phases turn out to be exactly the same; for instance, two of the warming-rate maxima occur during the same year as sunspot-cycle maxima, so two of the points have phase zero and fall exactly at the 3 o’clock position.
The critical value for Rayleigh’s R, given a false-alarm probability p, applied to N data points, is approximately
R_{crit} \approx \sqrt{ - \ln(p) / N}.
This is only approximate and the sample is small, but as we’ll soon see the result isn’t anywhere near close, so we can ignore the exact calculation (which is generally done by Monte Carlo simulation). We’ll use a false-alarm probability of p = 0.05, which is equivalent to 95% confidence, and N=14 data points, so the critical value is approximately R_{crit} \approx 0.46. Here’s a plot of the results:
Three of the black dots represent two data points; they’re indicated by a “(x2)”. There’s also a very short red line, from the origin to the average coordinates (\bar x, \bar y). And there’s a blue circle indicating the radius corresponding to the approximate 95%-confidence critical value. If the end of the red line extended outside that circle, we’d have evidence that the warming-rate maxima are not randomly distributed, rather that they tend to aggregate at a particular phase of the sunspot cycle. If it were a close call, we’d need to compute the precise critical value for Rayleigh’s R for a sample size of 14 data points.
But it’s clearly not a close call. Rayleigh’s R for this data is R=0.097, the critical value is \approx 0.46. Like I said, not even close.
So in fact there’s no evidence that warming-rate maxima, computed according to the method of Watts and Coleman, tend to fall at any particular phase of the sunspot cycle. Not only do they fail to occur at the same time as sunspot cycle maximum (at phase zero), they don’t tend toward any particular phase of the sunspot cycle.
http://tamino.wordpress.com/2008/04/01/how-not-to-analyze-data-part-3/