## 1. Introduction

Although one could consider correlations at nonzero lags (i.e., between *a*_{i} and *b*_{i+τ}), we will consider only the case of a zero-lag sample correlation, which is given by

$$\hat{\rho}_{AB} = \frac{1}{N}\sum_{i=0}^{N-1}\frac{(a_i-\bar{a})(b_i-\bar{b})}{\hat{\sigma}_a\hat{\sigma}_b}, \tag{1}$$

where *N* is the length of the time series, and, for *x* = *a*, *b*, *x̄* is defined by

$$\bar{x} = \frac{1}{N}\sum_{i=0}^{N-1}x_i$$

and *σ̂*_{x} by

$$\hat{\sigma}_x = \left[\frac{1}{N}\sum_{i=0}^{N-1}(x_i-\bar{x})^2\right]^{1/2}.$$

Estimating the statistical significance of a correlation is simple when the data are Gaussian distributed and have no serial correlation (i.e., the autocorrelation is zero for nonzero lags). However, serial correlation is present in most geophysical time series. In addition, the serial correlation is often enhanced by the common practice of applying a low-pass filter to the data. (A low-pass filter reduces the temporal variability, thus increasing the lag-1 autocorrelation.) If unaccounted for, serial correlation would make the statistical tests less stringent. That is, one could incorrectly reject the null hypothesis, *ρ*_{AB} = 0, with a probability greater than the nominal significance level. Consequently, one needs a test that is relatively insensitive to the serial correlation.
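As an illustration of the filtering effect (our own sketch, not from the text): smoothing white noise with a short running mean sharply increases the lag-1 autocorrelation.

```python
# Illustrative sketch: a simple running-mean low-pass filter applied to
# white noise increases the lag-1 autocorrelation.
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)                       # white noise: lag-1 autocorr ~ 0
x_low = np.convolve(x, np.ones(5) / 5, mode="valid")  # 5-point running mean

print(lag1_autocorr(x))      # near 0
print(lag1_autocorr(x_low))  # near 0.8 (= 4/5 for a 5-point running mean)
```

For an *m*-point running mean of white noise, the lag-1 autocorrelation is (*m* − 1)/*m*, which is why even a short filter leaves a strongly serially correlated series.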

There are three common methods for dealing with time series that are serially correlated. One method is to subsample the data; that is, use only some of the data in order to reduce the temporal resolution. For example, the surface temperature shows a strong day-to-day correlation. However, if the temperature measurements were separated by one month, the autocorrelation would be much smaller and perhaps we could assume that the data are independent.

Subsampling the data has been used in meteorology; for example, one often assumes that data separated by one month are statistically independent. However, phenomena such as the Southern Oscillation or the quasi-biennial oscillation have much longer timescales. If we want to eliminate the autocorrelation due to the Southern Oscillation, then we may have to separate the samples by many years. Given the length of many meteorological time series, this approach is often impractical.

A second method uses the concept of an effective sample size: serial correlation reduces the number of effectively independent observations, so *N* in the significance test is replaced by a smaller *N*_{eff}. The effective sample size for the mean, *N*_{eff,ā}, can be estimated by

$$N_{\mathrm{eff},\bar{a}} = N\left[\sum_{\tau=-(N-1)}^{N-1}\left(1-\frac{|\tau|}{N}\right)\hat{\rho}(\tau)\right]^{-1}, \tag{2}$$

where *ρ̂*(*τ*), the sample autocorrelation at lag *τ*, |*τ*| < *N*, can be estimated by

$$\hat{\rho}(\tau) = \frac{1}{N\hat{\sigma}_a^2}\sum_{i=0}^{N-1-|\tau|}(a_i-\bar{a})(a_{i+|\tau|}-\bar{a}). \tag{3}$$
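The effective-sample-size estimate can be sketched as follows (a minimal illustration; the truncation lag and the AR(1) check are our choices, not the text's):

```python
# Sketch of the effective-sample-size estimate for the mean of an AR(1)
# series. For a true lag-1 autocorrelation beta, theory gives roughly
# N_eff ~ N (1 - beta) / (1 + beta).
import numpy as np

def sample_autocorr(a):
    """Sample autocorrelation rho_hat(tau) for tau = 0..N-1 (biased estimator)."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    n = len(a)
    acov = np.array([np.dot(a[: n - t], a[t:]) / n for t in range(n)])
    return acov / acov[0]

def n_eff(a, max_lag=None):
    """Effective sample size for the mean, truncating the lag sum at max_lag."""
    n = len(a)
    if max_lag is None:
        max_lag = n - 1
    rho = sample_autocorr(a)
    taus = np.arange(1, max_lag + 1)
    s = 1.0 + 2.0 * np.sum((1.0 - taus / n) * rho[1: max_lag + 1])
    return n / s

rng = np.random.default_rng(1)
beta, n = 0.5, 4096
x = np.empty(n)
x[0] = rng.standard_normal()
for i in range(1, n):
    x[i] = beta * x[i - 1] + rng.standard_normal()

print(n_eff(x, max_lag=50))  # roughly N (1 - beta) / (1 + beta) ~ 1365
```

Truncating the sum is necessary in practice: as the text goes on to note, the sampling variability of *ρ̂*(*τ*) makes the full-sum estimate unreliable.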

Thiébaux and Zwiers (1984) estimated *N*_{eff,ā} using Eqs. (1) and (3) for a large number of random series and found that effective sample size could not be estimated consistently. The variability in the sample autocorrelation [ Eq. (3)] caused the standard deviation of the estimated effective sample size to be larger than the effective sample size. These conclusions were consistent with Trenberth (1984), who also found difficulties in reliably estimating the autocorrelation. In addition, Thiébaux and Zwiers (1984) point out that the effective sample size should not be used in Student’s t-test to estimate the significance of the difference of two means. Consequently, they suggest that the effective sample size should be considered only as a diagnostic quantity.

Yet another method to deal with serial correlation is to model the data statistically and then to determine the distribution of the desired quantities from the statistical model. Katz (1982), for example, fit some climate-simulation data to an autoregressive model whose order was determined by a Bayesian information criterion. Katz then found the variance of the time means from the statistical model. Thus, he was able to determine if the time means had changed significantly in his simulations. This approach involves identifying the appropriate statistical model and estimating that model’s parameters.

The previous methods dealt with significance tests for the hypothesis that the correlation is zero. A related hypothesis is that the two series are statistically independent. Obviously a nonzero correlation implies statistical dependence but not vice versa. Nevertheless, the introduction would be incomplete without some mention of the two commonly used techniques.

The first method is to “prewhiten” the time series before computing the lag correlations. Prewhitening involves filtering the time series to remove the serial correlation. For example, suppose *x*_{i} was generated by an AR(1) process; that is, *x*_{i} = *β**x*_{i−1} + *ϵ*_{i}, where *ϵ*_{i} is a white noise forcing. Then applying a simple filter (*x*^{′}_{i}*x*_{i} − *β**x*_{i−1}) gives a new time series, *x*^{′}_{i}
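The prewhitening filter described above can be sketched as follows (the estimator for *β* is a standard lag-1 regression, an assumption on our part):

```python
# Minimal prewhitening sketch: remove AR(1) serial correlation by filtering
# x'_i = x_i - beta * x_{i-1}, with beta estimated from the data.
import numpy as np

def prewhiten_ar1(x):
    """Return the AR(1)-prewhitened series and the estimated beta."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    beta = np.dot(xm[:-1], xm[1:]) / np.dot(xm, xm)  # lag-1 regression estimate
    return x[1:] - beta * x[:-1], beta

rng = np.random.default_rng(2)
n, beta_true = 8192, 0.8
x = np.empty(n)
x[0] = rng.standard_normal()
for i in range(1, n):
    x[i] = beta_true * x[i - 1] + rng.standard_normal()

xw, beta_hat = prewhiten_ar1(x)
xm = xw - xw.mean()
lag1 = np.dot(xm[:-1], xm[1:]) / np.dot(xm, xm)
print(beta_hat)  # near 0.8
print(lag1)      # near 0: serial correlation removed
```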

Another method is to examine the significance of the squared coherence from a cross-spectral (bivariate spectral) analysis. The procedure gives the relationship between the two time series as a function of the frequency. However, sample size can be an issue as one would probably not attempt a cross-spectral analysis with 8, 16, or 32 samples. Even when *N* is large, such as in time series of 20 yr of global atmospheric analyses, one would expect fewer than 10 cycles of the quasi-biennial oscillation and even fewer warm El Niño events during that period. While *N* may be large, the phenomenon of interest may be undersampled. Both of these techniques are discussed in textbooks (e.g., Brockwell and Davis 1991).

## 2. “Random-phase” test

Consider two time series, A and B, which were chosen from population PA and population PB, respectively. The significance of the computed correlation between A and B, *ρ̂*_{AB}, could in principle be evaluated by drawing many pairs of random series from PA and PB and computing the correlation for each pair. If *α* × 100% of these randomly generated correlations had a magnitude greater than *ρ*_{crit}, then one could reject the null hypothesis at the *α* significance level whenever |*ρ̂*_{AB}| > *ρ*_{crit}. However, this test is usually impractical as PA and PB are often unknown.

When the distributions of PA and PB are unknown, resampling methods attempt to generate random time series that have similar properties to the members of PA and PB. For example, the “bootstrap” can be used to evaluate the significance of *ρ̂*_{AB} by creating random time series (*R*^{k}), which are formed by taking elements of A at random; that is, *r*^{k}_{i} = *a*_{j}, where *j*(*i*, *k*) is a random integer between 0 and *N* − 1, and *k* ranges between 1 and the number of resampled time series. The correlations between *R*^{k} and *B* can be used to determine *ρ*_{crit} as described in the previous paragraph.
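A minimal sketch of this bootstrap procedure (function names and sample counts are our own):

```python
# Bootstrap critical value: correlate series B with resampled-with-replacement
# versions of series A and take the (1 - alpha) quantile of |correlation|.
import numpy as np

def corr(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

def bootstrap_rho_crit(a, b, alpha=0.05, n_resample=2000, seed=0):
    """Critical |correlation| exceeded by alpha*100% of bootstrap correlations."""
    rng = np.random.default_rng(seed)
    n = len(a)
    rhos = np.empty(n_resample)
    for k in range(n_resample):
        r = a[rng.integers(0, n, size=n)]  # r^k_i = a_j, j random in [0, N-1]
        rhos[k] = corr(r, b)
    return np.quantile(np.abs(rhos), 1.0 - alpha)

rng = np.random.default_rng(3)
a, b = rng.standard_normal(100), rng.standard_normal(100)
rho_crit = bootstrap_rho_crit(a, b)
print(rho_crit)  # for independent white noise, near 1.96/sqrt(N)
```

Note that the resampling destroys any serial correlation in A, which is exactly the failure mode discussed next.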

The idea behind resampling techniques is to generate random series that have properties similar to the original series, and to derive a confidence interval using these random series. Resampling depends on preserving the important properties of the original series. Zwiers (1990) tested the procedure described in the previous paragraph and found that it worked well when the sampling assumptions were satisfied [“(a) all observations within a sample come from the same distribution; (b) observations are taken independently of each other; and (c) samples are taken independently of each other.”]. However, when the series were serially correlated, the observations were no longer independent of each other, which violated a sampling assumption. The serial correlation decreased the effective sample size [Eq. (2)] and consequently increased *ρ*_{crit}. The bootstrap’s critical value, however, was not influenced by the serial correlation, and Zwiers found that the test became more liberal with increasing serial correlation. This illustrates a problem with the bootstrap: the generated random series are not serially correlated, unlike many geophysical time series.

This paper considers another technique, which could be considered “resampling” in the frequency domain. This procedure does not preserve the distribution of values but rather the power spectrum (periodogram). The advantage of preserving the power spectrum is that each resampled series retains the same autocorrelation as the original series. The steps involved in this “random-phase” test are as follows.

1. Compute the discrete Fourier transform of the first series, *a*_{i}:

   $$\tilde{a}_k = \frac{2}{N(1+\delta_k)}\sum_{j=0}^{N-1}a_j\exp\left(-\frac{2\pi ijk}{N}\right), \qquad 0 \le k \le N/2,$$

   where *δ*_{k} = 0 except for *k* = 0, *N*/2 (even *N*), in which case *δ*_{k} = 1.

2. Let *r̃*_{0} = 0, *r̃*_{k} = |*ã*_{k}| exp(*iθ*_{k}) for 0 < *k* < *N*/2, and *r̃*_{N/2} = 2^{1/2}|*ã*_{N/2}| cos(*θ*_{N/2}) for even *N*, where *θ*_{k} is a uniform random variable from [0, 2*π*).

3. Compute the inverse Fourier transform of *r̃*_{k}:

   $$r_j = \operatorname{Re}\sum_{k=0}^{n}\tilde{r}_k\exp\left(\frac{2\pi ijk}{N}\right),$$

   where *n* = *N*/2 or (*N* − 1)/2 for even and odd *N*, respectively.
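The steps above can be sketched with NumPy's real FFT (a sketch under our own conventions, not the authors' code; `np.fft.rfft` returns exactly the *k* = 0, …, *N*/2 coefficients needed):

```python
# Random-phase surrogate: keep the Fourier magnitudes of `a`, randomize the
# phases, with the special handling of k = 0 and the Nyquist frequency.
import numpy as np

def random_phase_surrogate(a, rng):
    """Random series with the same periodogram as `a` (and zero mean)."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    fa = np.fft.rfft(a)                    # coefficients for k = 0..N//2
    mag = np.abs(fa)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=len(fa))
    fr = mag * np.exp(1j * theta)          # r~_k = |a~_k| exp(i theta_k)
    fr[0] = 0.0                            # r~_0 = 0: zero mean
    if n % 2 == 0:                         # Nyquist term must be real:
        fr[-1] = np.sqrt(2.0) * mag[-1] * np.cos(theta[-1])
    return np.fft.irfft(fr, n=n)

rng = np.random.default_rng(4)
a = np.cumsum(rng.standard_normal(256))    # strongly autocorrelated test series
r = random_phase_surrogate(a, rng)

# The surrogate preserves the power spectrum (hence the autocorrelation):
pa = np.abs(np.fft.rfft(a - a.mean()))**2
pr = np.abs(np.fft.rfft(r))**2
print(np.allclose(pa[1:-1], pr[1:-1]))     # True: interior frequencies match
```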

Step 2 creates a Fourier series with random phases and the same power spectrum as the original series. The two special cases are *k* = 0 and *k* = *N*/2. Setting *r̃*_{0} = 0 is simply setting the mean of *r*_{i} to zero, which is inconsequential for a correlation. For *k* = *N*/2, we are dealing with the Nyquist frequency. At this frequency, one can sample only Re (*ã*_{N/2}). If we assume Im(*ã*_{N/2})^{2} has the same expected value as Re(*ã*_{N/2})^{2}, then the expected value of |*ã*_{N/2}|^{2} is 2Re(*ã*_{N/2})^{2}, and the real part of the random-phase spectral component is *r̃*_{N/2} = 2^{1/2}|*ã*_{N/2}|cos(*θ*_{N/2}).

In step (3), the inverse Fourier transform gives *r*_{i}, a random series with the same autocorrelation as the original series. The main difference is that *r*_{i} has random phases in each of its Fourier modes.

It should be noted that this procedure is very similar to the process of creating “surrogate” data (Fraser 1989; Kaplan 1993; Theiler et al. 1993; Gershenfeld and Weigend 1993), which has been used to test for nonlinearities in time series. However, these studies used *r̃*_{N/2} = |*ã*_{N/2}|cos(*θ*_{N/2}) rather than 2^{1/2}|*ã*_{N/2}|cos(*θ*_{N/2}).

Once we have created a large number of random-phase series, we can calculate the significance in a manner similar to the bootstrap method. We take these random series and correlate them with the second series. (This step can be computed in the spectral domain for computational efficiency.) The critical value *ρ*_{crit} is chosen so that *α* × 100% of the correlations from the random-phase series have a magnitude greater than *ρ*_{crit}; one then rejects the null hypothesis, *ρ*_{AB} = 0, at the *α* significance level whenever |*ρ̂*_{AB}| > *ρ*_{crit}.
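Putting the pieces together, the whole test can be sketched as follows (our own illustration; the surrogate count, seed, and test data are arbitrary choices):

```python
# End-to-end sketch of the random-phase significance test: generate many
# phase-randomized copies of A, correlate each with B, and compare
# |rho_hat_AB| with the (1 - alpha) quantile of the null correlations.
import numpy as np

def random_phase_surrogate(a, rng):
    n = len(a)
    fa = np.fft.rfft(np.asarray(a, dtype=float))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=len(fa))
    fr = np.abs(fa) * np.exp(1j * theta)
    fr[0] = 0.0
    if n % 2 == 0:
        fr[-1] = np.sqrt(2.0) * np.abs(fa[-1]) * np.cos(theta[-1])
    return np.fft.irfft(fr, n=n)

def corr(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

def random_phase_test(a, b, alpha=0.05, n_surrogates=2000, seed=0):
    """Return (rho_hat_AB, rho_crit, reject) for H0: rho_AB = 0."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rho_ab = corr(a, b)
    null = np.array([corr(random_phase_surrogate(a, rng), b)
                     for _ in range(n_surrogates)])
    rho_crit = np.quantile(np.abs(null), 1.0 - alpha)
    return rho_ab, rho_crit, abs(rho_ab) > rho_crit

rng = np.random.default_rng(5)
a = rng.standard_normal(128)
b = 0.8 * a + 0.6 * rng.standard_normal(128)  # strongly correlated with a
print(random_phase_test(a, b))                # rejects H0
```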

The random-phase test is a nonparametric test, which attempts to create random time series that could have come from PA, the population from which A is a member. The random-phase test is inappropriate for data with strong seasonality. In such a situation, the “random” time series are obviously not from PA as the phases of “plausible” time series are not arbitrary.

## 3. Tests using AR(1) test data

The performance of the random-phase test needs to be evaluated. At a minimum, one would like an unbiased test; that is, the null hypothesis is rejected at a rate equal to the nominal significance level whenever the null hypothesis is true. To evaluate the bias of the test, random time series were created using AR(1) and AR(2) models (first- and second-order autoregressive models).

### a. Random-phase test

Pairs of time series were generated by two independent AR(1) models,

$$x_{i+1} = \beta x_i + \epsilon_i,$$

where *β* is the specified lag-1 autocorrelation, and *ϵ*_{i} is a normally distributed random variable with a zero mean and a variance of one. Since the two models are independent, the correlation between the time series has an expected value of zero.
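A sketch of generating such AR(1) test series (the burn-in length is our choice):

```python
# AR(1) test data: two independent series x_{i+1} = beta x_i + eps_i with
# standard-normal forcing; the stationary variance is 1 / (1 - beta^2).
import numpy as np

def ar1_series(n, beta, rng, burn=100):
    """AR(1) series of length n; `burn` initial steps discarded for spin-up."""
    x = 0.0
    out = np.empty(n)
    for i in range(-burn, n):
        x = beta * x + rng.standard_normal()
        if i >= 0:
            out[i] = x
    return out

rng = np.random.default_rng(6)
beta, n = 0.5, 20_000
a = ar1_series(n, beta, rng)
b = ar1_series(n, beta, rng)
print(np.var(a))                 # near 1 / (1 - 0.25) ~ 1.33
print(np.corrcoef(a, b)[0, 1])   # near 0: the series are independent
```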

We then subjected each pair of time series to a random-phase test and found the fraction of times the null hypothesis was rejected, that is, the rejection rate. Ideally, for the *α* significance level, we would reject the null hypothesis only *α* × 100% of the time. For each random-phase test, 2000 random-phase time series were used to generate *ρ*_{crit}.

Figures 1–4 show the rejection rates for the correlation of independent AR(1) series at the 0.1, 0.05, and 0.01 significance levels using the random-phase test. The rejection rates are close to expectation and are much better than those for the bootstrap method. It should be noted that the curves have some “sampling” error, and the errors for different values of *β* are correlated because the random number generator used the same initial seed.

In Figs. 1–4, the random-phase test is too liberal as *β* approaches one. This is caused by poorly resolving the low frequencies (periods greater than or on the order of the length of the time series). As *β* approaches one, the AR(1) series have more power in these poorly resolved or unresolved low frequencies. These unresolved low frequencies often appear as trends in the time series, and trends tend to (anti-) correlate with each other. The random-phase series, however, are periodic by construction and do not have trends except for very specific phases. Consequently, the magnitude of the correlation between randomly selected AR(1) series tends to be larger than for the random-phase series. Thus, testing against random-phase series produces a test with an inflated rejection rate. This hypothesis was supported by tests using sample data that had their linear trends removed. As *β* approached one, the test did not show as large a liberal bias (not shown).

### b. Conventional test

For comparison, we also applied a conventional test in which the data were modeled as an AR(1) process: *β* was estimated from each sample series, random series were generated from the fitted model, and *ρ*_{crit} was determined from their correlations. The parameter *β* was estimated by

$$\hat{\beta} = \frac{\sum_{i=0}^{N-2}(x_{i+1}-\bar{x})(x_i-\bar{x})}{\sum_{i=0}^{N-1}(x_i-\bar{x})^2}. \tag{4}$$

As seen in Figs. 1–4, the rejection rates for the random-phase and conventional tests are close to the nominal rates for large *N* and small to moderate *β*. While both tests suffer from a liberal bias as *β* approaches one, the bias was worse for the conventional test. The random-phase test appears to be relatively unbiased over a greater parameter range, especially for small *N*.

The conventional test did poorly when *N* was small or *β* approached one. The problems with the conventional test were caused by underestimating *β* [Eq. (4)], which can be approximated as

$$\hat{\beta} \approx \frac{\overline{x_{i+1}x_i} - \bar{x}^2}{\overline{x_i^2} - \bar{x}^2},$$

where the overbar denotes the mean value. Since *x*_{i+1} = *β**x*_{i} + *ϵ*_{i}, and *ϵ*_{i} is uncorrelated with *x*_{i}, the rhs of the previous equation can be approximated as

$$\hat{\beta} \approx \beta - (1-\beta)\frac{\bar{x}^2}{\overline{x_i^2} - \bar{x}^2}.$$

The previous equation indicates that *β* is underestimated when the sample mean (*x̄*) differs from the true mean (0). This difference between the sample and true means is expected to increase for smaller *N* or for *β* closer to one, which is consistent with the cases in which the conventional test did poorly.

The underestimation of *β* caused the random series to have a smaller autocorrelation than the original data. Consequently, the test rejected the null hypothesis, *ρ*_{AB} = 0, too frequently. Both the random-phase and conventional tests became more liberal as *β* approached one; however, the random-phase test did perform better in this situation.
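The underestimation is easy to reproduce in a small Monte Carlo sketch (trial counts and series lengths are our choices):

```python
# Monte Carlo sketch of the beta underestimation discussed above: for short
# AR(1) series the lag-1 estimator is biased low, and the bias grows as beta
# approaches one.
import numpy as np

def beta_hat(x):
    """Lag-1 estimate of beta from a sample series."""
    xm = x - x.mean()
    return np.dot(xm[:-1], xm[1:]) / np.dot(xm, xm)

def mean_beta_hat(beta, n, trials, rng):
    """Average estimated beta over many short AR(1) realizations."""
    est = np.empty(trials)
    for t in range(trials):
        x = np.empty(n)
        x[0] = rng.standard_normal() / np.sqrt(1.0 - beta**2)  # stationary start
        for i in range(1, n):
            x[i] = beta * x[i - 1] + rng.standard_normal()
        est[t] = beta_hat(x)
    return est.mean()

rng = np.random.default_rng(7)
print(mean_beta_hat(0.5, 16, 2000, rng))  # well below 0.5
print(mean_beta_hat(0.9, 16, 2000, rng))  # even further below 0.9
```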

## 4. Tests using AR(2) test data

The AR(2) model can be viewed as a finite-difference approximation to a damped oscillator driven by white noise (*ϵ*):

$$\frac{d^2\psi}{dt^2} + 2\lambda\frac{d\psi}{dt} + (\omega^2+\lambda^2)\psi = \epsilon.$$

When *ϵ* = 0, the solution is *ψ*(*t*) = *a* exp((*iω* − *λ*)*t*) + *b* exp((−*iω* − *λ*)*t*), where *a* and *b* are constants determined from the initial conditions. If we make a finite-difference approximation by replacing *d*^{2}*ψ*/*dt*^{2} by (*ψ*_{i+1} − 2*ψ*_{i} + *ψ*_{i−1})/Δ*t*^{2}, *dψ*/*dt* by (*ψ*_{i+1} − *ψ*_{i−1})/2Δ*t*, and letting Δ*t* = 1, we get

$$\psi_{i+1} = \frac{(2-\omega^2-\lambda^2)\psi_i - (1-\lambda)\psi_{i-1}}{1+\lambda} + \epsilon^{\prime}_i, \tag{5}$$

where *ϵ*^{′}_{i} = *ϵ*_{i}/(1 + *λ*). In Eq. (5), *ω* is the change in phase angle, and *λ* is the decay factor between adjacent data points. For our tests, the forcing *ϵ*_{i} was a normally distributed random variable with a zero mean.
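A sketch of generating such AR(2) series, assuming the finite-difference form *ψ*_{i+1} = [(2 − *ω*² − *λ*²)*ψ*_{i} − (1 − *λ*)*ψ*_{i−1} + *ϵ*_{i}]/(1 + *λ*) implied by the substitutions described above (burn-in and series length are our choices):

```python
# AR(2) test series: finite-difference damped oscillator with white-noise
# forcing, psi_{i+1} = [(2 - w^2 - l^2) psi_i - (1 - l) psi_{i-1} + eps] / (1 + l).
import numpy as np

def ar2_series(n, omega, lam, rng, burn=200):
    """Damped-oscillator AR(2) series of length n (burn-in discarded)."""
    c1 = (2.0 - omega**2 - lam**2) / (1.0 + lam)
    c2 = -(1.0 - lam) / (1.0 + lam)
    psi_prev, psi = 0.0, 0.0
    out = np.empty(n)
    for i in range(-burn, n):
        psi_next = c1 * psi + c2 * psi_prev + rng.standard_normal() / (1.0 + lam)
        psi_prev, psi = psi, psi_next
        if i >= 0:
            out[i] = psi
    return out

rng = np.random.default_rng(8)
x = ar2_series(4096, omega=np.pi / 6, lam=0.05, rng=rng)

# With weak damping the spectrum peaks near omega: locate the periodogram peak.
p = np.abs(np.fft.rfft(x - x.mean()))**2
k_peak = np.argmax(p[1:]) + 1
print(2 * np.pi * k_peak / len(x))  # near omega = pi/6
```

For small *λ* this reproduces the nearly sinusoidal series of Fig. 5; increasing *λ* broadens the spectral peak and removes the preferred timescale.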

In Fig. 5, we show some typical AR(2) time series with *ω* = *π*/6 (30° phase shift) and various values of *λ*. For presentation, the values have been normalized and offset in the vertical. For small values of *λ*, the time series are almost sinusoidal. For moderate values, the time series has a preferred timescale between adjacent extrema, and for large values, the series has no apparent preferred timescale.

To calculate the rejection rate, 1000 pairs of time series were generated by two independent AR(2) models and the correlation was calculated for each pair. Each of these 1000 correlations was subjected to a random-phase test that used 1000 random-phase time series. The critical correlation was computed, and the fraction of times the null hypothesis, *ρ*_{AB} = 0, was rejected is shown in Figs. 6 and 7. These two figures show the rejection rate as a function of *λ* and *N* (curves a–d) for *ω* = *π*/6. The rejection rate is close to the nominal significance level except when *λ*, the damping factor, is small. When *λ* is small, the rejection rate is too large, and this effect is more pronounced when *N* is also small (*λ* < 0.1 for *N* = 64, and *λ* < 0.4 for *N* = 16). This large rejection rate may be caused by spectral leakage. When *λ* is near zero, the sample time series is nearly sinusoidal and the true spectrum is a single narrow peak. The discrete Fourier transform, however, broadens the spectral peak (spectral leakage), which is more of a problem when *N* is small and the time series does not contain an integral number of periods (i.e., *N**ω*/2*π* is not an integer). This hypothesis is supported by Fig. 8, which shows that the rejection rate is close to the nominal value when *N**ω*/2*π* is an integer. When *N**ω*/2*π* < 1, the rejection rate was too high, which results from the period of the sinusoid being longer than the time series. As discussed in the AR(1) tests, unresolved low frequencies can increase the rejection rate.

We also computed the rejection rate for the correlation between an AR(1) series (*β* = 0.5) and AR(2) series (curve e). In both cases, the rejection rate is close to the nominal value. The test is unaffected by the spectral leakage in estimating the AR(2) power spectrum because we randomized the phase of the AR(1) series. It should be mentioned that this test is symmetric: the test for *a*_{i} and *b*_{i} is the same as the test for *b*_{i} and *a*_{i}, except for the variability introduced by different random phases. This result is apparent as the correlation between *b*_{i} and the random-phase series, *r*_{i}, can be written as

$$\hat{\rho}_{RB} \propto \sum_{k}|\tilde{a}_k|\,|\tilde{b}_k|\cos(\epsilon^{\prime}_k),$$

where *ϵ*^{′}_{k} is a uniform random variable from [0, 2*π*). From the above equation, we can infer that the test is sensitive to spectral leakage only if it affects both time series.

## 5. Summary

In summary, we suggested a nonparametric test for the significance of a nonzero correlation between two time series when serial correlation is a concern. The statistical test is based on generating a large number of random series with the same power spectrum (periodogram) as the first series but with random phases in the Fourier modes. The statistical significance of the original correlation is found by comparing this correlation to the distribution of correlations between the random-phase series and the second original series.

The performance of this test was evaluated by generating independent time series and comparing the rejection rate, the fraction of the time that the *ρ*_{AB} = 0 hypothesis was rejected, to the nominal significance level. With time series created by an AR(1) model, the rejection rates were close to the expected values and were relatively insensitive to the length of the time series. The rejection rate was too large when trends were dominant and significant power was in the unresolved low frequencies. These results compared favorably to a conventional test in which the data were statistically modeled as an AR(1) process.

In the AR(2) tests (stochastically forced, damped oscillator), the rejection rates for the random-phase test were close to the expected rates except when *λ*, the damping, became small. The effect of small *λ* was more pronounced on the smaller *N* cases. We attributed this problem to spectral leakage, which made the periodogram a poor estimate of the spectral power.

The advantage of a random-phase test is its simplicity. It is similar to the bootstrap in that we create random samples that have similar properties to the original data. Instead of preserving the distribution of the original data, we preserve the autocorrelation of the data. The remainder of the test is similar to the bootstrap. Compared with a conventional test using AR(1) data, the random-phase test showed less bias in the rejection rates and avoided the task of identifying the appropriate statistical model for the test.

## Acknowledgments

This work was inspired by a pair of lectures given by Dr. R. Livezey on statistical evaluation of forecasts and general circulation models. Discussions with Dr. F. Zwiers and comments from two anonymous reviewers were also very helpful.

## REFERENCES

Brockwell, P. J., and R. A. Davis, 1991: *Time Series: Theory and Methods.* Springer-Verlag, 577 pp.

Davis, R. E., 1976: Predictability of sea surface temperatures and sea level pressure anomalies over the North Pacific Ocean. *J. Phys. Oceanogr.,* **6,** 249–266.

Fraser, A. M., 1989: Reconstructing attractors from scalar time series: A comparison of singular system and redundancy criteria. *Physica D,* **34,** 391–404.

Gershenfeld, N. A., and A. S. Weigend, 1993: The future of time series: Learning and understanding. *Time Series Prediction: Forecasting the Future and Understanding the Past,* A. S. Weigend and N. A. Gershenfeld, Eds., Addison-Wesley, 1–70.

Kaplan, D. T., 1993: A geometrical statistic for detecting deterministic dynamics (data sets A, B, C, D). *Time Series Prediction: Forecasting the Future and Understanding the Past,* A. S. Weigend and N. A. Gershenfeld, Eds., Addison-Wesley, 415–428.

Katz, R. W., 1982: Statistical evaluation of climate experiments with general circulation models: A parametric time series modeling approach. *J. Atmos. Sci.,* **39,** 1446–1455.

Theiler, J., P. S. Linsay, and D. M. Rubin, 1993: Detecting nonlinearity in data with long coherence times (data set E). *Time Series Prediction: Forecasting the Future and Understanding the Past,* A. S. Weigend and N. A. Gershenfeld, Eds., Addison-Wesley, 429–455.

Thiébaux, H. J., and F. W. Zwiers, 1984: The interpretation and estimation of effective sample size. *J. Climate Appl. Meteor.,* **23,** 800–811.

——, and M. A. Pedder, 1987: *Spatial Objective Analysis: With Applications in Atmospheric Science.* Academic Press, 299 pp.

Trenberth, K. E., 1984: Some effects of finite sample size and persistence on meteorological statistics. Part I: Autocorrelations. *Mon. Wea. Rev.,* **112,** 2359–2368.

Zwiers, F. W., 1990: The effect of serial correlation on statistical inferences made with resampling procedures. *J. Climate,* **3,** 1452–1461.