# Size distortions of tests of the null hypothesis of stationarity: evidence and implications for the PPP debate

M. Caner, L. Kilian

It is interesting to compare our findings to the results of previous studies. For our choice of l, Culver and Papell (1999) reject the null of stationarity at the quarterly frequency for Australia, Ireland, and Japan.7 Using our updated sample, we obtain the same rejections for the quarterly data plus Canada and Switzerland. In contrast, at the monthly frequency, the KPSS test rejects the null of stationarity for seven of 17 countries (Austria, Canada, Greece, Italy, Portugal, Spain, Switzerland). This pattern is consistent with the evidence of increasing size distortions, as ρ and T are increased, for both the LMC and the KPSS test (Table 1a).

7 Using smaller values of l, they reject the null of stationarity for 10 of their 17 countries. This result is consistent with evidence in Kwiatkowski et al. (1992) and Lee (1996) that small values of l tend to result in spurious rejections of the null hypothesis.

650 M. Caner, L. Kilian / Journal of International Money and Finance 20 (2001) 639–657

Table 2 (continued)

a All real exchange rates are constructed from IFS CD-ROM data on consumer prices and end-of-period US$ exchange rates. Monthly data are for 1973.1–1997.4 (292 observations). p̂ refers to the Akaike Information Criterion (AIC) lag order estimate of the ARIMA(p,1,1) model. Lag orders are constrained to lie between 0 and 8. l = int[12(T/100)^(1/4)], where int denotes the integer part. For the DF-GLS test the lag order is selected using a sequential t-test with an upper bound of p=12. At the 5 (10)% significance level, the asymptotic critical value for the LMC and the KPSS test is 0.463 (0.347). The finite-sample critical values for the DF-GLS test are −2.086 (−1.760). ** (*) denotes a rejection at the 5 (10)% level.

b All real exchange rates are constructed from IFS CD-ROM data on consumer prices and end-of-period US$ exchange rates. Quarterly data are for 1973.I–1997.II (98 observations). p̂ refers to the AIC lag order estimate of the ARIMA(p,1,1) model. Lag orders are constrained to lie between 0 and 3 for the quarterly data. l = int[12(T/100)^(1/4)], where int denotes the integer part. For the DF-GLS test the lag order is selected using a sequential t-test with an upper bound of p=8. At the 5 (10)% significance level, the asymptotic critical value for the LMC and the KPSS test is 0.463 (0.347). The finite-sample critical values for the DF-GLS test are −2.308 (−1.964). ** (*) denotes a rejection at the 5 (10)% level.
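The bandwidth rule l = int[12(T/100)^(1/4)] quoted in the table notes is easy to verify directly; a minimal sketch (the function name is ours):

```python
# Bandwidth rule l = int[12 (T/100)^(1/4)] from the Table 2 notes.
def bandwidth(T):
    return int(12 * (T / 100) ** 0.25)

print(bandwidth(292))  # monthly sample: 15
print(bandwidth(98))   # quarterly sample: 11
```

For the monthly (T=292) and quarterly (T=98) samples the rule yields l=15 and l=11, respectively.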

As in the case of the LMC test, the observed rejections of the stationarity null by the KPSS test are not informative, given the size distortions of the KPSS test based on asymptotic critical values. It is quite possible that the observed rejections are spurious. This view is supported by test results for the asymptotically efficient DF-GLS test of the unit root hypothesis. We focus on this test because its power compares favorably to standard ADF tests (Elliott et al., 1996; Cheung and Lai, 1998).

The DF-GLS test for the case of unknown mean is based on the following regression:

Δy^d_t = φ0 y^d_{t−1} + φ1 Δy^d_{t−1} + … + φp Δy^d_{t−p} + e_t,

where y^d_t = y_t − b̂ z_t is the locally demeaned series, with z_t = 1 and b̂ being the least-squares regression coefficient of ỹ_t on z̃_t, the latter being defined by ỹ = [y_1, (1−ρ̄L)y_2, …, (1−ρ̄L)y_T] and z̃ = [z_1, (1−ρ̄L)z_2, …, (1−ρ̄L)z_T], with ρ̄ = 1 + c̄/T. The DF-GLS test statistic is given by the t-ratio for testing H0 : φ0 = 0 against the one-sided alternative H1 : φ0 < 0. Implementation of the test requires the choice of the parameter c̄. We follow Elliott et al.'s recommendation and set c̄ = −7.
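The GLS demeaning and ADF regression described above can be sketched in a few lines of numpy; this is our own illustration of the Elliott–Rothenberg–Stock procedure, not the authors' code, and the lag order p is taken as given here rather than chosen by a sequential t-test:

```python
import numpy as np

def dfgls(y, p, cbar=-7.0):
    """Sketch of the DF-GLS t-statistic, unknown-mean case (ERS, 1996)."""
    y = np.asarray(y, float)
    T = len(y)
    rbar = 1.0 + cbar / T                 # local-to-unity GLS parameter
    # Quasi-difference y_t and z_t = 1 at rbar.
    yq = np.r_[y[0], y[1:] - rbar * y[:-1]]
    zq = np.r_[1.0, (1.0 - rbar) * np.ones(T - 1)]
    bhat = (zq @ yq) / (zq @ zq)          # GLS estimate of the mean
    yd = y - bhat                         # locally demeaned series
    dy = np.diff(yd)
    # ADF regression without deterministics on the demeaned data:
    # dy_t = phi0 * yd_{t-1} + sum_j phi_j * dy_{t-j} + e_t
    X = [yd[p:-1]]
    X += [dy[p - j:-j] for j in range(1, p + 1)]
    X = np.column_stack(X)
    dep = dy[p:]
    beta, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ beta
    s2 = resid @ resid / (len(dep) - X.shape[1])
    se0 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return beta[0] / se0                  # t-ratio on phi0
```

For serially uncorrelated stationary data the statistic is strongly negative, while near-unit-root data yield values close to the critical values reported in the table notes.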

Rather than using the asymptotic critical values for the DF-GLS test, we rely on approximate finite-sample critical values. Finite-sample critical values under the unit root null hypothesis may be obtained by simulation as described in Elliott et al. (1996). We depart from that procedure in that we allow for some serial correlation under the null hypothesis. We postulate an ARIMA(0,1,1) model, consistent with the assumptions of the LMC and KPSS tests, with θ = 0.25. This specification accounts for the presence of a small nonzero MA(1) component in the growth rates of many economic time series, including real exchange rates (Froot and Rogoff, 1996; Lothian and Taylor, 1996; Canzoneri et al., 1999; Engel and Kim, 1999). In fitting the ADF model and in calculating the finite-sample critical values, we use sequential t-tests with upper bounds of eight autoregressive lags in the quarterly case and 12 lags in the monthly case. Thus, our critical values allow for lag order uncertainty.
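The critical-value simulation just described can be sketched generically: draw from the ARIMA(0,1,1) null with NID errors and take empirical quantiles of the test statistic. Here `stat_fn` stands for whichever statistic is being calibrated; the function and parameter names are our own:

```python
import numpy as np

def critical_values(stat_fn, T, theta=0.25, reps=2000,
                    alphas=(0.05, 0.10), seed=0):
    # Simulate the null (1-L)y_t = (1-theta*L)e_t with NID errors
    # and return lower-tail critical values of stat_fn.
    rng = np.random.default_rng(seed)
    stats = np.empty(reps)
    for i in range(reps):
        e = rng.standard_normal(T + 1)
        dy = e[1:] - theta * e[:-1]   # MA(1) growth rates
        y = np.cumsum(dy)             # integrate to a unit-root level
        stats[i] = stat_fn(y)
    return {a: np.quantile(stats, a) for a in alphas}
```

The same skeleton, with the lag order re-selected in each replication, delivers critical values that allow for lag order uncertainty as described above.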

For the same data set, for which the LMC test (and to a lesser extent the KPSS test) finds strong evidence against stationarity, Table 2 shows that the 5 (10)% DF-GLS test rejects the unit root null hypothesis for 3 (0) of the 17 countries for which monthly data are available and for 11 (15) of the 20 countries for which quarterly data are available.8 In fact, for several countries the test results for the DF-GLS test directly contradict those for the stationarity tests. For example, for Greece and Italy the KPSS test rejects stationarity at the monthly frequency, yet the DF-GLS test rejects the unit root null hypothesis. No such contradictions occur at the quarterly frequency. For the LMC test the contradictions are more numerous and include three countries for the monthly data and 14 countries for the quarterly data.

Of course, it is possible that some of these contradictions are driven by size distortions of the DF-GLS test. Investigating that possibility is beyond the scope of this paper. More importantly, even if it could be shown that these apparent contradictions disappear as the critical values of the DF-GLS test are adjusted, the fact remains that the KPSS and LMC tests have a tendency to reject the null of long-run PPP, even when it is true.

8 Cheung and Lai (1998) analyze three of the 17 monthly real exchange rates used in our study using the same DF-GLS test. They are able to reject the unit root hypothesis for the UK, France, and Germany. We obtain similar results for France and Germany, but not for the UK. The difference in results is driven by the sample period.

Our example of the problems in interpreting the results of stationarity tests is not an isolated case. The apparent tendency of the KPSS and LMC tests to reject the null hypothesis of stationarity in empirical work has been noted by other researchers.

For example, Cheung and Chinn (1997, p. 71) reported rejections of the null hypothesis of trend stationarity for quarterly US GNP at the 1% level based on asymptotic critical values. Several other researchers remarked on the decisive nature of their evidence against stationarity. Our evidence that the LMC and KPSS tests suffer from severe size distortions provides a plausible explanation of the source of these ‘strong’ rejections of stationarity. The next section will investigate the extent to which the use of size-adjusted critical values can overcome the problems of interpreting the results of stationarity tests.

**5. Size-adjusted power under economically plausible assumptions**

One might conjecture that the size distortions of stationarity tests could be overcome easily by the use of appropriately adjusted ﬁnite sample critical values. Indeed, that strategy has been pursued in recent papers by Cheung and Chinn (1997), Rothman (1997) and Kuo and Mikkola (1999). However, as this section will illustrate, such corrections inevitably result in a dramatic loss of power.

Under the null hypothesis H0 : σ²_η = 0, the local levels model underlying the LMC and KPSS tests reduces to a stationary process. Thus, it is straightforward to construct finite-sample critical values from an approximating stationary AR(p) model. Note that finite-sample critical values derived from such a parametric model violate the nonparametric spirit of the KPSS test, but are consistent with the parametric assumptions of the LMC test. We nevertheless will examine the performance of both tests using size-adjusted critical values.

The power of stationarity tests based on size-adjusted critical values will clearly depend on the persistence of the process under the null. Thus, we know that power may be arbitrarily low in general. The only interesting question is what the power of the test will be for models under the null hypothesis that are economically plausible. One appealing way of parameterizing the persistence of the process under the null is to appeal to the emerging consensus in the PPP literature about the value of the half-life of shocks to the real exchange rate. By deﬁnition, models that are consistent with this consensus must be considered economically plausible. As shown in Section 4.1, in the context of the AR(1) model, the half-life consensus translates into roots between 0.986 and 0.989 for monthly data and between 0.944 and 0.981 for quarterly data. We therefore calculate size-adjusted critical values under the hypothetical assumption that the true process is a stationary AR(1) process with degrees of persistence corresponding to the upper and lower bound of the half-life consensus.
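Under the AR(1) model, a half-life of h periods pins down the root via ρ^h = 1/2, i.e. ρ = 0.5^(1/h). A quick illustration of this textbook mapping (our own sketch; it reproduces the quarterly three-year bound of 0.944 and the monthly five-year bound of 0.989, although the remaining bounds quoted above rest on the paper's Section 4.1 calculations):

```python
# AR(1) root implied by a half-life of h periods: rho**h = 0.5.
def implied_root(h):
    return 0.5 ** (1.0 / h)

print(round(implied_root(12), 3))   # 3-year half-life in quarters: 0.944
print(round(implied_root(60), 3))   # 5-year half-life in months: 0.989
```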

Our approach differs from the usual approach of bootstrapping the model under the null as implemented, for example, by Kuo and Mikkola (1999). Note that here we are interested not in finding the statistically most plausible model of the data generating process under the null of stationarity (which is what the bootstrap approach aims to do), but in assessing the potential power of the test under economically plausible assumptions about the speed of convergence to PPP under the null.

We focus on T=100 and T=300. These are representative sample sizes for quarterly and monthly real exchange rate data under the recent float. The size-adjusted critical values for this AR(1) process differ greatly from their asymptotic counterparts. For example, for a half-life of three years and quarterly data, the 5% critical value for the KPSS (LMC) test is 0.698 (5.893), compared with the asymptotic critical value of 0.463. For monthly data these values rise to 1.438 and 17.210, respectively. For a half-life of five years and quarterly data, the size-adjusted critical values are 0.749 for the KPSS test and 7.271 for the LMC test; for monthly data and a half-life of five years we obtain critical values of 1.590 for the KPSS test and 21.426 for the LMC test.9 In Table 3, we use our critical values to compute the size-adjusted power of the KPSS and LMC tests against the ARIMA(0,1,1) alternative with θ = 0.25. The latter process is the same process used to construct the critical values for the DF-GLS test.
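For reference, the KPSS level-stationarity statistic whose size-adjusted power is tabulated here can be computed as below; this is our own compact sketch following Kwiatkowski et al. (1992), not the authors' code:

```python
import numpy as np

def kpss_level(y, l):
    # KPSS level-stationarity statistic with Bartlett-kernel
    # long-run variance and bandwidth l.
    y = np.asarray(y, float)
    T = len(y)
    e = y - y.mean()                 # residuals from regression on a constant
    S = np.cumsum(e)                 # partial sums of residuals
    s2 = e @ e / T                   # variance plus weighted autocovariances
    for j in range(1, l + 1):
        w = 1.0 - j / (l + 1.0)      # Bartlett weight
        s2 += 2.0 * w * (e[j:] @ e[:-j]) / T
    return (S @ S) / (T ** 2 * s2)
```

The statistic is compared with an upper-tail critical value: values exceeding the (asymptotic or size-adjusted) critical value reject the null of level stationarity.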

Table 3 suggests three conclusions. First, the size-adjusted power may be as low as 20% for the KPSS test and as low as 22% for the LMC test. This means that only in the rare case of a rejection of the null hypothesis will the test shed light on the question of whether long-run PPP holds or not. In the absence of a rejection, test results based on finite-sample critical values are not going to be informative. Second, for a given test, size-adjusted power is generally higher for monthly data than for quarterly data. Third, the LMC test tends to have higher size-adjusted power than the KPSS test for the same process. For example, for monthly data and a half-life of five (three) years under the null, the LMC test detects the unit root process with probability 26.8% (42.5%), compared with 21.7% (31.7%) for the KPSS test. Of course, this result may be an artifact of the parametric nature of the data generating process, which favors the LMC test.

Table 3
Power (in %) of the LMC and KPSS tests of the null hypothesis of level stationarity based on size-adjusted critical values (based on 20 000 trials from (1−L)y_t = (1−θL)ε_t with θ = 0.25 and NID errors; power is calculated using size-corrected critical values under the hypothesized AR(1) null)

9 It is important to note that these size-adjusted critical values differ from the conventional finite-sample critical values used, for example, by Culver and Papell. The latter are derived under the counterfactual assumption that the data generating process is white noise. While they allow for the estimation of higher-order models and adjust the critical values for the sample size, they do not adjust for the persistence of the process under the stationarity null. Thus, they are closer in spirit to the asymptotic critical values suggested by Kwiatkowski et al. (1992) than to our critical values.

Clearly, if the true process for the real exchange rate is stationary, it need not follow an AR(1) process. Nevertheless, the size-corrected critical values based on the AR(1) model provide a useful benchmark. For example, if we had used these finite-sample critical values for a half-life of five years in Table 2, none of the test statistics for the monthly data would have been significant, and only the Japanese test statistic for the quarterly data. For a half-life of three years, both tests would have rejected stationarity for the Japanese quarterly data, but only the LMC test for the Japanese monthly data. For no other country would stationarity have been rejected. Of course, given the low power of the tests for economically plausible models, this outcome is not unexpected.

Since stationarity tests are almost as likely not to reject the null because of low power as they are not to reject because stationarity indeed holds, the fact that rejections rarely occur cannot be interpreted as convincing evidence in favor of long-run PPP. Thus, there is little value added from conducting stationarity tests with size-adjusted critical values, except in rare cases like Japan, for which there is some evidence against long-run PPP. This thought experiment suggests that tests of the null hypothesis of stationarity will tend to be useful only for sample sizes much larger than those used in the PPP debate. While the further theoretical development of such tests continues at a rapid pace, their usefulness for applied work is open to question.

**6. Conclusions**