# «Size distortions of tests of the null hypothesis of stationarity: evidence and implications for the PPP debate M. Caner a, L. Kilian b, c,* a ...»

After the first version of this paper was written, we became aware of a related paper by Rothman (1997) that makes a similar point. However, Rothman does not actually provide estimates of the size of the test. Moreover, his study was limited to the KPSS test with trend, and he narrowly focused on one AR(2) DGP with a root of 0.927, T=175 and l 10.

Sephton (1995) provides slightly more accurate critical values, but the differences are negligible for sample sizes in excess of 100.

M. Caner, L. Kilian / Journal of International Money and Finance 20 (2001) 639–657

We therefore follow the recommendation of Kwiatkowski et al. (1992) and Lee (1996) and choose a comparatively large value of l such that $l=\mathrm{int}[12(T/100)^{1/4}]$.

This choice tended to produce the most accurate test results in previous studies. For the LMC test we set p=1, since the size of the test is not sensitive to the lag order used. The data generating process is an AR(1) process with root r and NID(0,1) innovations. The sample size is $T \in \{100, 300, 600\}$. Table 1 shows the effective size of both the KPSS and the LMC test. Results for the model without trend are shown in Table 1a and those for the model with trend in Table 1b. We focus on the nominal 5% test. Qualitatively similar results are obtained at the nominal 10% level.
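The simulation design described above can be sketched in a few lines. The following is a minimal illustration for the KPSS test without trend only, not the authors' code: the function names, the draw of the first observation from the stationary distribution, the Bartlett-kernel long-run variance and the small default number of trials are all assumptions made for this sketch. The 5% critical value 0.463 is the asymptotic value tabulated by Kwiatkowski et al. (1992) for the level-stationary case.

```python
import math
import random

# Asymptotic 5% critical value for the model without trend (Kwiatkowski et al., 1992).
KPSS_LEVEL_CV_5PCT = 0.463


def lag_truncation(T):
    """Lag truncation l = int[12 (T/100)^(1/4)], the rule quoted in the text."""
    return int(12 * (T / 100) ** 0.25)


def simulate_ar1(T, r, rng):
    """AR(1) path with NID(0,1) innovations.

    Assumption: the first value is drawn from the stationary
    distribution, which requires |r| < 1.
    """
    y = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - r * r))
    path = [y]
    for _ in range(T - 1):
        y = r * y + rng.gauss(0.0, 1.0)
        path.append(y)
    return path


def kpss_level_stat(y, l):
    """KPSS statistic for the null of level stationarity."""
    T = len(y)
    mean = sum(y) / T
    e = [v - mean for v in y]          # residuals from a regression on a constant
    partial, eta = 0.0, 0.0
    for v in e:
        partial += v                   # partial sum of residuals
        eta += partial * partial
    eta /= T * T
    s2 = sum(v * v for v in e) / T     # long-run variance, Bartlett weights
    for s in range(1, l + 1):
        w = 1.0 - s / (l + 1.0)
        gamma = sum(e[t] * e[t - s] for t in range(s, T)) / T
        s2 += 2.0 * w * gamma
    return eta / s2


def rejection_rate(T, r, trials=500, seed=0):
    """Share of trials in which the nominal 5% test rejects stationarity."""
    rng = random.Random(seed)
    l = lag_truncation(T)
    rejections = 0
    for _ in range(trials):
        if kpss_level_stat(simulate_ar1(T, r, rng), l) > KPSS_LEVEL_CV_5PCT:
            rejections += 1
    return rejections / trials
```

Calling, say, `rejection_rate(100, 0.95)` and `rejection_rate(100, 0.0)` gives a rough impression of how the effective size moves with r. The estimates are noisy at 500 trials; the paper's tables are based on 20 000.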

Table 1a shows the rejection rate under the null hypothesis for a range of values of r from 0 to 0.99. It is evident that the size distortions for roots near unity are large and increasing, unlike the size results reported in Kwiatkowski et al. (1992), Leybourne and McCabe (1994) and Lee (1996).⁶ For example, for r=0.9 and T=100, the rejection rate of the nominal 5% LMC test based on conventional asymptotic critical values is 32%. For r=0.99 and T=100, the rejection rate rises to 74%. Even for T=600, the rejection rate may be as high as 70 (45, 14, 9)% for r=0.99 (0.98, 0.95, 0.9). Similarly, the KPSS test rejects the null hypothesis in up to 77% of all trials. Based on this evidence, one would expect both tests to reject the null hypothesis of stationarity far too often in small samples. Qualitatively similar results hold for the model with trend in Table 1b.

Table 1. Effective size (%) of the Leybourne–McCabe test and the KPSS test of the null hypothesis of stationarity using asymptotic critical values for the nominal 5% level. Based on 20 000 Monte Carlo trials and data generating process $y_t = r\,y_{t-1} + z_t$ with NID(0,1) innovations. $l=\mathrm{int}[12(T/100)^{1/4}]$, where int denotes the integer part. The asymptotic critical values are from Kwiatkowski et al. (1992).

(b) Model with trend

| r | T=100 KPSS | T=100 LMC | T=300 KPSS | T=300 LMC | T=600 KPSS | T=600 LMC |
|---|---|---|---|---|---|---|
| 0 | 3.6 | 5.4 | 4.4 | 5.4 | 4.6 | 5.2 |
| 0.5 | 5.1 | 6.2 | 6.0 | 5.7 | 6.0 | 5.4 |
| 0.7 | 7.3 | 7.1 | 7.9 | 6.0 | 7.9 | 5.8 |
| 0.8 | 9.8 | 8.9 | 11.2 | 6.4 | 10.6 | 5.8 |
| 0.9 | 18.6 | 21.4 | 23.5 | 7.4 | 21.7 | 6.4 |
| 0.95 | 29.5 | 38.0 | 45.4 | 17.9 | 45.5 | 8.9 |
| 0.98 | 38.8 | 49.7 | 69.1 | 46.3 | 76.7 | 32.1 |
| 0.99 | 40.7 | 51.9 | 76.5 | 57.9 | 87.9 | 54.3 |

⁶ We were unable to replicate the size results for the LMC test reported in Leybourne and McCabe (1994) even using the GAUSS code provided to us by Steve Leybourne. Even for processes with low persistence, we find much higher size distortions for the LMC test than originally reported.

It is of some practical interest to compare the performance of the LMC and the KPSS test. For the model without trend, the KPSS test tends to be almost uniformly more accurate than the LMC test for T=100; for T=300, the LMC test is more accurate, except for the most persistent processes; and for T=600, the LMC test is uniformly more accurate. The size of both tests improves with larger sample size, but only very slowly. Consistent with the theoretical results about the rate of convergence of the two tests, the size of the LMC test converges much more rapidly to its nominal level than that of the KPSS test. However, for the relevant range of r, severe size distortions persist even for T=600. For the model with trend in Table 1b, the size distortions of the LMC test tend to be considerably smaller than in Table 1a. Except for T=100, the LMC test is almost always more accurate than the KPSS test, often by a wide margin. The differences are most pronounced for larger sample sizes. However, even the LMC test has rejection rates of up to 58% for r=0.99 and T=300.

Table 1 also shows that for highly persistent stationary processes, the convergence of the size to its nominal level may be non-monotonic. As the sample size increases from T=100 to T=300, the effective size actually worsens in some cases. For T=600, the effective size improves relative to T=300, but may still be higher than for T=100. The degree of non-monotonicity is more pronounced for the KPSS test than for the LMC test.

We conclude that both tests have a strong tendency to spuriously reject the null hypothesis of stationarity for realistic values of r and T. The existence of such severe size distortions has not been previously documented in the literature. In applied work, rejections of the stationarity hypothesis based on asymptotic critical values have often been welcomed as strong evidence in favor of a unit root (and as a formal justification for pursuing cointegration tests for linear combinations of I(1) variables). Our results suggest that many of these findings are likely to have been spurious.

**4. Example: testing for long-run PPP in the Post-Bretton Woods era**

**4.1. Motivation**

More than 20 years after the breakdown of the Bretton Woods exchange rate system there still is considerable disagreement over the question of whether real exchange rates are mean-reverting (Froot and Rogoff, 1995; Rogoff, 1996). While most economists find some version of long-run purchasing power parity plausible and indeed well nigh indispensable in the construction of theoretical open economy macroeconomic models, statistical tests for the absence of mean reversion to date have yielded at best conflicting results. This makes it appealing to test directly the null hypothesis that real exchange rates are mean-reverting. A failure to reject this null hypothesis would not suffice to convince a skeptic of the existence of long-run PPP, but a rejection would be compelling evidence against long-run PPP.

Such PPP tests have been conducted for example by Baillie and Pecchenino (1991) to assess the validity of the building blocks of the monetary model of exchange rate determination for the UK and the US. Kuo and Mikkola (1999) conduct a similar analysis for long-run US–UK real exchange rates. However, their analysis has no direct implications for the post-Bretton Woods period. In work more closely related to ours, Culver and Papell (1999) observe that the failure to reject the null of stationarity for real exchange rates, together with evidence against the null of stationarity for nominal exchange rates for the same sample period, would constitute strong evidence of long-run PPP. Culver and Papell investigate the null hypothesis of stationary real exchange rates in the Post-Bretton Woods era using the KPSS test. For quarterly real exchange rate data, they conclude that the evidence against long-run PPP is mixed, with the KPSS test at the 5% critical value not rejecting the null of stationarity in most cases.

What makes the application of stationarity tests to real exchange rates problematic is the fact that the mean-reversion in real exchange rates is slow. Slow mean reversion does not contradict the view that long-run PPP holds. It is well known that theoretical models with intertemporal smoothing of consumption goods (Rogoff, 1992) or cross-country wealth redistribution effects (Obstfeld and Rogoff, 1995) imply highly persistent but transitory deviations from PPP. Thus, the relevant comparison involves a highly persistent stationary null and a unit root alternative, consistent with our claim in Section 3. One would expect that the accuracy of the test depends on the value of the dominant root under the null hypothesis. Since the extent of the size distortions increases with the persistence of the process under the null hypothesis, it is essential to obtain a sense of the degree of mean reversion under the null in order to assess the potential size distortions in applied work. It is useful to reparameterize this problem in terms of the half-life of the response of the real exchange rate to a shock.

There is a consensus view in the PPP literature about the half-life of the response of the real exchange rate to a shock. For example, Abuaf and Jorion (1990, p. 173) suggest a half-life of three to five years for the post-Bretton Woods era. Rogoff (1996, pp. 657–658) conjectures that deviations from PPP dampen out at the rate of about 15% per year. Froot and Rogoff (1995, p. 1645) consider a half-life of 3–5 years quite plausible. Recently, Murray and Papell (2000) have shown that the half-life may plausibly be even larger. In what follows, we will exploit the close link between the half-life and the value of the autoregressive root in an AR(1) model, r, to obtain a benchmark for plausible values of r. The half-life of the response of the process to a shock is defined as $h=i/f$, where f denotes the sampling frequency of the data (1/year for years; 4/year for quarters; 12/year for months, etc.) and i is defined by $r^i=0.5$. Under H0, the value of the root r of the AR(1) model

$$y_t = a + r\,y_{t-1} + z_t \qquad (5)$$

is a function of the half-life. For example, if the half-life of an innovation is five years under H0 and the data frequency is monthly, $r=0.5^{1/60}=0.9885$. For quarterly data, under the same assumptions, $r=0.5^{1/20}=0.9659$ and for annual data $r=0.5^{1/5}=0.8706$. For the null hypothesis of a half-life of three years, the corresponding values are $r=0.5^{1/36}=0.9809$ for monthly data, $r=0.5^{1/12}=0.9439$ for quarterly data, and $r=0.5^{1/3}=0.7937$ for annual data.
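The mapping between half-life and AR(1) root can be checked with a short calculation. The sketch below simply evaluates the relation r^i = 0.5 with i = h·f; the function names are illustrative and not taken from the paper.

```python
import math


def root_from_half_life(half_life_years, obs_per_year):
    """AR(1) root r solving r**i = 0.5, where i = half-life in periods."""
    periods = half_life_years * obs_per_year
    return 0.5 ** (1.0 / periods)


def half_life_from_root(r, obs_per_year):
    """Half-life in years implied by an AR(1) root 0 < r < 1."""
    periods = math.log(0.5) / math.log(r)
    return periods / obs_per_year


# Roots implied by half-lives of three and five years at each data frequency.
for years in (3, 5):
    for label, freq in (("monthly", 12), ("quarterly", 4), ("annual", 1)):
        r = root_from_half_life(years, freq)
        print(f"half-life {years}y, {label}: r = {r:.4f}")
```

Read the other way, and assuming that "about 15% per year" corresponds to an annual AR(1) root of 0.85 (an interpretation made here for illustration), `half_life_from_root(0.85, 1)` gives roughly 4.3 years, which falls inside the three-to-five-year range.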

Thus, to the extent that the real exchange rate is well approximated by an AR(1) process, the simulation results in Table 1a suggest that the LMC test will reject the I(0) null with about 70% probability for monthly and about 55% probability for quarterly data, if real exchange rates indeed are stationary with half-lives of about 3–5 years. For the KPSS test the corresponding rejection rates are about 60% for the monthly data and 30% for the quarterly data. Thus, it seems all but impossible to determine in practice whether the test correctly rejects the null in favor of a unit root or whether the rejection is simply due to size distortions. We will illustrate this point in the next section.

**4.2. Empirical analysis**

The real exchange rate data are constructed from the IMF’s International Financial Statistics data base on CD-ROM. They are based on the end-of-period nominal US dollar spot exchange rates and the US and foreign consumer price indices. The first data set comprises monthly data for 1973.1–1997.4 (292 observations) for 17 countries including Austria, Belgium, Canada, Denmark, Finland, France, Germany, Greece, Italy, Japan, The Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, and the United Kingdom. The second data set includes quarterly data for 1973.I–1997.II (98 observations) for the same 17 countries plus Australia, Ireland, and New Zealand.
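As an illustration of how such a series is typically assembled, the sketch below computes a log real exchange rate from a nominal spot rate and two consumer price indices. The paper does not state its exact formula, so the quoting convention (US dollars per unit of foreign currency), the use of logs and the function name are assumptions made for this example.

```python
import math


def log_real_exchange_rate(spot_usd_per_fx, cpi_foreign, cpi_us):
    """Log real exchange rate q_t = ln(S_t) + ln(P*_t) - ln(P_t).

    Assumed convention: S_t is the end-of-period price of one unit of
    foreign currency in US dollars, P*_t the foreign CPI, P_t the US CPI.
    """
    if not (len(spot_usd_per_fx) == len(cpi_foreign) == len(cpi_us)):
        raise ValueError("all three series must have the same length")
    return [
        math.log(s) + math.log(p_foreign) - math.log(p_us)
        for s, p_foreign, p_us in zip(spot_usd_per_fx, cpi_foreign, cpi_us)
    ]
```

Under this convention a rise in q is a real depreciation of the dollar. With the opposite quoting convention the sign of the first term flips, which leaves the stationarity properties of the series unchanged.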

We begin the analysis with the LMC test of the null hypothesis of stationarity. The lag orders for the ARIMA(p,1,1) model were selected using the Akaike Information Criterion (AIC). Our results are robust to alternative assumptions about the lag order.

Table 2 provides strong evidence against the null hypothesis of PPP for all countries at the monthly frequency and for all countries but Australia, New Zealand and Switzerland at the quarterly frequency. The apparent finding of a unit root in all 17 monthly and 17 of the 20 quarterly series is striking in that no other test to date has produced such strong results. If correct, these results would imply that most, if not all, real exchange rate processes contain important permanent components, implying a sharp reversal of the evidence in the literature and a direct rejection of long-run PPP.

It would be tempting to rationalize this result by appealing to theoretical explanations such as permanent changes in the relative productivity of the tradables and nontradables sector (Baumol and Bowen, 1966), permanent changes in the level of government spending (Froot and Rogoff, 1991; Alesina and Perotti, 1995), and systematic bias in CPI measurement (Shapiro and Wilcox, 1997). However, we know from Table 1 that such strong rejections are extremely likely a priori, even if the null hypothesis is true. Thus, the results of the LMC test are not informative. We cannot tell whether the test correctly rejects the null in favor of a unit root or whether the rejection is simply due to size distortions. Moreover, this problem is unlikely to be overcome by waiting for more data to accumulate. The size results in Table 1a suggest that even doubling the sample size for the monthly real exchange rate to about 600 observations would do little to improve the accuracy of the test.

Table 2. Testing for long-run purchasing power parity in the Post-Bretton Woods era

The second column of Table 2 shows the corresponding results for the KPSS test. Recall that our size results in Table 1a suggested that the KPSS test will tend to have lower size distortions than the LMC test for the sample sizes and degrees of persistence of relevance to the PPP debate. This fact is consistent with the observation that in Table 2 there are far fewer rejections for the KPSS test than for the LMC test. However, the number of rejections using the KPSS test is smaller (and that of the LMC test larger) than suggested by the simulation evidence.
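To make the comparison between observed and simulated rejection frequencies concrete, one can compute how many rejections would be expected among n series if each test rejected with the approximate probabilities quoted in Section 4.1 (about 70% and 55% for the LMC test, about 60% and 30% for the KPSS test, at the monthly and quarterly frequency respectively). The sketch below treats the series as independent, which is an assumption made purely for illustration and is unlikely to hold for dollar-based real exchange rates that share a common US component. The function names are hypothetical.

```python
import math


def expected_rejections(n_series, rejection_prob):
    """Expected number of rejections among n_series tests."""
    return n_series * rejection_prob


def prob_at_least(k, n_series, rejection_prob):
    """P(at least k rejections) under independence (binomial tail)."""
    p = rejection_prob
    return sum(
        math.comb(n_series, j) * p ** j * (1.0 - p) ** (n_series - j)
        for j in range(k, n_series + 1)
    )


# Approximate rejection probabilities quoted in the text for a true null.
scenarios = {
    "LMC, monthly (17 series)": (17, 0.70),
    "LMC, quarterly (20 series)": (20, 0.55),
    "KPSS, monthly (17 series)": (17, 0.60),
    "KPSS, quarterly (20 series)": (20, 0.30),
}
for label, (n, p) in scenarios.items():
    print(f"{label}: expected rejections = {expected_rejections(n, p):.1f}")
```

For example, `prob_at_least(17, 17, 0.70)` is the probability that all 17 monthly series reject under these assumptions. Positive cross-sectional dependence would make such unanimous outcomes more likely than the independent benchmark suggests.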