
Section on Survey Research Methods – JSM 2010

# Innovative Imputation Techniques Designed for the Agricultural Resource Management Survey

Michael W. Robbins∗, Sujit K. Ghosh†, Joshua D. Habiger‡

Abstract

The Agricultural Resource Management Survey (ARMS) is a high dimensional, complex economic survey which suffers from item non-response. Here, we introduce methods of varying complexity for imputation in this survey. The methods include stratified mean imputation, the approximate Bayesian bootstrap, and non-iterative and iterative sequential regression. Iterative sequential regression is a form of Markov chain Monte Carlo (MCMC) that is unique in that it allows for flexible selection of conditional distributions while utilizing joint modeling. Each of the regression procedures requires data-driven transformations that allow for the implementation of a conditional multivariate normal model.

Key Words: Missing Data, Imputation, ARMS, Markov Chain Monte Carlo

**1. Introduction**

In this paper we consider imputation methods which are applicable to the US Department of Agriculture's (USDA) Agricultural Resource Management Survey (ARMS). The ARMS is a multi-phase survey, containing about 35,000 records of roughly 1,000 variables, that is administered annually by the National Agricultural Statistics Service (NASS) and the Economic Research Service (ERS), both agencies of the USDA. The current imputation method which NASS uses on the ARMS is an outdated form of mean imputation which distorts several data attributes. Our goal is to develop a procedure that will maintain all distributional characteristics of the complete data, had there been no missing values.

The ARMS is the USDA’s primary source of information on the ﬁnancial condition, production practices, and resource use of farms, as well as the economic well-being of the nation’s farm households. The scope of the information collected in the ARMS is too large to be further paraphrased here — to quote National Research Council (2008), “No other source aﬀords such a comprehensive view of the American farm.” The ARMS data are indispensable to federal and private sector decision makers when considering policies and programs or business strategies relating to the farm sector.

The complete survey is administered in three phases, and here we concentrate on imputation in the third phase (ARMS III). ARMS III typically has 3-5 versions which are administered in total to about 35,000 farm operations over the contiguous United States.

The Panel to Review the USDA’s Agricultural Resource Management Survey was established in 2006 and was chaired by Bruce Gardner; its ﬁndings are outlined in National Research Council (2008). This reference also provides a detailed overview of the ARMS, as well as the survey design and processing. The research discussed in this paper is the result of the Panel’s recommendations.

∗ National Institute of Statistical Sciences, Research Triangle Park, NC 27709–4006
† Department of Statistics, North Carolina State University, Raleigh, NC 27695–8203
‡ Department of Statistics, Oklahoma State University, Stillwater, OK 74078–1056


Miller et al. (2010) provide a good outline of the ARMS and its data characteristics as well as a discussion on particular survey aspects that make imputation in ARMS a particularly challenging problem. Here, we paraphrase these challenges.

Due to the large number of ARMS data users, it is essential that no data characteristics (e.g., means, variances, covariances) be distorted by the imputation process.

The large number of variables within the survey makes it particularly difficult to preserve all variable relationships throughout the imputation process. Likewise, it is difficult to preserve the confounding marginal structure of ARMS variables throughout the imputation process. For instance, Miller et al. (2010) note that most ARMS variables are mixed discrete/continuous in distribution. That is, these variables contain a portion of zeros and the remaining portion has a positive continuous density. A skew normal density (Azzalini, 1985) often fits the log of the positive portions. All values which require imputation are known to be positive.

We continue by introducing imputation methods which are applicable to ARMS.

In Section 2 we outline methods that utilize stratification, including the current NASS method. In Section 3 we outline transformation techniques which will be required in order to utilize regression methods. In Section 4 we outline a non-iterative regression technique which we call sequential regression. In Section 5 we introduce iterative sequential regression, which is a type of Markov chain Monte Carlo (MCMC). Section 6 offers some concluding thoughts.

**2. Imputation via Stratification**

The current NASS imputation procedure involves stratiﬁcation. Hence, the imputation model used may be described as a 3-factor ANOVA table with interaction eﬀects, where the three factors are: 1) Farm Type, 2) Region, and 3) Sales Class.

The data are grouped into cells (or strata), where each cell contains all observations that share the same value of each of the three factors. If a specific observation has a missing value for a specific variable, all observations of that variable in the corresponding cell with a positive and observed value make up the donor pool.

NASS requires that a donor pool has 10 or more values, and if that requirement is not met, fallback groupings are used in order to broaden/merge the cells and to thereby expand the donor pool. See Banker (2007) for an ordered list of the fallback groups, as well as a more detailed description of the NASS and ERS imputation processes. Observed values that are determined to be outliers are excluded from the process.
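The cell-and-fallback logic can be sketched as follows. This is an illustration, not NASS code: the column names, the `donor_pool` helper, and the single fallback shown (merging cells across Sales Class) are assumptions for the sketch; Banker (2007) gives the actual ordered list of fallback groupings.

```python
import pandas as pd

MIN_POOL = 10  # NASS minimum donor-pool size

def donor_pool(df, var, farm_type, region, sales_class):
    """Return the observed positive values of `var` in the cell defined by
    the three stratification factors; if the pool is too small, fall back
    to a coarser grouping (here, merging across sales class)."""
    cell = df[(df.farm_type == farm_type) & (df.region == region) &
              (df.sales_class == sales_class)]
    pool = cell[var][cell[var] > 0].dropna()
    if len(pool) < MIN_POOL:  # fallback: broaden the cell
        cell = df[(df.farm_type == farm_type) & (df.region == region)]
        pool = cell[var][cell[var] > 0].dropna()
    return pool
```

In practice several fallback levels would be tried in order, and outlying observed values would be screened out of the pool before it is used.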

**2.1 Conditional Mean Imputation**

The current NASS method employs conditional mean imputation. For this method, the impute for each missing value is taken as the mean of the values within the donor pool corresponding to that specific observation and variable.

The drawbacks of this method are numerous. Most noticeably, conditional mean imputation is well known to distort marginal variable characteristics, primarily by causing a downward bias in classical estimates of variance (see Little and Rubin, 2002; Schafer and Graham, 2002; Fichman and Cummings, 2003; Newman, 2003, among others).
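The downward bias in variance estimation is easy to demonstrate with a small simulation (a generic illustration on simulated data, not ARMS data):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=50.0, scale=10.0, size=1000)
miss = rng.random(1000) < 0.3          # 30% missing completely at random
observed = y[~miss]

# Mean imputation: every missing value becomes the observed mean.
imputed = y.copy()
imputed[miss] = observed.mean()

# The variance of the mean-imputed data is biased downward, by a factor
# of roughly (1 - missing rate) relative to the observed-data variance.
print(observed.var(ddof=1), imputed.var(ddof=1))
```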


**2.2 Approximate Bayesian Bootstrap Imputation**

The most obvious improvement over conditional mean imputation is a method that imputes with a random draw from a conditional distribution, as opposed to the mean of that distribution. Doing so should alleviate the downward bias in variance estimation. However, proper simulation from the true posterior distribution within each cell is infeasible, since the small number of observations within cells makes it difficult to determine appropriate distributional assumptions. It may be more feasible to impute using a draw from the observed values within that cell.

Approximate Bayesian bootstrap (ABB) imputation (Rubin and Schenker, 1986; Kim, 2002) accomplishes just that. For this method, donor pools are determined in the same fashion as in the current NASS method. Assume that the kth cell corresponding to the jth variable contains nj,k positive observed values and mj,k missing values. The set of positive values (the donor pool) is denoted Aj,k. ABB imputations are generated in two steps:

1. Draw nj,k values with replacement from Aj,k to form a bootstrap donor pool A∗j,k.
2. Draw the mj,k imputations with replacement from A∗j,k.

ABB imputation is not thought to be proper in the Bayesian sense. Kim (2002) notes that this method induces bias into variance estimates found using multiple imputation. However, it does provide a simple method that should show certain improvements over the current mean imputation procedure.
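The two ABB steps (resample the donor pool with replacement, then draw the imputations from the resampled pool) can be sketched as follows; `abb_impute` is a hypothetical helper name:

```python
import numpy as np

def abb_impute(donors, n_missing, rng):
    """Approximate Bayesian bootstrap (Rubin and Schenker, 1986):
    step 1 resamples the donor pool with replacement to form a bootstrap
    pool; step 2 draws the imputations with replacement from that pool."""
    donors = np.asarray(donors)
    boot_pool = rng.choice(donors, size=len(donors), replace=True)  # step 1
    return rng.choice(boot_pool, size=n_missing, replace=True)      # step 2

rng = np.random.default_rng(1)
imputes = abb_impute([2.0, 3.5, 4.1, 7.2, 9.0], n_missing=3, rng=rng)
```

The extra resampling step is what distinguishes ABB from a plain hot deck: it propagates uncertainty about the donor distribution itself, which matters when the imputations are used for multiple imputation.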

**3. Transformation Techniques**

In order to integrate sophisticated multivariate models into the imputation scheme, we abandon the stratiﬁed approach and consider linear modeling. For our purposes, this will require normality assumptions, so we now consider transformation techniques that will achieve approximate joint normality.

**3.1 Adjusting for the Mixed Variables**

We adjust for the mixed nature of certain variables by using the following. Assume that Yj, the jth variable, represents a mixed-continuous variable. We break down Yj into two variables, Bj and Yj∗, where

Bj = 0 if Yj = 0, and Bj = 1 if Yj > 0 or Yj is missing, (1)

and Yj∗ contains the positive portion of Yj. Since all values which require imputation are known to be positive, Bj is fully observed even when Yj is not.

Table 1: The process of breaking down a mixed variable (Yj) into a fully-observed binary variable (Bj) and a positive continuous variable (Yj∗).
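A minimal sketch of this breakdown, assuming missing values are coded as NaN and (per the convention above) are known to be positive; `split_mixed` is a hypothetical helper name:

```python
import numpy as np

def split_mixed(y):
    """Split a mixed variable into a fully-observed binary indicator B and
    a positive part Y*. Missing values (NaN) are known to be positive, so
    B = 1 where y is positive or missing, and B = 0 where y == 0."""
    y = np.asarray(y, dtype=float)
    b = np.where(np.isnan(y) | (y > 0), 1, 0)
    y_star = np.where(b == 1, y, np.nan)  # positive part; NaN where y == 0
    return b, y_star
```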

**3.2 Transformation of Positive Portions of Variables**

We now consider the marginal distributions of the Yj∗'s. As mentioned previously, the skew normal density often fits the log of the positive portions. A skew normal density contains three parameters: a location parameter (ξ), a scale parameter (ω) and a shape parameter (α). The jth variable will have its own skew normal parameter set, which we denote {ξj, ωj, αj}. If these parameters are known, then skew normal data may easily be transformed into standard normal data. Let F(y|ξj, ωj, αj), y ∈ ℜ, represent the cumulative distribution function (cdf) of the skew normal variate log Yj. If we define

Tj(y) = Φ−1(F(y|ξj, ωj, αj)), (2)

then Tj(log Yj) ∼ N(0, 1), where Φ(·) represents the standard normal cdf. Since the values of ξj, ωj, and αj are unknown for each relevant j, we use MLEs found using available data. An inverse of this transformation may also be easily applied. We refer to the transformation in (2) as an "SN transformation".
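The SN transformation in (2) and its inverse can be sketched with `scipy.stats.skewnorm` (whose shape/loc/scale arguments correspond to α, ξ, ω); the helper names are assumptions for the sketch:

```python
import numpy as np
from scipy import stats

def sn_transform(y_pos):
    """SN transformation (2): fit a skew normal to log(y) by maximum
    likelihood, then map through its cdf and the standard normal quantile
    function, so the result is approximately N(0, 1)."""
    logy = np.log(y_pos)
    alpha, xi, omega = stats.skewnorm.fit(logy)  # MLEs of (alpha, xi, omega)
    x = stats.norm.ppf(stats.skewnorm.cdf(logy, alpha, xi, omega))
    return x, (alpha, xi, omega)

def sn_inverse(x, params):
    """Inverse SN transformation: back to the original positive scale."""
    alpha, xi, omega = params
    return np.exp(stats.skewnorm.ppf(stats.norm.cdf(x), alpha, xi, omega))
```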

For the jth variable (which may or may not have missing values) we will consider one of three possible transformations to create the transformed variables Xj:

**4. Sequential Regression**

One notable drawback of the stratified approach is that covariates must be categorical. Inclusion of additional covariates would likely result in having far too many empty cells. In order to incorporate more covariates (in particular, those which are continuous) into the imputation model, we must abandon the stratified approach and utilize regression techniques.


We continue with our specific notation, which is in accordance with the notation introduced in Section 3. Our imputation methods are run jointly on a block of variables. Of the variables in this block, we assume that r are mixed variables and have missing values. These are denoted Y1,..., Yr. We also have q fully-observed mixed variables, denoted Yr+1,..., Yr+q, and a set of fully observed discrete or continuous variables which are denoted Z. We let p = r + q represent the total number of mixed variables. Of course, as indicated at the end of Section 3, our methods will be applied to the corresponding X1,..., Xp. For our purposes, each of these X's has missing values, and thereby, in hopes of achieving a near-monotone missingness structure, they are indexed so that they are increasing in missingness (i.e., X1 is the variable with the fewest missing values). We let B = {B1;...; Bp} and χ = {Z; B; X1;...; Xp}, and for completeness, we write Xj = {x1j,..., xnj}t and Yj = {y1j,..., ynj}t for each j, where n represents the total number of observations.
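The indexing by increasing missingness can be sketched as follows, assuming the block of variables is held in a pandas DataFrame with NaN for missing values; `order_by_missingness` is a hypothetical helper name:

```python
import pandas as pd

def order_by_missingness(X):
    """Reorder columns so X1 has the fewest missing values and Xp the most,
    in hopes of achieving a near-monotone missingness structure."""
    counts = X.isna().sum()
    return X[counts.sort_values(kind="stable").index]
```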

We now introduce a class of regression procedures that will create imputations for the missing values in the p variables. These procedures are akin to the predictive mean matching technique analyzed in Horton and Lipsitz (2001) and the SRMI technique of Raghunathan et al. (2001) (the initialization step, to be specific). We will refer to these methods as sequential regression (SR). SR techniques are motivated by the fact that the joint distribution of X1, X2,..., Xp given Z and B can be factored into a sequence of conditional distributions as follows:

P(X1,..., Xp | Z, B) = P(X1 | Z, B) P(X2 | Z, B, X1) · · · P(Xp | Z, B, X1,..., Xp−1).

**4.1 SR2∗ and SR3∗**

We let B−j = {B1;...; Bj−1; Bj+1;...; Bp}, where the Bj's are defined in (1). We assume that, for j = 1,..., p, Xj follows a normal linear regression model (6) with parameter set θj, in which the covariates are Z, B−j, and X1,..., Xj−1.

We will find the imputed vector, X̂j, sequentially for j = 1,..., p. The first step in imputing for Xj is to draw values of regression parameters that will be used to create the imputations. Assuming the model in (6), we let θ̃j represent a draw from the posterior distribution of θj found using formulas of the form in Little and Rubin (2002), p. 114. The covariate matrix contains X1,..., Xj−1 (each of which has missing values), but the sequential nature of this procedure allows us to use the imputed versions of these variables instead. Since the response variable, Xj, also contains missing values, we include only observations which have an observed value of Xj when calculating the posterior distribution.
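One step of this procedure can be sketched as follows. This is an illustrative sketch under a standard noninformative prior (in the spirit of Little and Rubin, 2002, p. 114), not the authors' exact formulas; `sr_impute_step` and its interface are assumptions:

```python
import numpy as np

def sr_impute_step(W, x, rng):
    """One sequential-regression step: regress the observed part of x on the
    (completed) covariate matrix W, draw (beta, sigma^2) from the posterior
    under a noninformative prior, and impute the missing entries with draws
    from the resulting predictive distribution."""
    obs = ~np.isnan(x)
    Wo, xo = W[obs], x[obs]
    n, k = Wo.shape
    beta_hat, *_ = np.linalg.lstsq(Wo, xo, rcond=None)
    resid = xo - Wo @ beta_hat
    s2 = resid @ resid / (n - k)
    sigma2 = (n - k) * s2 / rng.chisquare(n - k)       # draw sigma^2
    cov = sigma2 * np.linalg.inv(Wo.T @ Wo)
    beta = rng.multivariate_normal(beta_hat, cov)       # draw beta
    x = x.copy()
    x[~obs] = W[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), (~obs).sum())
    return x
```

Looping this step over j = 1,..., p, with each imputed X̂j appended to the covariate matrix for the next step, gives the non-iterative SR procedure.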


Most robust procedures (Spiess and Keller, 1999; Little and An, 2004; Von Hippel, 2007) follow the SR scheme we have outlined; however, in order to draw proper imputations using an SR technique, the missingness structure must be monotone.

If the missingness is not monotone, it is possible, for example, that a certain unit has a missing value for X1 whereas X2,..., Xp are observed. In this case, the imputed value of X1 would be sampled from P(X1|Z, B) when the SR technique is used. Doing so may disrupt the relationships (as gauged using the imputed dataset) between X1 and Xj for j = 2,..., p. In order to avoid such a disruption, we must sample X1 from P(X1|Z, B, X2,..., Xp). Also, under non-monotone missingness it is difficult to obtain unbiased draws of regression parameters using the SR technique, since the covariate matrix used to obtain such draws often contains imputed values (and, as we just mentioned, these imputed values may be improperly sampled).
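Whether a missingness pattern is monotone is easy to check once the columns are ordered by increasing missingness; `is_monotone` is a hypothetical helper name:

```python
import numpy as np

def is_monotone(R):
    """R[i, j] = 1 if unit i observes variable j, with columns ordered by
    increasing missingness. The pattern is monotone if, whenever a variable
    is missing for a unit, all later variables are missing for it as well,
    i.e., each row is nonincreasing."""
    R = np.asarray(R, dtype=int)
    return bool(np.all(np.diff(R, axis=1) <= 0))
```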