# «EMPIRICAL GROUND-MOTION MODELS FOR PROBABILISTIC SEISMIC HAZARD ANALYSIS: A GRAPHICAL MODEL PERSPECTIVE Kumulative Dissertation zur Erlangung des ...»

Since the dataset was originally compiled to reconstruct ground-shaking from recent-historical earthquakes (Allen et al., 2009) using USGS ShakeMap methodology (Wald et al., 1999), the two horizontal components are combined by taking the larger horizontal component.

**Almost all published GMMs make the assumption that ground motion is log-normally distributed. We follow this assumption and introduce a new variable, Z, which is the natural logarithm of our target variable:**

**lnPGA:**

Z= (5.6) We compose our model as a multilevel (or hierarchical) model (Gelman and Hill, 2007). The different levels allow to take into account grouped data. There are three levels in our model: the Ground Motion Model Setup station level (index s), corresponding to all records recorded at the same station; the earthquake level (index e), comprising all records from the same earthquake; and the region level (index r), where all records, stations and earthquakes of the different regions are grouped. The intersection of the earthquake and station level represents the record of the eth earthquake recorded at the sth station. This setup allows to consider the correlation of records from the same earthquake or same station, which is usually taken into account by using an appropriate regression technique, such as a one-step or two-step regression (Joyner and Boore, 1993, 1994) or a random effects algorithm (Abrahamson and Youngs, 1992). There exist also techniques to deal with inter-station variability (e.g. Chen and Tsai, 2002), though it is rarely considered in published GMMs. A multilevel model can be thought of as conceptually similar to a two-step regression with the ability of easily adding extra complexity.

** Figure 5.4 depicts the graphical model corresponding to our GMM.**

Here the three levels are represented by the three plates (rectangular shapes). The outer plate (loop over r from 1 to Nregion ) corresponds to the different regions (cf. Figure 5.3). It embraces an individual GMM with individual parameter vectors ar, br and cr, which denote the parameters of the earthquake, record and station level, respectively. However, even though there are distinct parameters for each region, these are connected. Each regional parameter is sampled from a corresponding global normal

**distribution:**

r ∼ N ( ; ); (5.7) where r ∈ {ar ; br ; cr }. Thus, the parameters in each region are connected by global hyperparameters a, b, c, a, b and c. The width of the global distribution is a measure of the regional differences of the parameters. The global hyperparameters and allow the model partially pool the data from the different regions together such that data from all regions is used to estimate the coefﬁcients in one individual region, but with different weight. For a very brief introduction to the ideas of multilevel modeling, see the Appendix. For more details, see Gelman and Hill (2007).

As mentioned above, there is an individual GMM for each region r inside the region plate of Figure 5.4. The graphical model corresponding to the individual GMMs can be thought of as a conceptual model of the data generating process. This is explained subsequently in more detail.

**The concept of the model is as follows:**

The central node for the individual GMMs of each region r in Figure 5.4 is Zesr, which is the eth earthquake recorded at the sth station and corresponds to an observation of the the target variable.

Zesr is distributed according to a normal distribution with mean Z;esr and standard deviation r.

The observation mean value Z;esr is the sum of an event term, a station term and a record term, as shown in eq. (5.9). The event term Eer is common to all records from the same earthquake e and is itself distributed according to a normal distribution with mean E;er and standard deviation r.

Coorespondingly, the station term is the same for each record from the same station. In principle, Ground Motion Model Setup

** Figure 5.4: Graphical model for the global multilevel ground motion model.**

Ground Motion Model Setup it would be possible to assume that the station term is also a normally distributed random variable.

However, estimation of the standard deviation of this distribution requires stations with multiple recordings, which are not abundant in our dataset. Hence, we assume that the station term is not a random variable, but a constant (strictly speaking, we assume that its standard deviation is zero).

r and r are the within-event and between-event standard deviations, respectively.

The means of the event, record and station terms are functions of parameters and the predictor

**variables:**

We have settled on the following functional forms for f, g and h. These are based on geophysical considerations and generalization error determined by 10-fold cross-validation (e.g. Kuehn et al., 2009; Hastie et al., 2001).

FR and FN are dummy variables taking the value one for reverse and normal faulting, respectively, and zero otherwise. SA and SS are dummy variables equaling one for stiff soil and soft soil, respectively, and zero otherwise. H(x) is the Heavyside-function which equals one for x ≥ H and zero for x H.

We have settled for a trilinear magnitude scaling instead of a quadratic magnitude term since it results in a slightly lower generalization error and allows more control over the magnitude scaling.

It also effectively decouples the large magnitude scaling from the small magnitude scaling. The ln RRup;esr + bP, is chosen because P term for the geometrical spreading, (bH;r + bI;r ∗ MW;er ) ∗ P;r the results of the initial regression are more stable than when using a magnitude term inside the root, similar to Campbell and Bozorgnia (2008), and it gives a lower generalization error. We include an anelastic attenuation term, bQ;r ∗ RRup;esr, since the maximum distance in the dataset is 400 km (cf. Figure 5.2).

QH As one can see, VS and the focal mechanism F are displayed as stochastic nodes in Figure 5.4.

This is due to the fact that some of these values are missing, i.e. there are some stations without QH an associated VS value and some earthquakes with unknown focal mechanism. We treat these unknown values simply as parameters - they are assigned a prior distribution, which is updated QH by the likelihood resulting in a posterior distribution for each unknown VS or F value. This is Ground Motion Model Setup

a convenient side effect of the model, which thus provides a principled way to deal with missing data.

In Figure 5.4, prior distributions are needed for those parameters that are displayed as a stochastic node (i.e. as a circular node) that have no parents. In our case, these are the parameters of the global parameter distributions, i.e. the global hyperparameters a, b, c, a, b and c. The speciﬁcation of the prior distributions used in the present work is explained below.

5.5.1 Prior Distributions

parameters such as average stress drop or quality factor QH. Therefore, our strategy for specifying

**prior distributions is as follows:**

1. Specify prior distributions on the stress drop, QH and the slope of the geometric attenuation.

2. Generate a synthetic dataset from these parameters using stochastic simulations (Boore, 2003).

3. Regress the function of eqs. (5.14) to (5.15) on the synthetic dataset.

4. Take the estimated coefﬁcients as prior means for the ’s For the stress drop, we assume a truncated normal distribution between 10 and 100 bar with mean 40 bar and standard deviation 20 bar, based on Allmann and Shearer (2008). The prior distribution for QH is a truncated normal distribution between 10 and 1000, with mean 250 and standard deviation 50. We also set a prior distribution on the slope of the geometrical spreading, which is a truncated normal distribution between -1.5 and -0.8 with mean -1 and standard deviation 0.2.

The above values are consistent when equivalent stochastic model parameters are determined for several GMMs from shallow active tectonic regions using the method of Scherbaum et al. (2006).

The stochastic simulations are carried out with the program SMSIM (Boore, 2005).

The prior standard deviations of the ’s are chosen so that the distributions are very wide. This allows the data to play a dominant role. Here, the standard deviations of the ’s are chosen to yield a ration between standard deviation and mean of 10.

The above mentioned approach can be used to determine prior distributions for the parameters a0, a1, a2, a3, b0, b1, b2 and b3, as well as and . For a4 and a5, which describe the scaling of the ground motion intensity parameters with focal mechanism, we use Table III from Bommer et al. (2003). The site effects parameter c1 and c2 are assigned distributions based on guesses by the authors. These are loosely oriented on published GMMs as well as the work of Dobry et al. (2000). The means of the between-event and within standard deviations are also assigned based on published values. The standard deviations of all global distributions are assigned a broad, noninformative uniform distribution between 0 and 20. In Table 5.4 the prior distributions for the mean values are shown.

QH As described in section 5.5, we also treat missing VS and F values as unknown parameters, QH which are assigned a prior distribution. VS as well as F are categorical variables with three states, and we use a uniform prior over the three states for both of them.

**5.6 Results**

In the previous sections, we have speciﬁed the model, the prior distribution and the dataset. Using Bayes’ rule, we can now estimate the posterior distribution of the parameters given the data, r(¢ r(¢ D). As described before, the model is too complicated to estimate D) analytically, so we resort to approximate inference, using MCMC sampling to obtain samples from the posterior distribution of each parameter. From the sequence of samples we can compute several summary statistics, such as means, standard deviations, quantiles and so on. The histogram of the sampled values serves as an approximation to the posterior probability density function.

Results 1.5

** Figure 5.5: Histogram of sampled values (i.**

e. approximation of the posterior distribution) for the parameter a1.

r( In Figure 5.5, the approximate posterior distribution of the parameter a1, a1 D), is displayed. The histogram shows clearly that a1 can only be determined with considerable uncertainty. This uncertainty should not be neglected, thus highlighting the beneﬁt of the Bayesian approach. Since there are too many parameters in the model to display posterior histograms for all of them, we provide these in the electronic supplement.

In Figure 5.6, we show a summary of the distributions of all parameters. Here, the 5%, 50% and 95% quantile of each parameter’s posterior distribution is shown, i.e. a 90% conﬁdence interval.

The conﬁdence intervals are shown for the means of the global distributions as well as for the parameters of each region. As one can see in Figure 5.6, the parameters that belong to the event term, ar [eq. (5.14)], are in general associated with a higher uncertainty (illustrated by a larger conﬁdence interval) than the record and site related parameters br and cr. The latter ones also display higher variability between the regions. This is discussed further in the next section.

In Figure 5.7, we show the residuals of the data with respect to the global parameters. Since the result of the analysis is a distribution for each parameter, not a single point, we calculate the residuals in the following way: For each data point, 100 parameter sets are sampled from the posterior distributions of the global parameters. With these sets of parameters, a PGA value is predicted according to eqs. (5.14) to (5.16), and the residual to the observed value is computed.

Thus, there are 100 residuals per data point. The distribution of the residuals, both between- and within-event residuals, are then shown in Figure 5.7. As one can see, there is no obvious bias in the residuals.

Analogous to Figure 5.7, we show a residual distribution (again for 100 parameter sets for each data point) in Figure 5.8, but this time with respect to the regional parameters. Here, for each data point the parameters are sampled from the respective regional posterior distributions. Again, we see that there is no apparent bias in the residuals. The regional residual distributions are also narrower than the global ones in Figure 5.7, in particular for the between-event residuals.

Results

** Figure 5.8: Residual distributions, calculated with the regional parameters; (a) between-event residuals; (b) within-event residuals.**

The mean values of the between- and within-event residual distribution for each data point are plotted against magnitude and distance in Figure 5.9, respectively. Figure 5.9 (a) and (b) shows mean residuals with respect to the global parameters, while (c) and (d) are calculated using the residual parameters. There is no obvious trend with magnitude ((a) and (c)), while for larger distances the model underpredicts the data when using the global parameters (b). By contrast, when the residuals are calculated with the regional parameters (d), there is no trend with RRUP.

In Figure 5.10, we compare the mean residuals for each of the individual regions using the regional and global parameters. Here, we see that the residuals with the regional parameters are lower over the individual regions. Thus, even though the residual distribution for the whole dataset is similar when using the global or regional parameters, within the individual regions there is a better ﬁt with the regional parameter – not unexpected, since the regional parameters are primarily determined by data from the speciﬁc regions.

Discussion and Conclusions