# «EMPIRICAL GROUND-MOTION MODELS FOR PROBABILISTIC SEISMIC HAZARD ANALYSIS: A GRAPHICAL MODEL PERSPECTIVE Kumulative Dissertation zur Erlangung des ...»

The generalization errors for the different classifiers/regressions are displayed in Table 6.2. The naive Bayes classifiers perform consistently better than regression, as shown by their lower generalization error. Table 6.2 also shows that PGV and PGA are predictors of similar quality when using the naive Bayes classifier, while for the regression PGV performs better (similar to Boatwright et al. (2001)). We also see that a combined predictor of PGV and PGA does not improve the predictive performance of either naive Bayes or regression. This can be interpreted to mean that the information content of PGA and PGV with respect to seismic intensity is similar. In the case of the naive Bayes classifiers, there is also no difference between the ones with a common standard deviation and those with a different one for each intensity class. This indicates that both classifiers generalize equally well to unseen data. However, we believe that it is preferable to use a common standard deviation, since for some intensity classes there are only a few data points, which might render the estimation of the standard deviations unstable.

In Figure 6.1, we show predictions of I given PGV. Here, for each PGV value the full conditional distribution $\Pr(I \mid \mathrm{PGV})$ is shown, color-coded from light (low $\Pr(I \mid \mathrm{PGV})$) to dark (high $\Pr(I \mid \mathrm{PGV})$) colors. For each value of PGV on the x-axis, the corresponding color-coded values $\Pr(I \mid \mathrm{PGV})$ along the vertical (I) axis sum up to unity. For comparison, we also plot the data points as well as the geometric means of the PGV values for each intensity class. The latter are used for the regression of the model of Faenza and Michelini (2010), which is also shown in Figure 6.1. As one can see, the most likely I predicted by the naive Bayes classifier (i.e. the intensity class with the highest $\Pr(I \mid \mathrm{PGV})$, corresponding to the darkest color for each intensity class) correlates reasonably well with the model of Faenza and Michelini (2010).

Figure 6.1 also shows the large scatter in the data, both for a given PGV value and for a given intensity value. This is very well represented by the naive Bayes classifier, which returns a relatively broad distribution $\Pr(I \mid \mathrm{PGV})$. The large scatter in intensity values for a particular PGV value indicates that it is important to treat I probabilistically, i.e. to use the full distribution. This is facilitated by a naive Bayes classifier.

**6.4 Discussion and Conclusions**

We have presented naive Bayes classification to predict intensities from ground motion intensity parameters (PGA and PGV) as an alternative to traditional regression models. A naive Bayes classifier predicts the distribution of a discrete variable given some predictor variables using Bayes’ rule, making the naive assumption that the predictor variables are conditionally independent given the target. This assumption greatly reduces the number of parameters to learn and is, albeit not realistic from a physical perspective, often sufficient for prediction. In our case, the assumption of conditional independence only applies if we use both PGA and PGV as predictors (and it applies to regression as well). From a purely physical perspective, this assumption is not justified, since there is correlation between PGA and PGV, but analysis of the generalization error (see Table 6.2) shows that the naive Bayes classifier nevertheless outperforms regression when it comes to prediction of seismic intensities from PGA and PGV. The naive Bayes classifier, however, is not suitable as a physical model for the data generating process.
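Written out for the two predictors, the conditional-independence assumption leads to the factorization below. This is a sketch in notation assumed here (class index $k$, prior $\Pr(I = k)$, class-conditional densities $p(\cdot \mid I = k)$); it may differ in form from eq. (6.2).

$$
\Pr(I = k \mid \mathrm{PGA}, \mathrm{PGV}) =
\frac{\Pr(I = k)\, p(\mathrm{PGA} \mid I = k)\, p(\mathrm{PGV} \mid I = k)}
{\sum_{j} \Pr(I = j)\, p(\mathrm{PGA} \mid I = j)\, p(\mathrm{PGV} \mid I = j)}
$$

With a single predictor, the corresponding factor for the other predictor is simply dropped, and no independence assumption is needed.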

We have built a naive Bayes classifier to estimate $\Pr(I = k \mid \mathrm{PGV}, \mathrm{PGA})$, making the assumption that the conditional distribution of PGA and PGV, respectively, given an intensity class, is log-normal. The analysis of the generalization error, estimated via leave-one-out cross-validation, shows that the naive Bayes classifier performs better than regression when it comes to predicting unseen data. The generalization error also shows that PGV and PGA individually can both predict I similarly well, while the joint use of them does not lead to an improvement in prediction. Incidentally, we believe that this is due to the high correlation between PGA and PGV, which means that one can be used as a surrogate for the other.
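The construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this study: the data are synthetic, the function names (`fit`, `predict_proba`, `loo_error`) are hypothetical, a common standard deviation per predictor is assumed, and the misclassification rate stands in for whatever loss underlies Table 6.2.

```python
import numpy as np

def fit(log_x, intensity):
    """Estimate priors, class means and one pooled std per predictor.

    log_x: (n, d) array of log ground-motion values (e.g. ln PGA, ln PGV).
    intensity: (n,) array of integer intensity classes.
    """
    classes = np.unique(intensity)
    idx = np.searchsorted(classes, intensity)   # class index of each record
    priors = np.bincount(idx) / len(idx)
    means = np.array([log_x[idx == j].mean(axis=0) for j in range(len(classes))])
    resid = log_x - means[idx]                  # within-class residuals
    sigma = np.sqrt((resid ** 2).sum(axis=0) / (len(idx) - len(classes)))
    return classes, priors, means, sigma

def predict_proba(model, log_x):
    """Posterior Pr(I = k | predictors) for each row of log_x."""
    classes, priors, means, sigma = model
    z = (np.atleast_2d(log_x)[:, None, :] - means[None, :, :]) / sigma
    # Normal log-density in ln(x), summed over predictors (conditional
    # independence). Terms that are equal for all classes (the 1/x Jacobian
    # of the log-normal, the common sigma) cancel in the normalization.
    log_post = np.log(priors) - 0.5 * (z ** 2).sum(axis=2)
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def loo_error(log_x, intensity):
    """Leave-one-out misclassification rate of the most likely class."""
    n = len(intensity)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(log_x[keep], intensity[keep])
        p = predict_proba(model, log_x[i])
        wrong += int(model[0][np.argmax(p)] != intensity[i])
    return wrong / n

# Synthetic stand-in data (NOT the Italian dataset): classes 3..7, 30 records each.
rng = np.random.default_rng(0)
intensity = np.repeat(np.arange(3, 8), 30)
log_pga = 0.8 * intensity - 1.0 + rng.normal(0, 0.5, intensity.size)
log_pgv = 0.9 * intensity - 3.0 + rng.normal(0, 0.5, intensity.size)
log_x = np.column_stack([log_pga, log_pgv])

model = fit(log_x, intensity)
proba = predict_proba(model, log_x)     # one full distribution over I per record
print("LOO misclassification rate:", loo_error(log_x, intensity))
```

Passing a single column of `log_x` gives the PGA-only or PGV-only classifier, so the three predictor sets can be compared with the same functions.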

A particularly appealing feature of the naive Bayes classifier is that it provides a direct estimate of the discrete intensity distribution $\Pr(I \mid \mathrm{PGV}, \mathrm{PGA})$. Compared to regression, there is no rounding or interpolation necessary, meaning that integer values are estimated directly. Since Bayes’ rule [eq. (6.2)] is used for the estimation of $\Pr(I \mid \mathrm{PGV}, \mathrm{PGA})$, an estimate of $\Pr(X \mid I)$ is required, where X is either PGA or PGV. Thus, the model can be just as easily used to predict the ground motion parameters given I.

We have learned two naive Bayes classifiers, one with a common standard deviation of the distribution of the ground motion intensity parameters over the different intensity classes, and one with different standard deviations. Even though both classifiers have a similar generalization error, we believe that it is better to use the former, since it provides a more stable estimate of the standard deviation. For some intensity classes, there are only 3 or 5 records, which makes it difficult to obtain a precise estimate of the standard deviation. Other possibilities exist, e.g. one could estimate a common standard deviation for adjacent intensity classes, which is done in Ebel and Wald (2003). Nevertheless, we think that the assumption of a common standard deviation over all intensity classes is reasonable.
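One common way to obtain a single standard deviation across classes is the pooled estimator sketched below. The symbols are assumed here for illustration ($N$ records, $K$ intensity classes, $x_i$ the PGA or PGV value of record $i$, $I_i$ its intensity, $n_k$ the number of records in class $k$), and the estimator actually used in the study may differ, for instance in its normalization.

$$
\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, I_i = k} \ln x_i ,
\qquad
\hat{\sigma}^2 = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{i:\, I_i = k} \left( \ln x_i - \hat{\mu}_k \right)^2
$$

Because every record contributes to $\hat{\sigma}^2$, classes with only a handful of records borrow strength from the well-populated ones, which is the stability argument made above.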

In contrast to a regression model, which is unbounded, the naive Bayes classifier can only predict intensity values which occur in the underlying dataset. In principle, one could extrapolate a regression model to ground motion intensity values that lie beyond the extreme values found in the dataset to predict higher/lower intensity values (e.g. intensity values greater than 8 for the current dataset). This is not possible with a naive Bayes classifier. However, it is questionable if this is a disadvantage, since extrapolation of a model outside the parameter boundaries of its underlying dataset can be dangerous (see e.g. Bommer et al. (2007) for a discussion of extrapolating ground motion prediction equations).

The naive Bayes classifier that was learned in this study is trained on a dataset consisting of macroseismic intensities (of the Mercalli-Cancani-Sieberg scale) and PGA/PGV values from Italy, which is the same dataset used in Faenza and Michelini (2010) (see Data and Resources section). We chose it because the selection and preprocessing steps are well documented. We do not claim that this automatically justifies the application of their or our model in other regions, which is an issue that requires careful consideration of a number of arguments (e.g. Cotton et al., 2006; Bommer et al., 2010). Certain ground shaking levels are bound to cause damage everywhere in the world, but since macroseismic intensity is a somewhat qualitative parameter that may include information on building quality, exact values/distributions might change from region to region. On the other hand, if the goal is to predict the most probable intensity value, the model may well be applicable in other regions, since this task is probably less sensitive to regional influences. As said before, we leave this issue up to the user.

In this short note, we have considered PGA and PGV as predictor variables for I. Of course, the model can be extended to include other variables such as magnitude or distance (e.g. Tselentis and Danciu, 2008). In that case, however, it is not as easy as in the case of PGA and PGV to assume a parametric distribution for each intensity class. Thus, either these variables need to be discretized, or some other method such as kernel density estimation needs to be employed. Such an analysis, however, is beyond the scope of this article.
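To illustrate what the kernel density alternative could look like, the sketch below estimates a class-conditional density for one additional predictor without assuming a parametric form. It is only an illustration under stated assumptions: the magnitudes are synthetic, the function name `gaussian_kde_pdf` is hypothetical, a Gaussian kernel is used, and the bandwidth follows the normal-reference rule of thumb.

```python
import numpy as np

def gaussian_kde_pdf(samples, query, bandwidth=None):
    """Gaussian kernel density estimate of p(query) from 1-D samples."""
    samples = np.asarray(samples, dtype=float)
    query = np.atleast_1d(np.asarray(query, dtype=float))
    if bandwidth is None:
        # Normal-reference rule of thumb: h = 1.06 * std * n^(-1/5)
        bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-0.2)
    z = (query[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(samples) * bandwidth * np.sqrt(2.0 * np.pi))

# Synthetic magnitudes for the records of ONE hypothetical intensity class.
rng = np.random.default_rng(0)
mags = rng.normal(5.5, 0.6, 50)

grid = np.linspace(0.0, 11.0, 2201)
density = gaussian_kde_pdf(mags, grid)   # stand-in for p(M | I = k)
```

In a naive Bayes setting, one such density per intensity class would take the place of the log-normal factor for that predictor in Bayes’ rule.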

**Data and Resources**

The dataset used in this study is the one compiled by Faenza and Michelini (2010), which is available in their electronic supplement under http://www3.interscience.wiley.com/journal/123266793/suppinfo.

**Acknowledgments**

We acknowledge that this paper was helped by the discussions in the Pegasos Refinement Project workshops. We thank the reviewers Fleur Strasser and Karen Assatourians and the editor Arthur McGarr for their comments, which helped to clarify and improve the manuscript.

## GENERAL CONCLUSIONS AND PERSPECTIVES

In this work, we have looked at uncertainty in GMMs. During that process, we also investigated some other questions that are of interest in the context of GMMs and PSHA, such as correlation between ground motion intensity parameters or regional differences in ground motion scaling.

A considerable amount of the uncertainty associated with GMMs pertains to their functional form f(X) (cf. eq. (1.2)). Often, f(X) is determined based on physical considerations and the analysis of residuals. In chapter 2, we have taken a new stance and based f(X) on its predictive capability over the generating dataset. To this end, we introduced the concept of generalization error and cross-validation. The idea here is that for PSHA, the primary goal of a GMM is not to be a model of the physical processes in the ground motion domain, but to accurately predict future expected ground motions. Therefore, a GMM should be oriented along the lines of its predictive power. In this context, see also Breiman (2001a).
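The idea of choosing a functional form by its estimated generalization error can be sketched with a toy example. The following is not the procedure of chapter 2: it uses synthetic one-dimensional data, polynomial degree as the only model choice, k-fold rather than any specific cross-validation scheme, and squared error as the loss; the function name `kfold_cv_error` is hypothetical.

```python
import numpy as np

def kfold_cv_error(x, y, degree, k=5, seed=0):
    """Mean squared prediction error of a polynomial fit, estimated by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    sq_err = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), test_idx)
        coef = np.polyfit(x[train_idx], y[train_idx], degree)   # fit on training folds
        pred = np.polyval(coef, x[test_idx])                     # predict held-out fold
        sq_err += np.sum((pred - y[test_idx]) ** 2)
    return sq_err / len(x)

# Synthetic data generated from a quadratic with noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 1.5 * x ** 2 + rng.normal(0.0, 0.3, x.size)

errors = {d: kfold_cv_error(x, y, d) for d in range(0, 9)}
best = min(errors, key=errors.get)   # degree with the lowest estimated error
print(best, errors[best])
```

Too simple a form (degree 0 or 1) underfits and has a high cross-validated error, while additional terms beyond what the data support no longer reduce it; the selected form is the one that predicts held-out data best.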

Based on the above considerations, a regression model that is optimized for its predictive capability is learned from the NGA dataset. The model is rather complex (having many parameters), but is not overfit. We have calculated an equivalent stochastic model, which is physically interpretable (and also plausible, compared with already published models for western North America). Thus, the method we proposed is a convenient way to optimize a regression model for predictive power and to check that it makes physical sense.

A real physical interpretation is possible only for the equivalent stochastic model, since the parameters of the regression model are not tied to any physical meaning. However, partial dependence plots can reveal several characteristics of the model/data (cf. Figure 2.4, eq. (2.7) and Friedman (2001)). For example, in the partial dependence plot showing the scaling of PGA with distance there is a ‘bump’ visible in the range between RJB = 50 km and RJB = 90 km. This ‘bump’ can be associated with the so-called Moho-bounce. This effect is not modeled in the NGA models, but our analysis shows that it is supported by the data. Hence, the flexible, generalization-error optimized model shows which features are actually inherent in (or supported by) the data and can thus be helpful in choosing a functional form that models these features, thereby reducing uncertainty about f(X).
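For readers unfamiliar with partial dependence, the computation can be sketched as follows, using the general definition of Friedman (2001): the model prediction is averaged over the dataset while one predictor is held fixed at each grid value. The function name `partial_dependence`, the toy model and the synthetic predictor matrix are assumptions for illustration and are unrelated to the regression model of chapter 2.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction over the data with column `feature` fixed at each grid value."""
    values = np.empty(len(grid))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite one predictor for all records
        values[j] = predict(X_mod).mean()  # average over the other predictors
    return values

def toy_model(X):
    """Hypothetical fitted model: linear in column 0, quadratic in column 1."""
    return 2.0 * X[:, 0] - 0.5 * X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))              # synthetic predictor matrix
grid = np.linspace(-2.0, 2.0, 9)
pd_curve = partial_dependence(toy_model, X, feature=0, grid=grid)
```

Plotting `pd_curve` against `grid` gives the partial dependence plot for that predictor; for this additive toy model it is a straight line with slope 2, shifted by the average contribution of the other column.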

On the other hand, the partial dependence plots also show data ranges which are problematic. In particular, this holds for the magnitude and the depth to the top of the rupture. Since there are typically fewer earthquakes than records in a strong motion dataset, these two variables are less well sampled than e.g. distance, and thus the scaling of ground motion with them is less clearly defined by data. This manifests itself in ‘rougher’ partial dependence plots for the earthquake related variables. The overall scaling makes sense, but in some ranges it is overly complicated. This reflects that the underlying dataset (at least for the earthquake related variables magnitude and depth to the top of the rupture) is not a representative sample of the true underlying distribution. Thus, for these variables the model can only provide guidance on the general form of ground motion scaling.

In chapter 2, we have optimized the model with respect to generalization error for the moment magnitude, Joyner-Boore distance, VS30 and depth to the top of the rupture; the faulting style is included in the model, but is not adapted. One could also include other variables, such as directivity parameters or sediment depth, to investigate their (functional) relation to ground motion. One could also use basis functions other than polynomials, such as splines. Non-parametric regression methods such as MARS (multivariate adaptive regression splines, Friedman (1991)) or random forests (Breiman, 2001b) may also provide valuable insights.

Along the same lines as the flexible regression model, we used Bayesian networks (BNs) to investigate what can be learned (purely) from data. The BN is a representation of the joint distribution of PGA and the (potential) predictors X, $\Pr(\mathrm{PGA}, X)$. Here, results are slightly complicated by the need to discretize the data, but again we find that there are problems in the underlying dataset: several data ranges are not well sampled. However, in ranges with good data coverage the BN gives reasonable results. One example of a possibly not well represented data range is the scaling of PGA with very large magnitudes (MW ≥ 7.5). Here, a decrease of PGA with increasing magnitude is observed (so-called oversaturation). This is also seen in the NGA models (Abrahamson and Silva, 2008; Boore and Atkinson, 2008; Campbell and Bozorgnia, 2008), but it was decided not to model this effect due to a lack of scientific consensus on that matter.