# Smoothing, Normalization, Correction and Reduction

David M.W. Powers, School of Informatics and Engineering, Flinders University of South Australia

λ = TF* = µ + φ, µ = DF, α = µ·(µ + φ)/φ = µ·(µ/φ + 1)

Solving the quadratic inequality (55) for positive µ and φ and parameterizing with f, we obtain

φ ≥ µ²/(1-µ), µ ≤ ([φ² + 4φ]^½ − φ)/2

φ = µ²/[f·(1-µ)], α = f·(1-µ) + µ, 0 ≤ f ≤ 1

Hence a minimal variant of ‘add φ’ smoothing would be to set documents with such expected levels of occurrence and one actual occurrence to 1+φ/µ. In this model, setting f = 1 violates the model, and even f = 0.5 is not conservative enough as it states that half the null contexts are actually relevant; however, a good search term has µ ≪ 0.5 as the appropriate estimate of the probability of occurrence in a context whose relevance is not known, and f = µ leads to the definition of φ as the odds ratio ρ:

φ = ρ = µ/(1-µ), α = µ + µ·(1-µ) = 2µ-µ², γ = (2-µ)/(1-µ); β = 1/(1-µ); δ = 1/(2-µ) (59)

where the increased estimate for β becomes our increased estimate of frequency when TF = DF and all non-zero frequencies are 1. Note that β ≥ 1 and γ ≥ 2.
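These identities are easy to sanity-check numerically. The following sketch (my own illustration; the variable names simply mirror the symbols in the text) sets f = µ and confirms that φ collapses to the odds ratio ρ, that α = 2µ−µ², that β ≥ 1 and γ ≥ 2, and that δ = 1 − 1/γ:

```python
# Numeric check of the f = mu parameterization (illustrative only;
# names phi, alpha, beta, gamma, delta, rho follow the text).
def f_mu_parameters(mu):
    f = mu                           # estimated fraction of relevant null contexts
    phi = mu**2 / (f * (1 - mu))     # phi = mu^2 / [f(1-mu)]
    rho = mu / (1 - mu)              # odds ratio
    alpha = f * (1 - mu) + mu        # = 2*mu - mu^2 when f = mu
    beta = 1 / (1 - mu)
    gamma = (2 - mu) / (1 - mu)
    delta = 1 / (2 - mu)
    return phi, rho, alpha, beta, gamma, delta

for mu in (0.01, 0.1, 0.3):
    phi, rho, alpha, beta, gamma, delta = f_mu_parameters(mu)
    assert abs(phi - rho) < 1e-12               # phi collapses to the odds ratio
    assert abs(alpha - (2*mu - mu**2)) < 1e-12  # alpha = 2mu - mu^2
    assert beta >= 1 and gamma >= 2             # as noted in the text
    assert abs(delta - (1 - 1/gamma)) < 1e-12   # delta = 1 - 1/gamma
```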

When TF = DF = µ, there is no distinction between relevant and non-relevant contexts and our definitions are useful only to describe a zeroth order model Gº in which the probability of occurrence in any context, P0 = µ, is equated with the probability of occurrence in a relevant context, δ. To bootstrap a first order model that takes the relevance of a context into account, we estimate the true probability of relevant null contexts (relevant contexts with no occurrences) as f = µ, the observed proportion of contexts that are non-null, and hence relevant, but have no recurrences, giving us the parameterization in (59) and estimates of α and δ that satisfy 0 ≤ µ ≤ α ≤ δ ≤ 1 for all values of TF and DF.

Thus one approach to ensuring accurate demeaning of the non-null contexts is to re-estimate β and γ as above when TF = DF, and then in all cases subtract β, our estimate of the expected number of occurrences in a relevant context, from the actual frequencies (modification β).

The error introduced by modification β of each non-null context is thus β − µ ≥ 0, while an equal error of opposite sign is introduced by the modification of each null context. An argument could also be made for subtracting γ, our estimate of the expected number of occurrences in a non-null context (modification γ); however, this unfairly uses the single occurrence to define a relevant document and then discounts it as an occurrence – the question of relevance must be decided independently for γ to be a valid estimate.

Note that applying Good-Turing smoothing or Katz backoff across the different term types will tend to give a decreased estimate of the frequency of such rare terms drawn from an open class, as assumed for search terms (Gale, AAAA), and its underlying exponential form is not necessarily appropriate for smoothing across types that obey Zipf’s Law (Samuelsson, 1996; Gale, AAAA), although Samuelsson’s ‘deceptively similar’ modification that conforms to the Zipfian distribution may be. However, the strict version of Zipf’s Law implies a finite lexicon, in contradiction of our notion of an open class, and needs to be understood as a sum of distinct distributions with different characteristics (Powers, 1998); Zipf (1956) also noted that it was dependent on document size.

The implication of point 3 is that we need to go beyond the occurrence count to determine whether a context is relevant – there are approaches based on word similarity using techniques such as LSI or WordNet that give an independent estimate of the probability of a specific context being relevant. If we are using SVD to calculate the LSI, then we have a recurrence problem: we could come back with a revised estimate and recalculate everything.

Note that log(1+x) ≈ x for small x. Hence it is sufficient to replace the splog value for unit occurrences by -γ to implement the modification for TF=DF, representing the lower than expected occurrence counts.

More generally, log(k+x) ≈ log(k) + x/k, so that to achieve modification µ (resp. α, β or γ) it is sufficient to decrement the splog value by µ/k (resp. α/k, β/k or γ/k). We will refer to the original model as splog0, and the models implementing the respective modifications as splogµ, splogβ, etc. Model splogµ is both sparsity- and occurrence-preserving, whilst splogβ may introduce additional zeros when β is integral, and the other two are not usable in the absence of independent information about relevance.
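The effect of these decrements on the raw log counts can be checked directly. In the sketch below (my own illustration, taking splog as the plain log of the occurrence count), decrementing log k by µ/k tracks log(k − µ) to second order, while null contexts are never touched, so sparsity is preserved as claimed for splogµ:

```python
import math

# Modification mu on the log scale: subtract mu/k from log(k) rather than
# recomputing log(k - mu). Null counts (k = 0) are never touched, so the
# modification is sparsity-preserving.
def splog_mu(count, mu):
    if count == 0:
        return 0.0                       # sparsity preserved
    return math.log(count) - mu / count  # log(k) - mu/k ~ log(k - mu)

mu = 0.05
for k in (1, 2, 5, 10):
    approx = splog_mu(k, mu)
    exact = math.log(k - mu)             # direct subtraction from the count
    assert abs(approx - exact) < mu**2   # error is O((mu/k)^2)
```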

The best choice is arguably splogµ in terms of guaranteeing preservation of both sparsity and occurrence information. Moreover, splogµ introduces considerably fewer errors, and hence a much lower sum-of-squares error, in the high-sparsity distributions expected and desirable for useful search terms, in which the number of null contexts (splogβ errors) vastly exceeds the number of non-null contexts (splogµ errors). In addition, splogµ restricts the effective measure of central tendency to the range between the median or mode and the geometric mean (see Eqn 53).

Note that the negative values that occur when splogging a frequency that is less than expected could actually be useful as a non-positive condition for implementing the Boolean NOT operator – it is arguably better to say the document has a lower than expected occurrence rate than to force it to have zero occurrences.

## Parameterization of the P-mixture

We perform a similar analysis to see how TE handles the pathological cases. As noted above, the K-mixture and P-mixture are not well defined in this case, as γ=1 and β=δ=0, violating our constraint that 0 < δ ≤ 1 (and α and η are undefined as a result).

Analogous to the corrections in the previous section, we can propose two simple corrections to the K-mixture: 1. setting γω = 1.5 in the specific case where TFw = DFw; or 2. incrementing γ in all cases using γω = γ + 0.5. In both cases identities 28 and 30 to 32 can be used to define βω, δω, εω and ζω from γω alone. These are clearly no longer consistent with the empirical values for TFw and DFw, so we give priority to DFω ≡ DFw and treat TFw as a truncated estimate of the true value TFω ≡ γωDFw according to identity 34. Identities 27 and 33 may now be used to define αω and ηω. Note that whereas previously we considered an ‘add φ to µ’ model, here we are parameterizing as an ‘add 1/κ to β and γ’ model, recalling that γ estimates TFwd and TCwd.

Elaborating the two simple cases and asserting condition 55 as a constraint, we get

λ = TFω = 1.5·DFω, µ = DFω = DFw = 2φ ≤ ⅓, if DFw = TFw (60)
λ = TFω = [TFw/DFw + 0.5]·DFω, µ = DFω = DFw = 2φ ≤ ⅓ (61)

More generally, we can reduce the increment in line with the specificity of terms κ:

λ = TFω = (1 + 1/κ)·DFω, µ = DFω = DFw = φκ ≤ 1/(κ+1), if DFw = TFw (62)
λ = TFω ≡ TFw + φ = [TFw/DFw + 1/κ]·DFω, µ = DFω ≡ DFw = φκ ≤ 1/(κ+1) (63)

Now equations 26 to 33 define the κP-mixture for these smoothed models. Note that in the DFw = TFw case, we have β = λ/µ − 1 = 1/κ (62 or 63). Furthermore, with κ = 2 (Eqn 61 ≡ 63) ζ corresponds in value to γ in the original K-mixture, and the κ=2 parameterization of the κP-mixture leads to a ζ-parameterization in terms of ζ = TFw/DFw, µ = DFw, referred to as the ζP-mixture, whereas the K-mixture is parameterized in terms of γ = TFw/DFw, µ = DFw, representing the limit as κ approaches infinity, and can also be identified as the γP-mixture. The special case κ = 1 gives φ = µ, β = TFw/DFw and a standard ‘add one’ model, referred to as the βP-mixture.
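A minimal sketch of this ‘add 1/κ’ smoothing (the function name is mine) makes the two limiting cases concrete: κ = 1 reproduces the ‘add one’ behaviour with φ = µ, while very large κ approaches the unsmoothed K-mixture ratio γ = TFw/DFw (the constraint µ ≤ 1/(κ+1) is ignored in the limiting illustration):

```python
# kappa-P smoothing of the raw rates (sketch of Eqn 63): add phi = mu/kappa
# to the term frequency, keeping mu = DF_w fixed.
def kappa_p(tf_w, df_w, kappa):
    mu = df_w
    phi = mu / kappa           # fixed increment on the interval scale
    lam = tf_w + phi           # TF_omega = [TF_w/DF_w + 1/kappa] * DF_omega
    return lam, mu, phi

tf, df = 0.3, 0.1

# kappa = 1 is a standard 'add one' model: phi = mu, beta = lam/mu - 1 = TF_w/DF_w.
lam, mu, phi = kappa_p(tf, df, 1)
assert abs(phi - mu) < 1e-12
assert abs(lam/mu - 1 - tf/df) < 1e-12

# Large kappa approaches the unsmoothed K-mixture ratio gamma = TF_w/DF_w.
lam, mu, phi = kappa_p(tf, df, 1e9)
assert abs(lam/mu - tf/df) < 1e-6
```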

With these models we are setting a fixed φ = µ/κ subject to µ ≤ 1/(κ+1), or equivalently κ ≤ (1-µ)/µ (both forms following from substitution of φ in Eqn 56). This condition must hold for µ = DFw for all terms w, so we must set κ on the basis of the greatest value of DFw observed in the corpus. We may use this parameter either for just the DF=TF case (Eqn 60, or more generally 62) or for all cases (61 and 63).

Alternatively, if we use f = µ, φ = µ/(1-µ) to define the parameterization (59, or 63 with κ = µ/φ = 1-µ ≤ 1), the mixture is self-adjusting and automatically satisfies κ ≤ (1-µ)/µ = 1/φ = κ/µ and thus µ ≤ 1/(κ+1). Observe that this also forces κ ≤ 1, whereas in the constant κ-parameterizations we have normally tried to set κ ≥ 1. In the limit this auto-parameterization approaches the φ = µ, κ = 1 (‘add one’) case as µ approaches 0 (corresponding precisely to those rarer terms where TFw=DFw is more likely, and in which limit we find TFω = 2DFω). In this case of small µ we can also approximate 1/κ = 1/(1-µ) ≈ 1+µ and hence φ ≈ µ(µ+1) = µ + µ², being early terms of the GP that defines the Gº model.
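The self-adjusting choice can be verified to satisfy its own constraint for every admissible µ, since µ ≤ 1/(2−µ) is equivalent to (1−µ)² ≥ 0. The sketch below (helper name is mine) also checks the small-µ approximation φ ≈ µ + µ²:

```python
# Self-adjusting parameterization: kappa = 1 - mu, phi = mu/(1 - mu).
def auto_params(mu):
    kappa = 1 - mu
    phi = mu / (1 - mu)
    return kappa, phi

# The constraint mu <= 1/(kappa+1) holds automatically for every mu in (0,1),
# because mu <= 1/(2 - mu) is equivalent to (1 - mu)^2 >= 0.
for i in range(1, 100):
    mu = i / 100
    kappa, phi = auto_params(mu)
    assert mu <= 1/(kappa + 1) + 1e-12
    assert kappa <= 1                # the auto-parameterization forces kappa <= 1

# For small mu, phi ~ mu + mu^2 (early terms of the geometric progression).
kappa, phi = auto_params(0.01)
assert abs(phi - (0.01 + 0.01**2)) < 2 * 0.01**3
```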

The question is whether (63) is a theoretically reasonable generalization of (62) for either the constant or the self-adjusting parameterization using κ = 1-µ. In particular, low TFwd rates will occur for conditions other than DFw=TFw, and must occur for TFw < 2DFw. In these cases the TFwd also represent underestimates of the actual probabilities, according to our assumptions about content words recurring in relevant contexts (and so defining bursts or clusters).

Assume the validity of the TFw=DFw re-estimate (62), and now consider a case where the expected count for term w is TFw = DFw + k. According to our argument in defining this estimate, the probability of unexpressed occurrence or recurrence has decreased, and there will be exactly one document containing two occurrences for the case k=1, so we still find f=µ. For µ ≥ k ≥ 1 it is possible to have only documents with one or two occurrences, for which f=µ again follows, but it is also possible to find larger clusters, which we would however argue is consistent with the same distribution but simply further from the mode.

We now analyze how the probability of a document being relevant given that there are zero occurrences varies with k, defining this as pk satisfying

α = PD(d Я w) = PD(d ∋ w) + PD(d Я w | d ∌ w)·PD(d ∌ w) = µ + pk·(1-µ)

By some simple arithmetic manipulations we see

α = [(TF+φ)·µ] / [(TF+φ) − µ] = µ + µ²/[k+φ] = µ + (1-µ)/[k·(1-µ)/µ² + 1/f]

This gives us a number of interesting and insightful relationships involving the odds ratio ρ = µ/(1-µ), and in particular ρµ (which is constant for a given w or µ = DFw):

pk = ρµ/[φ+k], φ = µ²/[f·(1-µ)] = ρµ/f
1/pk = 1/f + 1/fk, fk = µ²/[k·(1-µ)] = ρµ/k, f = fφ = µ²/[φ·(1-µ)] = ρµ/φ

Recalling that interval scales are often more appropriate and amenable to statistical manipulation than the reciprocal frequency and probability scales, we see that f = fφ defines the k=0 proportion of the null documents, and k specifies increments in terms of the interval φ, with the smoothing technique being clearly seen to be of ‘add one’ character on this interval scale. Moreover, although (67) is apparently only well defined for k ≥ 0, because of being defined in terms of probabilities, this really only illustrates again that the fundamental events are occurrences defining intervals, and that probabilities are not always the most convenient averages to use as the basis for a model. Thus the φ and 1/f terms go to 0 in the K-mixture limit, and the k and 1/fk terms go to 0 in the TF=DF limit, and when both limiting conditions hold we confirm that the K-mixture is not well defined for TF=DF.
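These relationships can again be confirmed numerically. The sketch below (function name is mine, taking TF = µ + k on the text’s normalized scale) checks that the two expressions for α agree and that 1/pk decomposes as 1/f + 1/fk:

```python
# Numeric check of the p_k relationships (illustrative sketch).
def relevance_given_null(mu, f, k):
    rho = mu / (1 - mu)                  # odds ratio
    phi = mu**2 / (f * (1 - mu))         # phi = rho*mu/f
    pk = rho * mu / (phi + k)            # p_k = rho*mu/(phi + k)
    return rho, phi, pk

mu, f, k = 0.1, 0.1, 2
rho, phi, pk = relevance_given_null(mu, f, k)

# alpha = mu + p_k(1-mu) agrees with alpha = (TF+phi)*mu / ((TF+phi) - mu),
# where TF = mu + k on the same scale.
tf = mu + k
alpha1 = mu + pk * (1 - mu)
alpha2 = (tf + phi) * mu / ((tf + phi) - mu)
assert abs(alpha1 - alpha2) < 1e-12

# 1/p_k = 1/f + 1/f_k with f_k = rho*mu/k: 'add one' on the interval scale.
fk = rho * mu / k
assert abs(1/pk - (1/f + 1/fk)) < 1e-9
```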

What this shows is that whatever justification we have for the k=0 case (TF=DF), it carries over sensibly to k > 0 cases in a manner highly reminiscent of Zipfian smoothing methods rather than of Good-Turing or the original Katz (1987) backoff scheme: Good-Turing/Katz backoff smooths towards an infinite geometric distribution over types that is similar to Gº, but Zipf’s Law implies a finite inverse linear distribution, and both of these appear to define bounds on the true distribution rather than reflecting the distributions directly (Samuelsson, 1996; Powers, 1998). Thus we conclude that the generalization of (62) to (63) is appropriate for any reasonable definition of κ, and predict that f=µ, κ=1-µ will be better than the constant κ-parameterizations.

Note that Katz (1996) both defined and tested models using the same corpus and aggregations of terms with similar statistics, whereas the correct way to test the model is to parameterize it using empirical probabilities drawn from a training set, to validate and compare models using an independent validation set, and then to publish results for the selected optimum model on a third independent test set.

The partitioning may be done on the basis of documents or terms, or preferably both.

We called the κ=1 and κ=2 parameterizations according to equations 61 and 63 the β- and ζ-parameterizations, defining the β- and ζP-mixtures as discussed above. We define the κ=1-µ parameterization according to equation 63 as the µP-mixture, and refer to other parameterizations with a constant κ as κP-mixtures. Equation 62 doesn’t define a true parameterization but rather a correction of the one specific degenerate case of the K-mixture, and we thus refer to these as κP-corrections, with the κ=1-µ, µP-correction being the recommended correction. Note that the constant κ parameterizations and corrections are subject to the constraint µ ≤ 1/(κ+1) and introduce an additional parameter, whilst the κ=1-µ constraint of the canonical P-mixture or P-correction does not introduce an additional parameter.

It is now a matter for empirical investigation as to which of these four models is best (and, in the case of the κ-parameterization or κ-correction, which value of κ), as well as demonstrating that they are better than the degenerate zeroth order Gº model that arises when TF=DF (this must be demonstrated in application to unseen text, where this may or may not hold). The auto-parameterized P-mixture may be expressed directly as

λ = TFω = [TFw/DFw + 1/(1-µ)]·DFω, µ = DFω = DFw = φ·(1-µ) ≤ 1/(κ+1)

The discussed variants of the κP-mixture defined by equations 26 to 33 and 63 (replacing 34 and 35) may be summarized as follows, noting that ρ is the odds ratio µ/(1-µ) = f/κ, that φ = µ/κ = µρ/f, and that 0 ≤ f ≤ 1 is required for a valid model, so that the γP-mixture is the ill-defined unsmoothed K-mixture defined by µ = DFw and one of: γ as the empirical ratio γo = TFw/DFw, δ as δo = 1 − 1/γo, or ε as εo = 1 − δo = 1/γo: