# Smoothing, Normalization, Correction and Reduction

David M.W. Powers, School of Informatics and Engineering, Flinders University of South Australia ...

TEwd = –log(Pw(TFwd)) = –log(δ)·TFwd – log(η) = –log(DFw/α)·TFwd – log(DFw/β)

Rearranging, TEwd may be viewed as a combination of the simple TFIDF and IDF models, or a smoother ‘add one’ TFIDF model, with the subtraction of redundant information as revealed by a simple ITF model, since IDF ignores extra occurrences of w and ITF models the overestimated entropy of the expected β extra occurrences of w amongst the λ=TFw occurrences treated as introducing new relevant contexts (37 & 38).

TEwd = TFwd·IDFw + IDFw – ITFw = (TFwd+1)·IDFw – ITFw

where ITFw = –log(β) – TFw·log(α) = –log(ν), ν = β·α^λ

There is however some fine tuning that will be performed later when we present our parameterization models – in particular TEwd is not well defined in the K-mixture when TFw = DFw (e.g. at most one occurrence of word w in any document).
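As a sanity check on the algebra above, here is a minimal Python sketch comparing the direct form of TEwd with its (TFwd+1)·IDFw – ITFw decomposition. The function names and the sample parameter values (DFw = 0.01, α = 2, β = 1.5) are purely illustrative assumptions, not values from the text:

```python
import math

def term_entropy(tf_wd, df_w, alpha, beta):
    """Direct form: TEwd = -log(delta)*TFwd - log(eta), with
    delta = DFw/alpha and eta = DFw/beta (alpha, beta illustrative)."""
    delta = df_w / alpha
    eta = df_w / beta
    return -math.log(delta) * tf_wd - math.log(eta)

def term_entropy_decomposed(tf_wd, df_w, alpha, beta):
    """Rearranged form: TEwd = (TFwd + 1)*IDFw - ITFw, with
    IDFw = -log(DFw) and ITFw = -log(beta) - TFwd*log(alpha)."""
    idf = -math.log(df_w)
    itf = -math.log(beta) - tf_wd * math.log(alpha)
    return (tf_wd + 1) * idf - itf
```

The two functions agree to floating-point precision for any positive inputs, confirming the rearrangement.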

Note that many different notations are used for the probabilities and expected values discussed here. Statisticians tend to use the term frequency to indicate a count irrespective of the total sample or corpus size, whereas in natural language processing we are more interested in the stricter definition of frequency as a rate of occurrence relative to some specific sample size or time period; in particular, in the above discussion we use frequency relative to some known number of words or a specific average document size – in statistics, this usage is referred to as relative frequency. Counts are not interpretable without knowledge of the sample size or time period. The average interval between occurrences, which corresponds to wavelength in the time domain, is also a useful scale that will be discussed later.

Furthermore, the presentation by Church & Gale (1995) of Katz’s K-mixture pre-empts the publication in the same journal of Katz’s (1996) account of his model, and the specific notation and formulation of the models are quite different in the two papers. The K-mixture is presented as an additive model in which two components contribute to the probability estimate for documents that do not contain a term.

However, the corresponding model of Katz (1996) is a two parameter variant, G′, of a more general three parameter model, G, that has only one term applicable to any specific term frequency. In fact, G′ and G=G″ can be considered as first order and second order variants (resp.) of a more general model extending the straightforward zeroth order model, Gº, that depends only on the mean document frequency µ=DFw.

The following presents this model in a form that uses the constructive notation proposed in Katz (1996) but deliberately eschews the adoption of the Greek letters used by him in his preferred presentation form, as they differ from those used above based on the usage of Church & Gale (1995). The first concept introduced by Katz is burstiness, Bm = Ew(k | k≥m), with B0 corresponding to the distribution mean, B1 to average burstiness (before topicality is established), and B2 to topical burstiness (in model G, two occurrences are necessary and sufficient to establish topicality):

Bm ≡ ∑r≥m Pw(r)·r / ∑r≥m Pw(r)

Katz defines Pm = Ew(k+1 | k, k≥m) as the average conditional probability of additional occurrences after seeing at least m, being the ratio of repeats to opportunities:

Pm ≡ ∑r≥m Pw(r)·(r – m) / ∑r≥m Pw(r)·(r – (m–1)) = 1 – 1/[Bm – (m–1)]

A model of order m is defined as having a conditional probability of repeat occurrence that is independent of the number k of previously observed occurrences once the burst size is at least m. In such a model Pw(k+1 | k, k≥m) = Cm is assumed to be a constant independent of k for all k≥m, and thus equals the average of all such constant conditional probabilities, Cm = Pm = Pk for all k≥m:

Cm ≡ ∑r>k≥m Pw(r) / ∑r≥k≥m Pw(r) = Pm (cf. Katz, pp. 38 & 43)

The model order m thus corresponds to the assumption Pm ≈ Pk for all k≥m, and allows us to define Pw(k) for these models (writing 0^k for the Kronecker form with 0^0 = 1 and 0^k = 0 for k > 0) as:

Gº : (1–P0)·P0^k
G′ : (1–P0)·0^k + P0(1–P1)·P1^(k–1)·[1–0^k]
G″ : (1–P0)·0^k + P0(1–P1)·0^(k–1) + P0P1(1–P2)·P2^(k–2)·[1–0^k–0^(k–1)]
G{m} : ∑j<m 0^(k–j)·(1–Pj)·∏i<j Pi + [1–∑j<m 0^(k–j)]·(1–Pm)·Pm^(k–m)·∏i<m Pi

A generalized term entropy for k = TFwd based on this generalized Katzian G{m} model may thus be defined as

TEwd = –∑j<m 0^(k–j)·[log(1–Pj) + ∑i<j log Pi] – [1–∑j<m 0^(k–j)]·[log(1–Pm) + (k–m)·log Pm + ∑i<m log Pi]

These order m models essentially treat the j<m cases individually, and define the j≥m cases as if δ = δm = Pm ≈ Cm were defined by an order 1 model (G′ or K-mixture) employing a membership threshold of m. Relating these parameters to the Greek parameters used earlier, we get P0 ≡ µ = DFw; P1 ≡ δ1 ≡ δ; Pk ≡ δk, where these Pj values may be empirically calculated using (39 or 40), or estimated for k≤m or k=m using (41) for a smoother model and/or a more accurate tail. Simplifying and taking advantage of the exponential Kronecker forms, we get probabilities for k

Pw(k) = ∑j≤m δi<j≤m·(1–δj)·δk,j≤m where δi<j≤m ≡ ∏i<j δi ; δk,j<m ≡ 0^(k–j) ; δk,j=m ≡ δm^(k–j)

Note that these forms are less convenient than the K-mixture and P-mixture due to the Kronecker terms δk,j<m = 0^(k–j) that appear for k>0 in this formulation, but are avoided in the previously given mixtures (we are not concerned with the k=0 case of documents that omit the term).
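The quantities Bm and Pm, and the order-1 model G′, can be sketched directly from their definitions. The following Python is illustrative only: the dict encoding of Pw(r), the toy distribution, and all function names are our assumptions, not Katz’s notation:

```python
def burstiness(pw, m):
    """Bm = Ew(k | k >= m): the mean of Pw restricted to counts r >= m.
    pw maps an occurrence count r to its probability Pw(r)."""
    tail = [(r, p) for r, p in pw.items() if r >= m]
    return sum(p * r for r, p in tail) / sum(p for r, p in tail)

def repeat_prob(pw, m):
    """Pm: the ratio of repeats to opportunities among bursts of size >= m."""
    tail = [(r, p) for r, p in pw.items() if r >= m]
    num = sum(p * (r - m) for r, p in tail)
    den = sum(p * (r - (m - 1)) for r, p in tail)
    return num / den

def g_prime(k, p0, p1):
    """Order-1 model G': Pw(0) = 1 - P0; Pw(k) = P0(1-P1)P1^(k-1) for k >= 1."""
    return 1 - p0 if k == 0 else p0 * (1 - p1) * p1 ** (k - 1)
```

For any distribution the identity Pm = 1 – 1/[Bm – (m–1)] holds, and G′ defines a proper distribution (its probabilities sum to 1), which can be checked numerically on a toy Pw.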

Nonetheless, we find this new notation both elegant and intuitive, as we see that the higher order model is a smoothing model using a mixture of elementary models of different orders, where δ terms for component models of lower order j<m reflect a probability of contribution 0 to higher or lower orders, whilst the δ term matching the model order j=m reflects a probability of Pm for each successive term above the model order. This convenience is also observed in the P-mixture (26).

Katz (1996) has verified that the G-model performs considerably better than several lower order models including G′, and Church and Gale (1995) have similarly demonstrated the competitiveness of G′ in the form of their K-mixture, albeit validating on the training data in both cases rather than on independent data. A number of other researchers have used the technique or the concept of burstiness (Umemura, ZZZZ) in various applications and have found it effective, although it tends to be the K-mixture form that is used even when only Katz (1996) is cited (e.g. Gao et al., XXXX, YYYY).

## The scope of demeaning and normalization

There is however a prior question relating to demeaning that we will pick up here: how should you normalize your data? The first aspect is whether you demean by row, by column, over the entire matrix, or by both rows and columns successively. These are four distinct possibilities (really five, but two are mathematically equivalent, though they introduce different numerical errors). The second aspect concerns whether you should scale the data, and a similar set of choices is available.

Demeaning is recommended: it is technically incorrect to apply SVD without it, and it reduces the compression error. While technically demeaning should be done for the vectors in focus, in practice we would like to consider both interpretations of the data, and a similar error reduction is achieved either way – we will call this orthogonal demeaning, and it may in certain cases be more convenient. Double demeaning (vector and orthogonal successively) reduces the error further, and while theoretically less justifiable it may be useful if it is desirable to reduce the error due to compression, and/or if the dual interpretation is also useful. Matrix demeaning involves subtracting the mean of the entire matrix from all entries and does not achieve any useful effect in general, amounting to a fairly meaningless translation in both primary and dual spaces.

The correct normalization for AAᵀ to be a covariance matrix is demeaning by row, since we are treating A as a set of row vectors.

More importantly, this is appropriate if we are analysing, clustering or visualizing row vectors, e.g. each row describes a document in terms of the words contained in it, in typical IR usage, and we are interested in the relationships between documents rather than between words. For AᵀA to be a covariance matrix we should demean by column, and this is appropriate if we are dealing with the column vectors, which corresponds to being more interested in the relationships between words than between documents.

The minimal normalization for AAᵀ or AᵀA to be a covariance matrix is thus demeaning, but this reduces the degrees of freedom for the vectors by one (and reduces the rank by one if A is square or we demeaned along the shorter dimension). Following up with standardization of A by dividing by the corresponding (row or column) standard deviation means AAᵀ or AᵀA is a correlation matrix, whilst the more usual normalization to unit vectors (sum of squares 1) maps vectors to the unit hypersphere.
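As a numerical illustration of the row case, here is a plain-Python sketch (helper names and the toy matrix are ours): after demeaning the rows of A, AAᵀ is a covariance matrix up to the scale factor M (the naive sample size), and after standardizing the rows it is a correlation matrix up to the same factor:

```python
def demean_rows(A):
    """Subtract each row's own mean from that row."""
    return [[x - sum(row) / len(row) for x in row] for row in A]

def standardize_rows(A):
    """Demean each row, then divide it by its (naive, /M) standard deviation."""
    out = []
    for row in demean_rows(A):
        sd = (sum(x * x for x in row) / len(row)) ** 0.5
        out.append([x / sd for x in row])
    return out

def aat(A):
    """Compute A * A-transpose for a list-of-lists matrix."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]
```

For A = [[1,2,3],[2,4,6]] the two rows are perfectly correlated, so aat(standardize_rows(A))/3 gives 1 in every entry, while aat(demean_rows(A))/3 recovers the covariances directly.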

L1 normalization (normalizing the sum of magnitudes to 1) by row or column without demeaning gives rise to conditional probabilities in the case of frequency data. However, L1 normalization, with or without demeaning, is arguably inappropriate for SVD processing, which is based on minimizing the sum of squares error.

Furthermore, L1 normalization after demeaning no longer admits a probabilistic interpretation.

L1 normalization of an entire undemeaned frequency matrix (so the sum of all elements is 1) gives rise to probabilities, but normalization of the entire matrix (L1 or L2) is usually of little value as biases such as document size are preserved.

This leaves the double demeaning, where we demean the rows then the columns. Demeaning by row means all rows now sum to zero and have zero mean. Thus following up by demeaning the columns means we are subtracting a row of column means with zero row mean from each row, so the row mean remains zero. Thus this procedure, or its column-then-row converse, leads to a matrix that has zero mean for all rows and columns, and formally the two approaches are equivalent.
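The equivalence claimed here is easy to verify numerically; a small sketch (plain Python, helper names ours):

```python
def demean_rows(A):
    """Subtract each row's own mean from that row."""
    return [[x - sum(row) / len(row) for x in row] for row in A]

def demean_cols(A):
    """Subtract each column's mean from that column."""
    means = [sum(col) / len(A) for col in zip(*A)]
    return [[x - m for x, m in zip(row, means)] for row in A]
```

Applying the two demeanings in either order gives a matrix whose rows and columns all have zero mean, and the two orders agree up to floating-point error.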

Geometrically, if the data is considered as an N×M matrix of N M-dimensional row vectors, demeaning by rows translates so that the centroid is at the origin, and distances of the vectors from the origin (lengths) reflect variance. Demeaning by columns projects onto the hyperplane Σi xi = 0, and the vectors reflect dissimilarity of the attribute dimensions.

L2-normalization of N vectors by scaling to unit variance or unit hypersphere means the sum of squares adds to N (naïve sample variance), N-1 (unbiased sample variance) or 1, but in any case that the vectors lie on the surface of a hypersphere. Each of these three steps, individually or together, costs a degree of freedom in the data, although L2-normalization is not linearly resolvable so does not affect rank (except to the extent it approximates L1-normalization).

Moreover, note that L1-normalization of non-negative attributes (e.g. frequency data to probabilities) projects them onto the hyperplane Σi xi = 1, so that subsequent demeaning by columns does not cost an additional degree of freedom but represents a simple translation to Σi xi = 0. On the other hand, L1-normalization after orthogonal demeaning does still lose the third degree of freedom. Since L1-normalization sets the row sum to 1, the row sum of the mean vector (centroid) is also 1, and standard vector demeaning following L1-normalization obviates any need for orthogonal demeaning, since the row mean sum of 1 is subtracted off each row sum of 1, giving a row mean of 0.
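This bookkeeping can be checked with a short sketch (plain Python, names ours): after row-wise L1 normalization every row and the centroid sum to 1, so vector demeaning (subtracting the centroid) leaves every row summing to 0 with no further orthogonal demeaning needed:

```python
def l1_normalize_rows(A):
    """Scale each non-negative row to sum to 1 (conditional probabilities)."""
    return [[x / sum(row) for x in row] for row in A]

def subtract_centroid(A):
    """Vector demeaning: subtract the mean row (centroid) from every row."""
    c = [sum(col) / len(A) for col in zip(*A)]
    return [[x - ci for x, ci in zip(row, c)] for row in A]
```

With toy frequency data, the row sums after L1 normalization are 1, the centroid's components also sum to 1, and every demeaned row sums to 0, as claimed.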

It is not possible in general to scale both columns and rows simultaneously to achieve L1 or L2 normalization, though in fact that effect is achieved by the SVD rotation and scaling process, since the original square versions of both U and V are L2 normalized by both row and column. Normalization by scaling row then column versus column then row is generally quite different, and the last normalization should therefore correspond to the vector normalization required for subsequent processing. However, the empirical investigation of Powers (1997) found that double normalization and matrix normalization gave no advantage or made things worse: on average they reduced the performance achievable in subsequent clustering. On the other hand, the study showed that L1 vector normalization helped more than any other normalization (in this case frequency data was used, and it factored out the size of the document), with L2 vector normalization second best (it also helped with size).

Our conclusion is that the double demeaning is potentially useful as in general it reduces the sum of squares variance represented by the S matrix of the SVD over what is achieved by either column or row demeaning alone, it is theoretically insensitive to the order in which it is performed, and it means that the SVD analysis may be interpreted validly for both columns and rows.

Vector normalization by linear scaling may be helpful if size is not regarded as a desirable factor, and indeed the principal eigenvector will generally correspond to size (and thus not be useful) if this normalization is not performed. However, after normalization one of the largest eigenvalues may still correspond to a size-correlated eigenvector (in IR this is largely due to the different lexicon sizes of large and small documents – compare a dictionary or an encyclopaedia to a personal home page or a newsgroup submission).

Geometrically, L2 normalization after demeaning places vectors on the unit hypersphere (circle in 2D; sphere in 3D) around the origin, whilst L1 normalization places vectors on a unit hyperhedron (diamond in 2D; octahedron in 3D). L2 normalization is more appropriate for a subsequent SVD processing step with its second order nature and is thus a reasonable post-demeaning normalization step, whilst L1 is the best way to normalize for size and is a highly appropriate pre-demeaning step, but in the IR context is unnecessary if documents are constrained to be the same size. However, for IR with unconstrained document size, the following steps are recommended:

1. L1-normalization of vectors (map to conditional probabilities, reducing freedoms).
2a. Optional non-linear transformation of vectors (e.g. map to conditional information).
2b. Optional orthogonal demeaning (translate by ∆xi = –1/M from Σi xi = 1 to Σi xi = 0).

3. Vector demeaning (mapping centroid vector to the origin, reducing freedoms).
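The recommended steps (1, 2b and 3; the optional non-linear step 2a is omitted) might be sketched as follows – a plain-Python illustration under our own naming, not the paper's code. Note that because vector demeaning follows, the optional orthogonal translation of step 2b changes nothing in the final result, as the preceding discussion predicts:

```python
def ir_preprocess(A, orthogonal=False):
    """Sketch of the recommended IR pipeline for unconstrained document sizes.
    Rows of A are non-negative document/term frequency vectors."""
    M = len(A[0])
    # Step 1: L1-normalize rows (frequencies -> conditional probabilities).
    A = [[x / sum(row) for x in row] for row in A]
    # Step 2b (optional): translate by -1/M, from sum(x) = 1 to sum(x) = 0.
    if orthogonal:
        A = [[x - 1.0 / M for x in row] for row in A]
    # Step 3: vector demeaning (subtract the centroid row).
    c = [sum(col) / len(A) for col in zip(*A)]
    return [[x - ci for x, ci in zip(row, c)] for row in A]
```

Running the pipeline with and without the orthogonal translation yields identical output (the constant shift is absorbed by the centroid subtraction), and every output row sums to 0.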