# Smoothing, Normalization, Correction and Reduction

David M.W. Powers, School of Informatics and Engineering, Flinders University of South Australia

The second kind of error is artefact. In IR, artefact may refer to some systematic bias in the data, for example the register and content bias introduced by the use of the Wall Street Journal as a source of information about English. In EEG processing, artefact refers to signal that originates from sources other than the brain, in particular ocular and other muscular contamination. Artefact can produce very strong effects corresponding to large eigenvalues, so these reductions cannot be made simply by discarding small eigenvalues; the corresponding components must be identified explicitly. In the case of EEG this is achieved more effectively by using blind signal separation (BSS) techniques such as ICA.

Other kinds of error, which produce very small artefacts, include those that result from the finite precision used in the matrix calculations; these should always be removed, as they really represent zero variance.

Where eigenvectors are removed that correspond to noise, artefact or error, whether the eigenvalues are big or small, the sum of squared eigenvalues actually represents the net reduction in error, and this is routinely reflected in improved results in subsequent processing. Factor analysis can be understood in terms of subtracting an estimate of the noise distribution from the covariance at an intermediate point in the SVD process and will be discussed below.
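The claim that the discarded eigenvalues account exactly for the reduction in sum-of-squares error can be checked numerically. The sketch below (plain numpy; the matrix sizes and noise level are arbitrary toy values) truncates an SVD and confirms that the squared Frobenius reconstruction error equals the sum of the squared singular values that were dropped:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data matrix: a rank-3 "signal" plus a small amount of "noise"
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20)) \
    + 0.01 * rng.normal(size=(50, 20))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3  # keep only the k largest singular values
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The squared Frobenius norm of the discarded part equals the sum of
# the squared singular values that were dropped (Eckart-Young).
err = np.linalg.norm(X - Xk, 'fro') ** 2
assert np.isclose(err, np.sum(s[k:] ** 2))
```

The same accounting applies whichever criterion selects the components to drop: the net change in squared error is always the sum of the squared singular values removed.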

Unfortunately, normalization is often neglected or incorrectly performed, which subverts the correct application of the theory, and an unnecessary error may be retained or introduced by dimension reduction. This is particularly true in the IR application and the commonly espoused application of SVD in LSI, so we treat this example in detail.

Normalization in our usage includes demeaning as well as dividing by column or row sums or other normalization factors, but should sometimes also include distortion of the original data by the inverse of some non-linear distorting process.

Where different transformations are introduced prior to application of SVD, the interpretation of the noise that is eliminated will change correspondingly. We will discuss several different transformations below from other perspectives, but at this point we will characterize the effect of subtractive filtering for several methods.

Simple demeaning has the effect of removing an additive bias and is optimal in the second order sense of minimizing the sum of squared error given the distribution is sufficiently normal. The effect and advantages of subtracting other measures of central tendency are discussed later, but when applying SVD the arithmetic mean must be zero in order for the squared eigenvalues to represent sum of squares error. We will also consider below the effect of subtracting the covariance matrix of a noise distribution from the covariance matrix of our original data during Factor Analysis.
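A minimal numpy sketch of the effect of demeaning (toy data with a deliberately large additive bias): without centring, the leading singular vector simply points at the mean, whereas after centring the squared singular values partition the total (mean-corrected) sum of squares:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)) + 10.0     # strong additive bias on every column

Xc = X - X.mean(axis=0)                  # remove the additive bias (demean)

# Without demeaning, the leading right singular vector just points at the mean:
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
assert abs(Vt_raw[0] @ mean_dir) > 0.95

# After demeaning, the squared singular values sum to the total
# (mean-corrected) sum of squares, i.e. n times the total variance:
s = np.linalg.svd(Xc, compute_uv=False)
assert np.isclose(np.sum(s ** 2), np.sum(Xc ** 2))
assert np.isclose(np.sum(s ** 2), len(X) * np.sum(Xc.var(axis=0)))
```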

In the frequency domain, for example after applying the Fast Fourier Transform (FFT) to a speech signal, subtracting a measure of central tendency or a noise distribution corresponds to eliminating background noise or an uninformative carrier by eliminating its characteristic noise signature. For example, if samples are taken of background airconditioning and computer-noise, this has a characteristic frequency distribution that can be directly subtracted. For speech, centring on the fundamental and normalizing the variance cancels out basic frequency characteristics due to sex. If in the frequency domain we take the log and similarly subtract a measure of central tendency or a noise distribution, the effect is quite different. In particular for an acoustic signal, this approach can cancel out basic convolutive effects due to multiple paths and echo or reverberation. These are standard steps in signal/speech processing.
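These two subtractive steps can be sketched as follows (a toy numpy example, with an arbitrary tone, noise level and echo path standing in for real speech): the additive background is reduced by subtracting an averaged magnitude spectrum, while a fixed convolutive channel is cancelled exactly by subtracting the mean log-spectrum over frames:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1024
signal = np.sin(2 * np.pi * 50 * np.arange(n) / n)   # a pure tone at bin 50
noise = 0.3 * rng.normal(size=(21, n))               # background noise samples

# Additive background: estimate its magnitude spectrum from separate
# noise-only samples and subtract it from the noisy spectrum.
noisy = signal + noise[0]
noise_profile = np.abs(np.fft.rfft(noise[1:], axis=1)).mean(axis=0)
spec = np.fft.rfft(noisy)
mag = np.maximum(np.abs(spec) - noise_profile, 0.0)  # spectral subtraction
cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n)

# Convolutive distortion (multiple paths / echo): a fixed channel H becomes
# an additive log|H| per bin in the log-spectral domain, so subtracting the
# mean log-spectrum over frames (cepstral mean subtraction) cancels it.
frames = rng.normal(size=(30, n))                    # frames of a source
H = np.abs(np.fft.rfft(np.r_[1.0, 0.0, 0.5, np.zeros(n - 3)]))  # path + echo
log_chan = np.log(np.abs(np.fft.rfft(frames, axis=1)) * H)
log_clean = np.log(np.abs(np.fft.rfft(frames, axis=1)))
# After mean subtraction the channelled and channel-free log-spectra agree:
assert np.allclose(log_chan - log_chan.mean(axis=0),
                   log_clean - log_clean.mean(axis=0))
```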

These last examples illustrate model-based elimination of noise prior to or during the SVD process, whereas the previously discussed reduction of the decomposition is hypothesized to eliminate certain kinds of noise subsequent to the decomposition and may be model-based (a priori), significance-based (eliminating the least significant eigenvectors), accuracy-based (eliminating eigenvectors whose contribution is less than the accuracy of calculation), or empirically-based (a posteriori elimination of eigenvectors that are associated with identifiable sources of error or artefact).

## IDF, TFIDF and Entropy

We now consider the most common preprocessing technique used in IR, TFIDF (Term Frequency * Inverse Document Frequency). There are a number of common variations of TFIDF that relate to smoothing (e.g. adding one to all values or to zero frequencies) and normalization (e.g. normalizing TF by the frequency of the most common stem in a document rather than by the length of the document). We will ignore these considerations, along with stemming, stopping and other processes unique to IR.

The two basic rates that underlie TFIDF are Term Frequency (TF) and Document Frequency (DF), expressed relative to a corpus of N documents of average size D, individual size Dd and lexicon size L. If TCw represents the number of occurrences of a specific term w in the corpus, then TFw = TCw/N is its expected relative term frequency, and TCwd and TFwd = TCwd·D/Dd represent its count and normalized relative term frequency, respectively, in the specific document d, with the expected value of TFwd under the assumption Dd = D being TFw = ED(TFwd).

Similarly if DCw represents the number of documents w occurs in, DFw = DCw/N is its Document Frequency.
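As a concrete illustration of these definitions, the toy corpus below (the documents themselves are invented; variable names follow the text) computes TFw, DFw and the size-normalized TFwd:

```python
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
]
N = len(docs)                         # number of documents
D = sum(len(d) for d in docs) / N     # average document size

w = "cat"
TC = sum(d.count(w) for d in docs)    # corpus count of w
DC = sum(w in d for d in docs)        # number of documents containing w
TF = TC / N                           # expected relative term frequency of w
DF = DC / N                           # document frequency of w

# Per-document term frequency, normalized to a typical document of size D:
TFd = [d.count(w) * D / len(d) for d in docs]
```

Here TF = DF = 2/3 because "cat" happens to occur at most once per document; TF and DF diverge as soon as terms recur within documents.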

Note that TFw and TFwd are both assumed to be rates relative to a typical document of size D and we have made explicit an assumption that is both ubiquitous and iniquitous, but typically only implicit, in the IR literature – namely that all documents are roughly the same size D.

This is clearly not true in most domains, but it can be made truer by dividing documents up into segments or contexts of approximately the same size, choosing D to represent the typical size for topicality of a term. This may be regarded as an assumption that all terms tend to have bursts according to the same probability distribution (again not true), or used to help define what we mean by topicality and thus assist in parameterizing a more accurate model of the distribution.
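Such segmentation can be as simple as slicing the token stream into consecutive D-term windows. The helper below is a hypothetical sketch (the default D = 250 is an arbitrary value in the typical range):

```python
def contexts(tokens, D=250):
    """Split a token stream into consecutive D-term contexts.

    The final context may be shorter than D; in practice it could be
    dropped or merged to keep context sizes approximately equal.
    """
    return [tokens[i:i + D] for i in range(0, len(tokens), D)]

chunks = contexts(list(range(1000)), D=250)   # 1000 tokens -> 4 contexts
```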

The Information conveyed by DFw is IDFw = -log2(DFw), the misnamed Inverse Document Frequency, being the number of bits required to efficiently represent the fact that term w occurs in a document. This is a useful measure in its own right (Spärck Jones, 1972); it has a Bayesian justification as well as the information-theoretic motivation given here (Robertson, 2004), and captures the far greater importance of the fact of occurrence in a document versus the precise count or (relative) frequency. Note that much of the work using these measures assumes Dd ≈ D is (approximately) constant, and several corpora (e.g. Brown, BNC) use extracts of (approximately) constant size rather than full documents. It is also misleading to think of documents as being about various topics: different parts of a document may touch on, or even spend considerable time on, a topic and then move on to another. It can thus be useful to talk about contexts rather than documents, and these cannot in general be assumed to be of constant size either. We can however select extracts of constant size that relate to a specific topic.

We define TFIDFwd as TFwd*IDFw. As an entropy estimate this represents the contribution of word w to the size of the document when each occurrence is represented in IDFw bits, but this is not an efficient representation given words actually tend to occur much more than once in the documents they occur in. Thus it is somewhat surprising that TFIDF should for so long have been regarded (and empirically demonstrated) as superior to all other tested weighting schemes (Aizawa, 2003; Robertson, 2004; R&Y text, XXX).
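The computation itself is trivial; the sketch below uses invented counts for a single term and document to show IDF as a bit cost and TFIDF as the resulting weighted contribution:

```python
import math

# Hypothetical counts for one term w and one document d (toy values):
N = 1000          # documents in the corpus
DC = 50           # documents containing w
TFwd = 3.0        # (size-normalized) count of w in d

DF = DC / N
IDF = -math.log2(DF)          # bits to encode "w occurs in this document"
TFIDF = TFwd * IDF            # w's contribution, at IDF bits per occurrence
```

With these numbers IDF = log2(20) ≈ 4.32 bits, and each of the three occurrences is charged the full per-document cost, which is exactly the inefficiency noted above.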

## The K-mixture and an improved entropy estimate

Church & Gale (1995), pre-empting Katz (1996), proposed a model dealing with both the probability that a document is a relevant context for a word (α) and the rate of occurrence of a term in a relevant context (β).

This formulation of the Katz model, the K-mixture, represents the probability of a term occurring k times in a document as

Pw(k) = (1–α)·0^k + (α/[β+1])·(β/[β+1])^k (25)

where 0^k is 1 if k = 0 and 0 otherwise.

Setting the parameters α and β for a term w may be done empirically, but it is usual to relate them to TF and DF. The rate of occurrence of a term w in a relevant context (β) may be estimated from the rate of occurrence in documents in which it is known to occur (γ = TFw/DFw), but γ would be a significant overestimate for β since these contexts are already known to contain one occurrence. Thus it is usual to set β = γ – 1, although this now probably underestimates β, as the context is reduced by one word and there is usually a minimum distance before a term recurs. According to this definition, β is set from the rate of additional occurrence in documents the term is known to occur in (β = [TFw – DFw]/DFw). We may now note that Pw(0) = 1 – DFw is the probability of a null context, being the sum of the independent cases of irrelevant contexts and null relevant contexts, from which we can show that α = TFw/β.
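These estimates can be verified numerically. The sketch below (toy values for TFw and DFw; valid provided γ > 1 so that β > 0) builds the K-mixture from the two rates and checks that it is a proper distribution whose mean is TFw and whose probability of at least one occurrence is DFw:

```python
# Toy corpus rates for one term; the formulas follow the text.
TFw = 0.9         # expected occurrences per (D-sized) document
DFw = 0.3         # fraction of documents containing the term

gamma = TFw / DFw             # occurrences per document known to contain w
beta = gamma - 1.0            # *additional* occurrences per such document
alpha = TFw / beta            # probability a document is a relevant context

def P(k):
    """K-mixture: probability the term occurs k times in a document."""
    spike = (1.0 - alpha) if k == 0 else 0.0
    return spike + (alpha / (beta + 1.0)) * (beta / (beta + 1.0)) ** k

total = sum(P(k) for k in range(200))     # should be ~1 (tail is geometric)
mean = sum(k * P(k) for k in range(200))  # should recover TFw
```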

Note that Equation 25 uses γ as a denominator discounting both α and β, and the corresponding terms η = (α/[β+1]) and δ = (β/[β+1]) = [TFw – DFw]/TFw both have intuitive interpretations with η being the rate of occurrence of relevant contexts with no occurrences of the term (seen by setting k=0) and δ being the probability of an (additional) occurrence in a relevant document (seen by setting k=k+1). The function of γ here can also be understood in terms of a change from average rates in terms of documents to average rates in terms of individual words.

Note that the rate of occurrence of a relevant context (α) may be estimated from the rate of occurrence of documents containing the term (DFw), although this again would be an underestimate as there are in general relevant documents in which the term could be expected to occur but it does not (perhaps a synonym is used). The correction factor is the ratio of all relevant contexts to non-null contexts which is equivalent to the ratio of all occurrences to extra occurrences, 1/δ – as the rate of extra occurrences is assumed to be the same in a relevant context irrespective of whether it is non-null or null (viz. whether or not w actually occurs in that context).

We now re-express the K-mixture (25) as the more compact P-mixture (26) which is parameterized directly by two probabilities, showing the interrelationships compactly in terms of the derived rates and their interpretations as probabilities and expectations. Note that α is expressed as the sum of the geometric progression formed by the second term of the mixture summed over all k. Similarly β and β+1=γ, which are expectations rather than probabilities, are eschewed in this formulation in favour of (conditional and joint) probabilities δ, ε and η, allowing the P-mixture formula and other identities to be verified directly by multiplication of probabilities and conditional probabilities or expectations.
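The equivalence of the two parameterizations, and the geometric-progression identity for α, can be confirmed directly (toy rates again; the function names are mine):

```python
TFw, DFw = 0.9, 0.3           # toy corpus rates for one term

# K-mixture parameters
gamma = TFw / DFw
beta = gamma - 1.0
alpha = TFw / beta

# P-mixture probabilities
delta = (TFw - DFw) / TFw     # P(an additional occurrence | relevant context)
eta = alpha * (1.0 - delta)   # P(relevant but null context)

def P_K(k):                   # K-mixture form (25)
    return ((1 - alpha) if k == 0 else 0.0) \
        + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k

def P_P(k):                   # P-mixture form (26)
    return ((1 - alpha) if k == 0 else 0.0) + eta * delta ** k

# alpha is the geometric series sum over k of eta * delta**k:
assert abs(eta / (1 - delta) - alpha) < 1e-12
# and the two mixtures agree term by term:
assert all(abs(P_K(k) - P_P(k)) < 1e-12 for k in range(20))
```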

Identities 27 to 35 relate to the usual parameterization of the K-mixture known as the Katz model. We emphasize the pivotal role in the mixtures of the ratio γ (the expected number of occurrences of w in a D-term context that is known to contain w). The ratio β (the expected number of occurrences of w in a D-term context that is known to be relevant to w) and the normalized ratio δ (the estimated probability of an occurrence of w in a D-term context that is known to be relevant to w) depend solely on γ. The reciprocal of γ, the ratio ε, is directly interpretable as the probability of non-occurrence in a relevant context and is thus preferred. ζ is introduced (32) as the solution of the quadratic that shows how we can derive γ from TFw and η, and takes on the pivotal role in an improved parameterization we introduce subsequently. λ and µ (33 & 34) are conventionally used in Poisson models based on term frequency and document frequency, but we discount these by β to derive the probabilities α and η respectively.

Assumptions. We make explicit the assumption of the K-mixture that documents may be approximated as being of fixed size D that represents the expected scope of relevance of a term. D is typically regarded as being of the order of 200 to 500 terms, or 1000 to 2000 tokens. According to Zipf’s law the top 150 words account for 50% of the tokens in a corpus, and these are largely grammatical function words (closed class words). In addition, at least 50% of the so-called content words (open class words) are also very general (generic), and this leaves 20 to 25% of tokens as topical terms. Thus we prefer to refer to D-term contexts, rather than documents, to emphasize that relevance (topicality) is local, that the probability PD depends on D, and that the large-D assumption behind a Poisson distribution does not apply in a D-term context.

We also make explicit the assumption that the fact that a document d contains a term w implies that w is relevant to d (but not vice-versa), express this as d ∋ w → d Я w, and generally omit d Я w (relevant for) when it is implied by d ∋ w (contains).

Pw(k) ≡ (1–α)·0^k + α[1–δ]·δ^k = (1–α)·0^k + η·δ^k = (1–η/ε)·0^k + η·(1–ε)^k

α ≡ λ/β = µ/δ = η/ε = γη = µ + η = PD(d Я w)

β ≡ γ – 1 = ζ – ½ = [λ–µ]/µ = µ/η = δ/[1–δ] = ED(TFwd | d Я w)

γ ≡ β + 1 = ζ + ½ = λ/µ = α/η = 1/[1–δ] = ED(TFwd | d ∋ w)

δ ≡ [ζ–½]/[ζ+½] = µ/α = β/γ = βε = 1 – ε = [λ–µ]/λ = PD(d ∋ w | d Я w)

ε ≡ 1/γ = 1 – δ = µ/λ = PD(d ∌ w | d Я w)

ζ ≡ [λ/η + ¼]^½ = γ – ½ = β + ½

η ≡ µ/β = α/γ = αε = α[1–δ] = PD(d ∌ w & d Я w)

λ ≡ TFw = αβ = γµ = αγδ = βγη = ED(TFwd)

µ ≡ DFw = αδ = βη = αβ/γ = αβ[1–δ] = αβε = PD(d ∋ w) = PD(d ∋ w & d Я w)

An improved term entropy estimate based on the above assumptions and model may thus be defined (36), noting that the model is defined by any two of the mutually independent parameters (viz. at most one of the pairwise dependent β, γ, δ, ε and ζ).
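The identity table can be spot-checked numerically. The sketch below derives every rate from toy values of λ = TFw and µ = DFw and verifies a representative sample of the identities:

```python
import math

lam, mu = 0.9, 0.3            # toy values of TFw (lambda) and DFw (mu)

gamma = lam / mu              # expected occurrences given the term occurs
beta = gamma - 1.0            # expected *additional* occurrences
alpha = lam / beta            # P(relevant context)
delta = 1.0 - 1.0 / gamma     # P(occurrence | relevant context)
eps = 1.0 / gamma             # P(non-occurrence | relevant context)
eta = mu / beta               # P(relevant but null context)
zeta = gamma - 0.5            # solution of the quadratic relating gamma to eta

# A few of the identities from the table, checked directly:
assert math.isclose(alpha, mu / delta)
assert math.isclose(alpha, eta / eps)
assert math.isclose(alpha, mu + eta)
assert math.isclose(delta, (lam - mu) / lam)
assert math.isclose(zeta, math.sqrt(lam / eta + 0.25))
assert math.isclose(mu, alpha * delta)
```

Any two independent parameters (here λ and µ) suffice to reproduce the whole table, as the closing sentence above states.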