# Smoothing, Normalization, Correction and Reduction

David M.W. Powers, School of Informatics and Engineering, Flinders University of South Australia

| Name | f | φ | κ | γ (where γo = TFw/DFw) |
|------|---|---|---|------------------------|
| K-/γP-mixture | 0 | ∞ | ∞ | γ = γo |
| κP-mixture | µ/κ | κρ | κ | γo + φ/µ = (λ−φ)/µ |
| βP-mixture | µ | ρ | 1 | β = γo + 1 |
| ζP-mixture | µ/2 | 2ρ | 2 | ζ = γo + ½ |
| µP-mixture | µ | ρ | 1−µ | γo + 1 + ρ = (λ−ρ)/µ |

These models, and the corresponding corrections to just the TF=DF case, may all be tested empirically by modelling on one (sub)corpus and validating on another. Such validation may include direct term fitting, or practical application of the corrections to some application, such as the information retrieval search domain.
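As a concrete baseline for such fit-and-validate experiments, the Katz K-mixture can be estimated from corpus totals alone. The sketch below uses the standard moment estimates β = (TF − DF)/DF and α = TF/(Nβ); these particular estimators, and the toy counts, are assumptions for illustration rather than values taken from the table above.

```python
def kmixture_params(tf, df, n_docs):
    """Moment estimates for the Katz K-mixture: beta is the average number of
    extra occurrences per document containing w, and alpha is chosen so that
    the model mean alpha * beta equals the corpus rate tf / n_docs."""
    beta = (tf - df) / df
    alpha = tf / (n_docs * beta)
    return alpha, beta

def kmixture_pmf(k, alpha, beta):
    """P(k) = (1 - alpha)*[k == 0] + alpha/(beta + 1) * (beta/(beta + 1))**k."""
    p = alpha / (beta + 1) * (beta / (beta + 1)) ** k
    return p + (1 - alpha) if k == 0 else p

# Fit on one (sub)corpus ...
alpha, beta = kmixture_params(tf=500, df=200, n_docs=10_000)
# ... then validate on another by comparing predicted and observed P(k).
pred = [kmixture_pmf(k, alpha, beta) for k in range(200)]
mean = sum(k * p for k, p in enumerate(pred))
print(round(sum(pred), 6), round(mean, 6))   # ~1.0 and ~0.05 = tf / n_docs
```

The check at the end confirms the mixture is a proper distribution whose mean recovers the observed corpus rate; held-out validation would compare `pred` against the empirical P(k) of the second corpus.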

The true Katz models, G = G′′ and G′, and our generalization, G{m}, depend on a concept of relevance or topicality that requires bursts of more than one occurrence (viz. k ≥ 2) to define topicality in the G model; in G{m} this generalizes to k ≥ m. In our smoothing process for first-order models, P0 = δ0 = µ does not change, and P1 = δ can be defined directly from the above table as δ = 1 − 1/γ. To generalize this to higher-order models we distinguish two possibilities.

The first possibility is to adjust only Pm = δm ≈ Cm, on the basis that all k ≥ m are equally indeterminate in defining relevance, although some contexts may nonetheless be topical without exhibiting the expected additional recurrences. This may be done analogously to the above, using µm, the probability of topicality (k ≥ m occurrences), in place of µ, and δm, the conditional probability of recurrence in a topical context (with k ≥ m occurrences), in place of δ. Note that the geometric series defined by (1 − δm)δm^(k−m) sums to unity over k ≥ m, so the total contribution of the final term of G{m} is independent of the value of δm; adjusting δm therefore requires no concomitant adjustment of δk for k < m to ensure that the sum of Pw(k) over all k is unity, as required for a probability distribution. This is the first approach we consider.
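The normalization claim is easy to verify numerically: whatever value smoothing assigns to δm, the geometric tail carries the same total mass, so the probabilities below the threshold need no renormalization. A minimal check (the truncation point `kmax` is an arbitrary numerical choice):

```python
def tail_mass(delta_m, m, kmax=2000):
    """Mass of the geometric tail (1 - delta_m) * delta_m**(k - m) for k >= m,
    truncated at kmax for numerical purposes."""
    return sum((1 - delta_m) * delta_m ** (k - m) for k in range(m, kmax))

# Whatever value smoothing assigns to delta_m, the tail sums to one,
# so the P_k for k < m need no concomitant renormalization.
for d in (0.2, 0.5, 0.9):
    print(tail_mass(d, m=3))   # each ~ 1.0
```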

The second possibility is to adjust all Pk for 0 ≤ k ≤ m. This requires additional assumptions, and potentially a whole family of m constants κk and entirely separate parameterizations δk(m) for each G{m}. The obvious way of avoiding the first of these problems, the explosion of new assumptions, is to assume that the probability of non-occurrence in a relevant context is the same whether or not the k ≥ m condition for topicality has formally been met, and is independent of k. This generalizes the assumption that led us to define the µP-mixture using f = µ, and hence φ = ρ and κ = 1 − µ.

The second of these problems, with its potential for overfitting, can be dealt with by defining the parameters δn sequentially according to approach one, for the desired n = m as well as for all n < m, and incorporating them into a variant of our second approach. This may, however, tip the balance to the opposite extreme and lead to a parameterization that is too smooth; but both the first approach to smoothing and the original G{m} model are more likely to be overfitted to the training data, even though the definition of Pm is already an average over k ≥ m. Note that empirically Pw(k) is rarely strictly monotonic for a given w.

The optimal value of m may be a function of D, the typical document size, as this imposes a limit on k, viz. k ≤ D. Generally, we prefer a model that does not tune κ, f or φ to specific w, and where m is a function of D or is selected by observing when the order n < m fit leads to δm ≈ δn.

The assumptions and notation underlying the unsmoothed G{m} model may be summarized as

d ∋ w ↔ k ≥ m → w Я d;  Pk = Cm = Pm for k ≥ m.

That is, we define membership (d ∋ w) as having a threshold equal to the order of the model, with membership implying relevance but not relevance implying membership, and we also treat the average conditional probability of additional occurrences Pk as a constant Cm for k ≥ m.

The additional assumptions generalizing the µP-mixture approach to G{m} are

Pw(w Я d | k=0) ≡ Pw(k=m | k≥m & w Я d) = µm.

That is, the probability of a context being relevant despite seeing no occurrences is equated with the probability of finding no further occurrences after seeing m. The first approach uses this only when m is the order of the model, whilst the second approach uses it for any m ≤ n, where n is the order of the model desired. These are thus applied only to determine Pm in our first-level approach to smoothing, giving G[m], but in our final variant we co-opt the Pm (m < n) values of the low-order models with first-level smoothing as the Pk used to define our fully smoothed model G(n). Variants corresponding to the κP-mixture may be used in place of the assumption of identity (70), leading to models we denote Gκ[n] and Gκ(n), but as discussed above the latter may require a whole family of additional parameters κk dependent on µ and possibly m.
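The quantity µm = P(k = m | k ≥ m & relevant) has a direct empirical reading: among documents treated as topical (k ≥ m occurrences), the fraction with exactly m occurrences. The estimator below is one plausible sketch under that reading, not the paper's own procedure, and the counts are invented.

```python
def mu_m(doc_counts, m):
    """Estimate mu_m = P(k = m | k >= m) from the per-document counts of a
    word, reading k >= m as the operational test of topicality."""
    topical = [k for k in doc_counts if k >= m]
    if not topical:
        return None                      # no topical documents observed
    return sum(1 for k in topical if k == m) / len(topical)

counts = [0, 0, 1, 2, 2, 3, 5, 0, 2, 4]  # toy per-document frequencies
print(mu_m(counts, m=2))   # 6 topical documents, 3 with exactly k=2 -> 0.5
```

Running the same estimator for each m ≤ n yields the family of Pm values that the second approach co-opts from the lower-order models.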

## Modifications to the term entropy model

The TFIDF and TE models involve a log factor, so we investigate whether it can be made sparsity-preserving by applying an orthogonal demeaning (the log factor and log term depend on w only, whilst the linear factor depends on both w and d). IDF is not appropriate for sparse orthogonal demeaning, as it is constant for a given word and thus simply weights the binary occurrence vectors in the word vectors of the dual formulation.

We will also see that TF can be made sparsity-preserving during demeaning.

Orthogonal demeaning in IR means taking the mean across documents for each word, notwithstanding that we are primarily interested in using words as signature vectors to characterize documents, and would thus normally demean those vectors. We therefore swap to thinking in terms of comparing the similarity of words, with their similarity of sense measured by the similarity of their signature vectors of document occurrence information. The first step is to note that the −log(η) term in the TE model is constant for a given word and is thus preserved in the mean of the corresponding signature vector. This is similar to the IDF binary occurrence weighting and should similarly be retained rather than demeaned, in order to distinguish typical occurrences from non-occurrences. This distinguishing term is missing from the TFIDF model, and some such term would need to be introduced; the TFIDF model will thus become more like the TE model if we make it sparsity- and occurrence-preserving, with the obvious modification being to add in an IDF term (cf. equation 37). This is equivalent to using an 'add one' correction on TF: TF1IDF = (TF+1)*IDF.
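A minimal sketch of the 'add one' correction, storing only the occurring entries so that the zero cells, and hence sparsity, are never materialized. The toy counts and document ids are invented for illustration.

```python
import math

# Toy sparse term-document counts: word -> {doc_id: tf}.
counts = {
    "train": {0: 3, 2: 1},
    "dog":   {1: 4, 2: 2},
    "the":   {0: 9, 1: 7, 2: 8},
}
n_docs = 3

def tf1idf(counts, n_docs):
    """TF1IDF = (TF + 1) * IDF, assigned only where TF > 0: the correction
    lifts occurring cells without touching the absent (zero) ones."""
    out = {}
    for w, docs in counts.items():
        idf = math.log(n_docs / len(docs))   # IDF_w, constant for a given word
        out[w] = {d: (tf + 1) * idf for d, tf in docs.items()}
    return out

weights = tf1idf(counts, n_docs)
print(weights["the"])   # IDF = log(3/3) = 0, so ubiquitous terms vanish
```

Because only the keys already present in `counts` are ever written, the representation stays exactly as sparse as the raw counts.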

The IDF factor in (TFwd+1)*IDFw and the −log(δ) factor in TE are constants (given w), and the mean over the full set of documents of TFwd is TFw. Thus the orthogonal means are (TFw+1)*IDFw and −log(δ)*TFw − log(η) respectively. These means represent the average document, in the sense that its distribution of terms resembles the corpus as a whole. However, such a document would typically be encyclopaedic, and thus not representative in size (number of words in the sense of tokens, irrespective of type) or lexical coverage (number of distinct terms or types). Note, though, that we have actually assumed all documents are the same size or are normalized to equivalent size (in tokens), which also limits lexical coverage L (in types). Indeed, the fact that most actual documents are missing many terms (types) indicates that they are not representative, and the sign of the deviation from the orthogonal mean tells us how much more or less than expected a word occurred in a document.

If we average only over the documents that contain a term, then the average of TFwd is TFw/DFw, since the term occurs in only DFw/N of the corpus documents. This we recognize as the pivotal ratio γ in the K-mixture, ζ in the κ=2 ζP-mixture, or β in the κ=1 βP-mixture. This leads to orthogonal means of (TFw/DFw+1)*IDFw and −log(δ)*TFw/DFw − log(η) for our respective models.
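Averaging only over occurring documents also suggests a sparsity-preserving form of the demeaning itself: subtract γo = TFw/DFw from the nonzero entries alone, leaving absent documents at zero. The sketch below illustrates the idea on invented counts; it is not the paper's code.

```python
# Toy sparse counts: word -> {doc_id: tf} for the documents it occurs in.
counts = {
    "train": {0: 3, 2: 1},
    "dog":   {1: 4, 2: 2},
}

def demean_occurring(counts):
    """Subtract gamma_o = TF_w / DF_w (the mean over occurring documents only)
    from the nonzero entries, so the signature vectors stay sparse."""
    out = {}
    for w, docs in counts.items():
        gamma_o = sum(docs.values()) / len(docs)   # TF_w / DF_w
        out[w] = {d: tf - gamma_o for d, tf in docs.items()}
    return out

demeaned = demean_occurring(counts)
print(demeaned["train"])   # {0: 1.0, 2: -1.0}
```

The sign of each stored deviation then says whether the word occurred more or less than expected in that document, exactly as described above for the dense case.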

The vector mean, TFd, corresponds to the typical (but non-specific) word frequencies of documents for non-stopwords, and is not likely to vary much with the topic of the document; rather it reflects the nature of the document (encyclopaedia, dictionary, paper, story, letter, etc.). We thus drop the subscript and treat TF as a corpus or language constant. Words that occur more or less frequently than this are expected to be more or less significant under the implausible TF model. In TFIDF or TE (Eqn 37) this is harder to characterize, as both factors depend on w and one depends on d as well.

However, if we do a double demean to avoid the dependence on d, we get an average value for TFd that we define as the constant TF, which for all variants of the TFIDF and TE models is a simple document-independent average of a function of our model parameters.

## Intuitions about document topicality

Suppose all our documents are of size D, and we have a document that topically features word 1 (train, say) at some high level f11 = D*p11, and a second document that topically features word 2 (dog, say) at some high level f22 = D*p22. We further assume no occurrences of word 1 in document 2 or vice versa. If we simply appended the documents, the new document would have size 2D, with fs1 = f11 = D*p11 = 2D*ps1, etc. The absolute frequency of a term in the combined document is the same as in the respective original document, but its probability is halved, and the expected interval between occurrences is doubled (although this is misleading, as all instances of word 1 will be in the first half and word 2 in the second half). This combined document is moreover no more useful to us than the separate documents; this we call encyclopaedic combination, where words are topical in separate entries but the overlaps and relationships are not in focus.
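The arithmetic of encyclopaedic combination is worth making explicit. With illustrative values for D and p11 (both invented), appending preserves absolute frequency while halving probability and doubling the expected interval:

```python
D = 1000                      # fixed document size (an illustrative value)
p11 = 0.05                    # word 1's probability in document 1
f11 = D * p11                 # 50 occurrences of 'train' in document 1

# Appending the two documents: size doubles to 2D, counts are unchanged.
f_s1 = f11                    # absolute frequency is preserved ...
p_s1 = f_s1 / (2 * D)         # ... so the probability halves
interval_before = 1 / p11     # expected gap between occurrences
interval_after = 1 / p_s1     # ... which therefore doubles

print(p_s1, interval_after)   # 0.025 40.0
```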

If we now summarized the combined document down to size D, it is unlikely that we would halve the frequencies of the two terms, given that these remain topical as intended, but nor is it likely that we could maintain the full frequency of occurrence, given the independence of the topical domains as discussed in the documents. The act of summarization will furthermore not make the document more relevant, which would require the overlaps to be discussed, although its smaller size may make it more useful; and even though size is factored out, this additional density may be emergent, which is one reason why normalizing by the frequency of the most frequent term (content word) in a document can be useful for TFIDF, rather than mere document size (whether in words or terms). On the other hand, such a measure will also reflect changes in style, genre and register (e.g. age or vocabulary of author or audience, length of sentences, use of anaphora, use of hyponyms, hypernyms, technical terms and circumlocutions).
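Max-term normalization of the kind alluded to here can be sketched in a few lines. The damped variant in the comment is a common alternative from the TFIDF literature, and the toy summary counts are invented:

```python
def maxtf_normalize(doc_tf):
    """Normalize term frequencies by the document's most frequent (content)
    term rather than by document length; common variants add damping,
    e.g. 0.5 + 0.5 * tf / max_tf. A sketch, not the paper's formula."""
    max_tf = max(doc_tf.values())
    return {w: tf / max_tf for w, tf in doc_tf.items()}

summary = {"train": 10, "dog": 5, "ride": 2}   # hypothetical dense summary
print(maxtf_normalize(summary))   # {'train': 1.0, 'dog': 0.5, 'ride': 0.2}
```

Because the divisor tracks the densest topical term rather than raw length, the emergent density of a good summary survives the normalization.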

We could make the documents more relevant by adding new material that discusses relationships between the focus terms, but to do that we would have to drop existing material, perhaps replacing irrelevant examples for one term with examples that are in the topic domain of the other (e.g. taking a dog on the train instead of a bicycle, training a dog instead of a rat). If there is implicit topical relevance in the other document, but it is not obvious due to the use of a paraphrase, synonym, hyponym or hypernym, summarization could make the relevance more apparent without changing its actual relevance (e.g. taking a dog on the train instead of an animal, training a dog instead of an Alsatian). This problem can be accommodated by using either implicit (e.g. LSA/LSI) or explicit (e.g. WordNet) semantic matching.

## The Frequency and Time domains

So far we have assumed that we are dealing with frequencies to the exclusion of any consideration of time (viz. the order of words), and using a fairly large window to determine the frequencies (a document of size D, which is typically of the order of 2000 words or more, and sometimes as large as an entire chapter or even a book). For many purposes a smaller context is more appropriate, such as a clause, sentence or paragraph. In inducing syntactic information, some knowledge of word order is essential, and often this information is determined over small contexts of a dozen words or less. For speech processing, the contextual information is often very limited and is stored in the form of a Hidden Markov Model derived from N-gram statistics over sequences of N units (classically words). Models of the same order are also used in various compression schemes, and such models have been used for statistical or semantic modelling since the 1960s (Wolff), as well as in the more general Minimum Message Length and Minimum Description Length approaches to theory evaluation. These have the advantage that sufficient information is retained to reconstruct a message indistinguishable from the original, whilst maximizing the information that is implicitly stored in the compression model. In speech processing, this level of physiological indistinguishability includes retaining all the information a human hearer would use to identify the speaker and distinguish characteristics such as gender, accent, emotion, etc.
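The N-gram statistics underlying such models reduce, in the simplest order-2 case, to conditional bigram frequencies. A minimal maximum-likelihood sketch on an invented sentence:

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram model: P(w2 | w1) = c(w1, w2) / c(w1),
    counting w1 only over positions that have a successor."""
    unigrams = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

tokens = "the dog saw the cat and the dog ran".split()
probs = bigram_probs(tokens)
print(probs[("the", "dog")])   # 2 of the 3 successors of 'the' are 'dog'
```

Higher-order models and HMM emission/transition tables are built from the same kind of counts over longer windows.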

In recording and analysing real-world situations and data, these techniques can be extended to include multiple sensors and modalities, including multiple microphones (for noise localization and cancelling), cameras (for lip-reading and gaze tracking) and biosensors (for brain-computer interfaces and the monitoring of attention, stress, etc.). This raises further issues about the signal processing, compression and fusion of massive amounts of information from diverse sources. Again a standard approach is to define frames, windows or epochs of relatively small size, possibly on multiple scales and across multiple dimensions, and to perform spatial, temporal or spatiotemporal frequency analysis.

However, a common early step after frequency analysis is to compress using SVD, and this is a common precursor to more advanced signal separation techniques such as ICA. Typically, multiple sensors give rise to multiple streams of time-varying signal, for data with dimensions sensors × time or sensors × frames × time, where the frames are arbitrary or psychologically plausible time segments. The standard technique for moving to the frequency domain is to convert the time information into frequency information using an FFT or similar. This gives the data the dimensions sensors × frequency (spectrum) or sensors × frames × frequency (spectrogram).

Performing an SVD and/or an ICA transformation can then reduce this to sensors × sources and sources × frequency, or sources × frames × frequency.
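The time → frequency → SVD pipeline can be sketched with NumPy. Random data stands in for real sensor recordings, the dimensions are illustrative, and the ICA step is omitted; the truncated SVD alone yields the sensors × sources and sources × frequency factors described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_time, n_sources = 8, 1024, 3

# Simulated multi-sensor recording: sensors x time.
x = rng.standard_normal((n_sensors, n_time))

# Time -> frequency per sensor: sensors x frequency (the spectrum).
spectrum = np.abs(np.fft.rfft(x, axis=1))

# SVD factors this into (sensors x sources) and (sources x frequency);
# keeping only the leading singular values performs the reduction.
u, s, vh = np.linalg.svd(spectrum, full_matrices=False)
mixing = u[:, :n_sources]                              # sensors x sources
source_spectra = s[:n_sources, None] * vh[:n_sources]  # sources x frequency

print(mixing.shape, source_spectra.shape)   # (8, 3) (3, 513)
```

With framed (spectrogram) data the same factorization applies after flattening the frames × frequency axes, giving sources × frames × frequency on reshaping.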