«Smoothing, Normalization, Correction and Reduction David M.W. Powers School of Informatics and Engineering Flinders University of South Australia ...»
In clustering, to give a specific example, we note the existence of variant algorithms, e.g. K-medians versus K-means, K-harmonicmeans, K-geometric-means and the various density-weighted K-means algorithms, all of which can be interpreted as targeting the underlying probability distributions/densities and represent the cancelling of a bias by explicit (e.g. harmonic/geometric variants) or implicit (weighted variants) application of a distortion function. However, if we applied an appropriate distortion up front, a simple K-means would give equivalent performance to the harmonic and geometric variants. The case for the weighted K-means algorithms is a little different as some are designed to still find the underlying clusters based on arithmetic means but are concerned about the distribution from the perspective of sampling (e.g. increasing weight of low cardinality clusters in conformance to Zipf’s law).
Thus seeing a problem with the change in the referent of an expected value as a problem makes the assumption that frequency or probability is the correct thing to measure, whilst in fact rank, reciprocal and log are arguably more appropriate on various statistical and intuitional grounds. Furthermore, taking the mean of the logs centres around the arithmetic mean as technically required for our various matrices to be interpreted as covariance matrices, and in fact the reason the arithmetic mean or expected value is preferred in statistics is largely the second order error-minimization property that applies after normalization.
A particular consequence of the methodology proposed is that the null documents are shifted relative to the non-null documents as sparsity is retained by keeping them at zero. An alternative approach would be to subtract the log of the mean rather than the mean of the log, but this no longer represent the second order optimum in terms of minimizing SSE. These two models correspond to normalizing respectively the arithmetic mean or the geometric mean to one in the original frequency space. We now analyze and compare the two models and argue that the standard model where we apply the distortion and demean is indeed preferable, but that there are also modifications that may be useful.
The discrepancy is the same either way, but affects different subsets of the data and in this case corresponds to Information Bias whilst more generally it could be called Functional Discrepancy on the basis that after applying the function the expected value (statistically) is not the expected value (semantically).
So let us consider the properties of various measures of central tendency.
The arithmetic mean is only the expected value in the semantic sense under specific conditions, e.g.
1. An aggregate of random variables and the central limit theorem applies;
2. A word’s frequency or probability or occurrence intervals for an ergodic source.
In both these cases the term ‘expected value’ means something and it is actually the value you expect by dint of the stated assumption (e.g. the larger the interval the more likely the frequency will correspond).
However, the ergodicity requirement says that any large enough sample should be representative of the whole, whilst we have actually shown a better model is that there are relevant and irrelevant contexts that demonstrate quite different distributions, and the log-demeaning process deals with these separately. Although in a document or sample of size D words, both p = f/D, the mean frequency per token, and I = D/f, the mean interval between occurrences are arithmetic means, arithmetic averages of these over different words does not give compatible results = the arithmetic mean interval for a set of words is actually the reciprocal of the harmonic mean frequency for that set of words – the arithmetic mean is not appropriate for combining frequencies or probabilities of different words if it is actually the interval that is important, and it would seem that psychologically and linguistically our cognitive limitations determine that we should get reminded periodically what the topic is by reusing the word explicitly rather than anaphorically, and that refreshing will be required after a certain interval has passed.
There are many cases where the expected value is not semantically interpretable as such, e.g.
3. The random variable represents labels rather than measurements (e.g.
4. A variable with a semantic expected value is nonlinearly distorted (e.g. log, recip.) There are also other measures of central tendency that are used in statistics, and these come into play here as we consider the significance of the functional discrepancy between E(f(X)) and f(E(X)). We have also discovered another measure of central tendency, the entropy norm κ that is implied by the term entropy we derived from the K-mixture and P-mixture and justified in terms of balancing information about the document as measured by the fact of occurrence and the number of occurrences.
If we have a normal distribution, then the median, mode and arithmetic mean coincide. As the weight of the distribution is close around these values there is little opportunity for functional discrepancy – and this is increasingly the case for larger N and tighter distributions.
Returning to the specific functional discrepancy in moving from probability space to information space using the log function, and considering the relationship between the arithmetic and geometric means of the underlying frequency data, we note that the geometric mean is more sensitive to low outliers (esp. 0, which we however discount by taking means only over documents containing the target word) and less sensitive to high outliers (which is often regarded as a good thing as these are unbounded).
Considering the binary case for non-negative values (frequencies) a and b, am(a,b) = half(a+b) = sqrt(a*b) = gm(a,b), with equality holding only when a=b.
The functional discrepancy thus relates to the variance, and noting that a and b both differ from the arithmetic mean by dev(a,b)=abs(a-b)/2 we see that the variance represents the deviation between the squares of the arithmetic and geometric means.
var(a,b) = dev(sqr(am(a,b)),sqr(gm(a,b)) This relationship of functional discrepancy, fd, to variance immediately leads to some understanding of the extent of the fd for different possible distributions. For a Poisson distribution applicable to our hits (the documents containing the target term) we have am(a,b) = var(a,b) The P-mixture essentially defines two probabilities in our model of word occurrence in documents (and implicitly assumes documents of equal size D). These can be characterized as the probability of a relevant context (α) and the probability of an additional occurrence in such a relevant context (δ) – it allows for zero occurrences in a relevant context and models relevance separately.
As we are processing only non-null contexts, and for now treat relevant contexts as being the hits for a word, we simply use a Poisson distribution to model occurrences. The Poisson model predicts the probability of a document of size N (N large and hence omitted) containing n occurrences of w when the expected number of occurrences is m as Pm(n) = mn / em / n!
Since we know the document contains an instance of w, we used formula 30 to predict additional occurrences, and investigated the relationship of the median, mode and arithmetic and geometric means computationally using this model, showing that the Poisson model with an overall the number of occurrences n is characterized by median: n mode: n geometric mean: n+0.5 arithmetic mean: n+1.0 Our original observation was that a sparsity-preserving demeaning of the logarithm reflected the log of the geometric mean of the data, and that the arithmetic mean in the log-normalized space allowed proper interpretation in correlation and SVD processing. We showed that seeing zero probability as a limit, nulls correspond to the geometric mean of the original data for both non-null and null cases. In this model the transformation actually potentially increases sparsity as the nulls stay at zero and any documents matching corpus level expectations become additional zeros. This corresponds to the standard approach to distortion and demeaning [Powe97].
The alternative approach of subtracting the log of the arithmetic mean puts the zeros at the values corresponding to the arithmetic mean of the original data and ensures that nulls correspond to the arithmetic mean of the original data for both non-null and null cases. This is motivated by the traditional view that the arithmetic mean represents the expected value of the data, but then the log-distorted version is not properly demeaned and thus the correlations and SVD reductions have parametric error introduced into them.
Moreover, since mode represents the salient value in the distribution and the most probable prototype, it is possible to argue that this is a more plausible measure of central tendency and hence the closer geometric mean (expected displacement of 0.5) is more appropriate than the arithmetic mean (expected displacement of 1.0). Also note that the document groups around the mode and the median are unchanged by the application of any monotonic distortion, but that the arithmetic mean now identifies documents that were formerly identified with the geometric mean rather than the arithmetic mean, so both of these are less resilient than mode and median. Furthermore, if the logarithmic (information or information gain) space is meaningful, and we have argued that intuitively it is an equal interval scale, then its mean is reasonable to use too. Also note that information is an absolute scale with an absolute zero (achieved at probability 1). Moreover the magnitude of the information deviation is also an absolute scale with zero indicating a totally unsurprising outcome, a non-zero magnitude specifying the minimum description length for the deviation from expectation, with the sign being an additional bit that indicates a direction in the sense of gain or loss. Thus information deviation is also an absolute ratio scale notwithstanding the availability of the sign.
Katzian analysis of sparsity preserving log model The first observation regarding the simple splog model is that it does not distinguish between the case where a word only ever occurs a single time in a relevant document (or more generally occurs the expected number of times in a document), the case where as expected it does not occur in an irrelevant document, and the case where it does not occur in a relevant document. In the first two cases the value is 0 because it matches the empirical expectation, and the third is not distinguishable by our simplistic model. When TF=DF, there is a more serious problem as the variance is zero so p(TF≠DF)=0, so that in the Katzian models, β=0, γ=1, δ=0, ε=1 in violation of our assumptions, and α and η are consequently not well defined. There are several separate issues here arising from specific assumptions and that all cases where a probability is 0 or 1 are degenerate. In particular we want to
1. There is a non-zero probability of occurrence in an irrelevant context (α)
2. There is a non-zero probability of an extra occurrence in a relevant context (δ)
3. There is a non-zero probability of null occurrences in a relevant context (ε) The implication of point 1, is that zero occurrences is actually an underestimate of the expected number of occurrences in an apparently irrelevant document (any term may occur incidentally or as a change of topic), so that the document information gain should actually have a negative value (a loss), the count being less than expected. To preserve sparsity, we could compensate for this by adding αµ to all non-null terms before splogging as the probability of occurrence in an irrelevant context, or µ=DF as the probability of occurrence in any context. The latter is the better choice as we specifically recognize than nonoccurrence does not imply irrelevance, where as the former corresponds to the usual IR search models where non-occurrence is equated with irrelevance.
Thus one solution is to view frequency 0 as a deviation from µ, being our estimate of the expected number of occurrences in an arbitrary context. Since we want to maintain sparsity and keep the expected values at 0 (and these zeros dominate the matrix and heavily influence the mean), we propose to maintain relativity by adjusting the values for non-null occurrence by subtracting µ (modification µ). Given some of our null-occurrence documents may be relevant, it would also be possible to subtract α to demean that subset of the documents (modification α). However in the absence of some independent way of determining relevance we cannot distinguish the relevant documents and they overlap both the null and non-null occurrence documents.
The implication of point 2 is that we need to implement some kind of improved estimate of δ in at least the cases where λ=TF=DF=µ. This could be implemented as some kind of smoothing, and in particular we have an implicit assumption that the probability of occurrence in an irrelevant context is cannot exceed the probability of occurrence in a relevant context, and both are strictly non-zero (0µαδ1). Note that if we have eliminated stopwords and are limiting our terms to topically relevant terms, the central inequalities should be strict as relevant topics should be more frequent than documents containing terms as illustrated by synonyms, paraphrases and definitions, and the Katzian cluster model is based on the idea that words are more frequent once the initial occurrence has occured. An ‘add one’ or ‘add half’ style correction would note that E(TF)E(DF) according to the model and that it is just that in a limited size corpus we are seeing specific TF and DF values from a distribution that happen not to satisfy TFDF, and propose correcting this with an estimate TF*=DF+φ, for φ = 1.0 or 0.5, or φ/µ = 1.0 or 0.5.
However, this can actually lead to violations of probabilistic conditions (probabilities exceeding 1), so that a more intelligent approach is needed. We therefore introduce an increment φ that is a function of µ and a factor f that specifies the proportion (probability) of nonoccurrence contexts that are in fact relevant. Our constraints give us a smoothing model of the form ‘add φ to µ’ (adding to empirical corpus term frequencies) or ‘add φ/µ to γ’ (adding to expected document term frequencies or counts, TFwd ≈ TCwd) defining the reasonable values of φ