«Smoothing, Normalization, Correction and Reduction David M.W. Powers School of Informatics and Engineering Flinders University of South Australia ...»
Note that as discussed above demeaning step 2b is not required after L1-normalization unless a non-linear transformation at step 2a undoes the L1-normalization, as vector demeaning will also achieve orthogonal demeaning. Double demeaning (explicit or implicit) will reduce the error of SVD-reduction. In order for this sequence not to hide error we implicitly assume that the size information lost in step 1 is not directly informative (size of document does not provide information about what the document is about) and similarly that step 2 is information preserving. This is automatically so for transformations that are reversible, as is the case for 2b. This is also the case for 2a if monotonic functions are used. Given data is strictly positive logx, tanx, tanhx or xk are all possibilities, but the 0 condition need not obtain and indeed in IR matrices are usually sparse in violation of this condition. This would thus appear to exclude direct application of log, which is unfortunate as translation to the information domain would appear to be semantically advantageous (as we feel proportional increases in frequency are equivalent) whilst preserving sparsity is computationally advantageous (in terms of both time and memory).
Postnormalization (after step 3) is not recommended as it will destroy the double demeaning. This includes standardization (so that ATA is a correlation matrix, as commonly performed in ICA algorithms) and any other form of L2 normalization in particular (and in general any transformation that is non-linear for either rows or columns) means that the loss of degrees of freedom is no longer reflected in rank.
Recall that the SVD process produces two orthonormal matrices, before reduction. Even after rank reduction the column vectors are effectively L2-normalized and orthogonalized, but the row vectors are shorter in proportion to the squareroot of the reduction in rank – with rank reduction from N to K, length is √(K/N). With lossy compression the shortening is given by the foreshortening due to the magnitudes of the components in the ignored dimensions, viz. 1 – √(Σixi).
Renormalization is thus not desirable for purposes of 2D visualization or other lossy reduction paradigms, as the length of the vectors (distance from centre) is indicative of the information that is present in unshown/unretained dimensions. For purposes of comparison (distance or similarity measures) in the absence of lossy reduction, renormalization is unnecessary as it remains L2-normalized.
In this context, dot product corresponds to D2=cos(θ) where θ is the angle between vectors, varying from +1 for positively correlated through 0 for uncorrelated to –1 for negatively correlated (as θ goes from 0 to π/2 to π), and Euclidean distance corresponds to L2=2sin(θ/2), varying from 0 for identity to 2 for diametrically opposite (as θ/2 goes from 0 to π/2). Note that θ is itself a linear distance measure whilst dot product and Euclidean distance represent non-linear (inverted sigmoid and half-sigmoid) distortions of θ.
Furthermore, L2 = (2–2D2)1/2 and D2 = 1 – L22/2. The Manhattan distance, L1, the Canberra distance and other variants of the Minkowski metrics, Lp p≠2, are not rotation invariant and hence particularly inappropriate in conjunction with SVD. ICA as an alternative to SVD represents a further rotation and thus is pointless if rotation invariant distance measures are used, unless it is used to identify and eliminate irrelevant dimensions (as a form of lossy compression designed to lose noise rather than information).
The sparsity preserving log transformation (splog) We noted above in several places that demeaning should be performed before SVD, but often is not, particularly in the context of LSI. The mean represents the best fit in a sum of squares sense and so the sum of squares is not minimized if the mean is not zero, and the squared eigenvalues are not interpretable as variances. Thus in order for the error resulting from lossy reduction to be minimized it is necessary to demean, and the error introduced by the dimension reduction will tend to be doubled for LSI, since it will be biased by the mean, and the IR frequencies may be modelled in terms of a Poisson distributions for which the mean and variance are equal.
So this raises the question of why demeaning is not performed. In some cases, it is simply ignorance of the importance of this step in what has become a black box technique. However, in the case of LSI we are dealing with term frequencies that are zero for most terms given a document, and for most documents given a term. In the case of terms, there will typically be at least a 1000 times more zeros than occurrences for any document. In the case of documents, the factor is similarly high for all terms once the frequent closed class lexical items have been excluded as stop words. The overall sparsity reflects the average sparsity of both document and term vectors, and is lost if we demean either document or term vectors as the nulls now all become – mean, and they cannot meaningfully be left at zero.
However, if we take the log (or –log) of the non-zero values and move to the information domain, the arithmetic mean in this information domain represents the geometric mean in the original frequency domain, and this is also a valid measure of central tendency that we will discuss in more detail below. We will assume we are treating document signatures, as in IR/LSI, rather than the word signatures, as in other applications such as word sense disambiguation (WSD). Thus we will subtract the mean term information in relation to the non-zero terms of each document, but in this case we will argue that we can reasonably leave the zero entries at zero.
Demeaning in the information domain gives a balanced spread around an expected information of zero (frequency equals expected frequency and the log of their ratio is 0) so that the null documents (those in which the frequency was zero) can remain at zero (0 frequency equals 0 expected frequency) for no information gain or loss. Positive figures correspond to the document's information gain for the word: higher occurrences mean less information is gained by being told a word in the document is the target, reflecting the fact that the document has more information about this word. Negative figures correspond to information loss which tends to be spread over many words as, except for the excluded closed class words and a few other instances of similar generality, words other than the topical words will occur less often than usual in the document simply because they are not the topic and are being displaced by topic-relevant words.
Note that the expected count for a term in an irrelevant document is close to zero and is much less than the expected count in a relevant document (and indeed we have assumed that the occurrence of a word defines a document as relevant). For terms that do not occur in a document, that we ignored in taking the log and the mean above, we note that since we know there a 0 occurrences of the term, 0 is arguably the expected number of occurrences in the document, and so there is no deviation from this expectation and hence no information gain or loss.
The demeaning model based on this argument we will call splog0. This is simplistic and we will shortly analyze the error introduced by these assumptions about demeaning and the fact that we are essentially demeaning two subsets of the documents separately.
Scales and distortions We noted earlier that sometimes it is appropriate to normalize using some non-linear processing step. In fact, the difference between SVD and higher order ICA can largely be understood in terms of ICA matching a sigmoid to the cumulative distribution function of the data (the probability of a value being less than or equal to k, being the sum up to k of the probabilities for discrete possibilities i, or the integral up to k of a continuous probability distribution function p(i)).
We have now seen some applications of the non-linear logarithmic distortion, which has raised the question of whether this is a good thing from various perspectives.
In general statistical analyses are based on specific assumptions and these include assumptions about the scale of measurement. For example, it only makes sense to add values that are measured on an equal interval scale. In the case of frequency, our intuition is that adding one extra occurrence of a term to a document with one occurrence is far more significant than adding it to a document with a thousand occurrences. Thus we think that it is the percentage increase that is important. Taking the logarithm means that variations by the same percentage are shifted by equal intervals, and this is part of the value of the concept of information.
Another useful assumption that can be made about scales is that there is an absolute zero and it makes no sense to have values less than zero – only in such a scale does it make sense to talk about ratios, something being twice something else, for example. For this reason, absolute scales are also known as ratio scales. The ratio scale property holds for measurements of size or interval, but is not true for measurement of location. In the IR application, zero frequency of a term and zero interval between occurrences of a term both thus delimit absolute measurement scales. Note that frequency and wavelength have a reciprocal relation, and so do frequency and interval. Both average interval and probability can be seen as being derived from frequency, but equally well frequency could be regarded as being derived from an underlying probability distribution or an underlying repetition interval (buses are due every 15 minutes = 4 per hour = a 1/15 probability of a bus arriving during any particular minute). The Poisson distribution actually derives from this idea of the expected interval to the next occurrence, noting that it cannot be smaller than zero as that would mean the latest bus arrived before the last bus, which is a contradiction.
Zipf’s Law tells us that if we order the events (e.g. occurrence of terms) by their probability and number the most frequent term (1) to the least frequent term (L), this rank is inversely proportional to the frequency, or directly proportional to the interval. Taking the rank is a standard statistical technique used when the data does not obey the assumptions of the paradigm used and so parametric techniques are not appropriate.
After taking the rank we end up with an absolute equal interval scale and we can now use techniques that make such assumptions, the composite approach of taking the rank and using a parametric method being a non-parametric method that does not require these assumptions about the underlying distribution.
Frequency and probability do not obey the appropriate assumptions for the usual statistical methods, but taking rank, reciprocal or logarithm achieves an absolute equal interval scale, otherwise known as a ratio scale. Empirically, in many language-based applications better results tend to be achieved with one of these scales than with the raw frequency or probability scales (see e.g. [Powe97]).
Note that the normalization of frequencies to probabilities or conditional properties is identical with L1–normalization on a matrix or vector (divide by the sum of magnitudes, before or after demeaning) but is undone by the L2-normalization appropriate to and performed by SVD (divide by the sum of squared magnitudes, recalling that the U and V matrices are L2-normalized). These are linear normalizations which we distinguish from the non-linear distortions represented by reciprocal and logarithm. Visually a good representation, with an intuitively plausible equal interval scale, has a very even spread of points. A poor representation will tend to be more or less dense around the mean and will lead to the development of inappropriate clusters.
Taking the logarithm fits our intuition about ratios being important at the level of frequency as well as our intuition that information should be represented in bits (or related units such as nits or digits where natural or decimal logarithm are used rather than the usual binary logarithm).
Taking the reciprocal or the rank fits the Zipfian model and evenly spreads out the words according to their characteristic frequency (either in the corpus or in a document).
The choice of distortion will in general affect the results. For example in [Powe97], the rank distortion tended to be the most robust, finding the expected vowel class, with y as a possible inclusion, whilst the logarithmic distortion tended to class space as a possible vowel (with or without identifying the vowel class in its own right). On the other hand, the reciprocal distortion (despite its Zipfian similarity to rank) was not very successful at finding the vowel class.
In the IR application, one of the prime goals is correct ranking of documents relevant to a query, and pre-ranking of the terms has also proven to be a useful step in preliminary explorations of its utility in clustering. The TFIDF processing step discussed above, and the subsequent Katzian modification, both make use of a logarithmic distortion into the information domain. These distortions should really be understood in the etymologically accurate sense of untwisting the data in a way that removes unhelpful biases.
The unexpected value We now return to consider in more detail our observation that the proposed sparsity preserving method of demeaning in log space (splog) effectively treats the original geometric mean as the measure of central tendency. We also note that the existence of the various alternatives is evidence that other measures are at times more useful, and in particular we note that the geometric mean is appropriate where a logarithmic distortion gives the required scale, and similarly the harmonic means is appropriate where a reciprocal distortion gives the required scale, and the respective distortions map these means to the arithmetic means required in the distorted space.