Smoothing, Normalization, Correction and Reduction
David M.W. Powers
School of Informatics and Engineering
Flinders University of South Australia
In this paper we analyze a variety of standard techniques used in corpus
linguistics and examine a number of issues relating to incorrect usage,
the computational infeasibility of various approaches, and the
inevitable sampling errors. We motivate much of the paper with
applications from information retrieval, but the techniques discussed and introduced are far more widely applicable, and some are indeed motivated by other applications of corpus linguistics.
The Singular Value Decomposition

We first summarize some of the well known properties and applications of the singular value decomposition (SVD). This has been popularized recently in the context of information retrieval (IR) as latent semantic indexing or analysis (LSI/LSA), and we will use this application area as the primary example for making the algorithms and theory concrete.
We will however refer to other example domains including automatic speech recognition (ASR) and electroencephalographic analysis (EEG).
We will assume that data is provided in a matrix A = (a_ij) and make varying assumptions about its normalization or distortion. In general both A and A^T have interpretations as representations of one domain in terms of another, where the superscript T represents taking the transpose, viz. (A^T)_ij = a_ji. Here A represents the relationship between two different aspects of our data, e.g. documents versus words, or sensors (electrodes, microphones, pixels) versus time. We will try to avoid the word ‘dimension’ as much as possible because it can be used to describe the horizontal and vertical size and indexing of a matrix, the different attributes or features of a vector (e.g. a column or row of a matrix), and totally different modalities, sources or domains (e.g. electrode, epoch, space, time).
There are different conventions as to whether the rows or columns are interpreted as vectors, and indeed due to the duality of the analysis we can flip between the conventions at will by matrix transposition.
However, strictly speaking the data should be demeaned, that is, the means of the vectors subtracted off, and normalized, that is, divided by the standard deviations of the vectors, before applying SVD. The demeaning leads to different representations depending on whether it is done for the rows or the columns, and should be done for the vectors that are being analysed.
For example, in IR we represent the corpus as a set of document signatures based on word frequency, and it is convenient to regard the set of features, in this case words, as fixed whilst allowing the possibility of meeting new documents and queries which we want to locate in terms of the signature space. It is common to represent these in a file with one document signature per line, so the corresponding matrix has one document signature per row. It is obvious that this file representation is more easily extended with additional data points in respect of new documents or queries (pseudo-documents). It is thus the rows that are demeaned and normalized; however, transformations on the columns may also be carried out, and TFIDF (see below) is a kind of column normalization that compensates for the information capacity of the corresponding word (more frequent words carry less information).
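The column-normalization idea behind TFIDF can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the function name and the particular IDF weighting (log of inverse document frequency, with no smoothing) are our assumptions:

```python
import numpy as np

def idf_column_scaling(A):
    """Illustrative TFIDF-style column normalization: down-weight
    columns (words) that occur in many documents, since more
    frequent words carry less information.  A is documents x words."""
    A = np.asarray(A, dtype=float)
    n_docs = A.shape[0]
    doc_freq = (A > 0).sum(axis=0)                  # documents containing each word
    idf = np.log(n_docs / np.maximum(doc_freq, 1))  # inverse document frequency
    return A * idf                                  # scale each column by its weight

# Toy document-by-word count matrix (3 documents, 3 words).
A = np.array([[2., 0., 1.],
              [1., 3., 1.],
              [4., 0., 1.]])
X = idf_column_scaling(A)
print(X[:, 2])   # a word present in every document gets weight log(3/3) = 0
```

A word that occurs in every document carries no discriminating information and is zeroed out entirely under this weighting; rarer words are scaled up relative to it.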
Note that row normalization removes any direct sensitivity to the size of a document (though some indirect sensitivity should remain as longer documents tend to be more encyclopaedic in nature whilst small documents tend of necessity to be on a single specific topic). Note too that vector normalization removes a degree of freedom as the last value is predictable from the others given the constraint induced by normalization (the squares add to 1).
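The row preprocessing just described can be sketched in NumPy as follows (our variable names; we divide each demeaned row by its Euclidean length, which differs from division by the standard deviation only by a constant factor of sqrt(n), and gives exactly the "squares add to 1" constraint noted above):

```python
import numpy as np

def preprocess_rows(A):
    """Demean and normalize each row (document signature) of A
    before applying SVD."""
    A = np.asarray(A, dtype=float)
    A = A - A.mean(axis=1, keepdims=True)           # demean each row
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                         # guard against all-zero rows
    return A / norms                                # squares of each row sum to 1

# Toy document-by-word count matrix (3 documents, 4 words).
A = np.array([[2., 0., 1., 1.],
              [0., 3., 1., 0.],
              [1., 1., 0., 2.]])
X = preprocess_rows(A)
print(np.allclose((X ** 2).sum(axis=1), 1.0))       # each row has unit length
```

Note that the rows remain zero-mean after scaling, so both properties (demeaned and normalized) hold simultaneously.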
The standard notation for the SVD of A is A = U S V^T, where U and V are orthonormal matrices representing rotations, and S (sometimes L) is a matrix representing a scaling, which has the eigenvalues of A, s_ii, represented in descending order along the diagonal and zeros elsewhere.
The standard LSI notation varies from this, including using L (sometimes S) to represent a diagonal matrix with the square roots of the eigenvalues, called singular values, and using D (for document) and T (for term) as the names of the rotations:

A = D L^2 T^T

The original SVD formulation derives from the application of the eigenvalue decomposition known as Principal Component Analysis (PCA) to the covariance matrix AA^T or A^TA (whichever is the smaller, this defining the rank of A):

AA^T = U S^2 U^T ;  A^TA = V S^2 V^T

This follows by observing that as U and V are orthonormal their transposes represent the inverse rotations. More efficient calculation methods will be discussed later.
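These relationships can be checked numerically. A minimal NumPy sketch (our variable names; NumPy's vector s of singular values corresponds to the diagonal of S, so the eigenvalues of the covariance-style matrices come out as s**2):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))            # u = 5 rows, v = 3 columns

# Reduced SVD: A = U S V^T with S diagonal, s = min(u, v) = 3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
assert np.allclose(A, U @ S @ Vt)

# PCA connection: AA^T = U S^2 U^T and A^T A = V S^2 V^T,
# because V^T V = I and U^T U = I (inverse rotations).
assert np.allclose(A @ A.T, U @ S**2 @ U.T)
assert np.allclose(A.T @ A, Vt.T @ S**2 @ Vt)

# The eigenvalues of A^T A are the squared diagonal entries of S.
eigvals = np.linalg.eigvalsh(A.T @ A)      # returned in ascending order
print(np.allclose(np.sort(s**2), eigvals))
```

In practice one computes the SVD of A directly rather than forming AA^T or A^TA, which squares the condition number; the eigendecomposition route is shown only to make the identity concrete.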
Reduction and Compression

If A is a matrix with u rows and v columns (u x v), then in the unreduced formulation, U has dimensions u x u, V has dimensions v x v, and S has dimensions u x v (viz. it is not square, and thus not diagonal in the sense that implies being square). However, the rank of A is at most s = min(u,v), and S may be reduced to S_s with dimensions s x s, with a corresponding reduction in U or V by removing the rows or columns that are multiplied by columns or rows of zeros in the unreduced S. This compression is completely lossless (apart from rounding errors already inherent in the unreduced version) and thus the subscript is usually dropped, so that the same equations are used for both the reduced and unreduced versions of SVD, though in reality there is no practical use for the unreduced versions.
Since the eigenvalues in S are represented in descending order, as is the usual convention, a further, potentially lossy, compression may be achieved by reducing S to a k x k matrix S_k, dropping the rows and columns with greater indices and thus the smallest eigenvalues, along with the corresponding columns of U and V. If there are any zero eigenvalues there is no loss, and at least one zero eigenvalue will be present (apart from rounding errors) given the vectors have been normalized as described above. If non-zero eigenvalues are removed, the compression is lossy and the error introduced may be characterized in terms of their values.
The interpretation of the SVD will be discussed in more detail later, but to understand the errors introduced in compression it should be noted that the principal eigenvector represents the line of best fit in the sense of minimizing the sum of squares error (SSE) and that this is rotated to become an axis and then scaled to become a unit vector. Once this source of variance has been removed, the remaining eigenvectors can be understood recursively as discovering the other tendencies of the data. An important corollary of this is that the error (SSE) introduced by compression is the sum of squares of the eigenvalues removed, and represents the variance that is not accounted for by the model (assuming A has been correctly demeaned as above). If this SSE is divided by the sum of squares of all the eigenvalues, it may be interpreted as an error rate or probability (depending on the application).
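The relationship between truncation and SSE can be verified directly. A NumPy sketch (our variable names; NumPy's s plays the role of the diagonal of S):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # keep the k largest values
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k reconstruction

# The SSE of the rank-k reconstruction equals the sum of squares
# of the values dropped from the diagonal of S.
sse = ((A - Ak) ** 2).sum()
print(np.isclose(sse, (s[k:] ** 2).sum()))   # True

# Dividing by the total sum of squares gives the proportion of
# variance not accounted for by the rank-k model.
error_rate = sse / (s ** 2).sum()
```

This is the Eckart-Young property: no other rank-k matrix achieves a smaller sum of squares error than the truncated SVD reconstruction.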
There are some papers in the LSI/IR literature that incorrectly refer to SVD as discovering independent features, but in fact it discovers only uncorrelated features and then only if the demeaning has been done as discussed above (and in the LSI literature it is normally not done). SVD minimizes the sum of squares error and is thus second order. This makes sense when dealing with data from a distribution that approaches the normal distribution, but in general independence requires dealing with higher order errors as well, and this class of algorithm is known as Independent Component Analysis (ICA). ICA algorithms usually perform SVD first and then minimize some or all higher order errors (often only fourth order).
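The distinction between uncorrelated and independent can be illustrated with a two-line example (ours, not from the paper): a symmetric variable and its square have essentially zero covariance, yet one is completely determined by the other.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 100_000)
y = x ** 2                       # fully determined by x, hence dependent

# Uncorrelated: cov(x, x^2) = E[x^3] - E[x]E[x^2] = 0 for a
# symmetric distribution, so the sample covariance is near zero...
print(abs(np.cov(x, y)[0, 1]) < 0.01)    # True
# ...but x and y are clearly not independent.
```

Second-order methods such as SVD can only remove the correlation; detecting the remaining dependence requires the higher-order statistics that ICA exploits.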
Interpretation

We will focus on interpretation in the IR domain so that we can contrast with the approach taken in LSI, but before we look at the interpretation of the SVD matrices it is useful to note the following identities and derived matrices. We use a superscript T to indicate the transpose, P to indicate the pseudoinverse, -1 to indicate the inverse, PT to indicate the transpose of the pseudoinverse or equivalently the pseudoinverse of the transpose, and -T to indicate the transpose of the inverse or equivalently the inverse of the transpose, noting that only square matrices of full rank have a true inverse. The original identities (1 to 4) are in unreduced form but the derived identities (5 to 24) are in reduced form, with SSE equal to the sum of squares of the omitted eigenvalues.
(1)  A = U S V^T
(2)  A^T = V S U^T
(3)  U^T = U^-1  (U is orthonormal)
(4)  V^T = V^-1  (V is orthonormal)
(5)  S = L^2 = U^T A V  (diagonal)
(6)  A^P = V S^-1 U^T
(7)  A^PT = U S^-1 V^T
(8)  I = L^-1 U^T A V L^-1 = (UL)^-1 A (VL)^-T
(9)  A V = U S
(10) V^T A^T = S U^T
(11) V^T A^P = S^-1 U^T
(12) A^PT V = U S^-1
(13) U^T A = S V^T
(14) A^T U = V S
(15) A^P U = V S^-1
(16) U^T A^PT = S^-1 V^T
(17) A V S^-1 = A^PT V S = U
(18) S^-1 V^T A^T = S V^T A^P = U^T
(19) S^-1 U^T A = S U^T A^PT = V^T
(20) A^T U S^-1 = A^P U S = V
(21) A V L^P = U L
(22) L^P V^T A^T = L U^T
(23) L^P U^T A = L V^T
(24) A^T U L^P = V L

Equations 1 and 2 are repeated here as the definition of SVD and its application to the dual problem, transposing rows and columns. The unreduced U and V matrices are orthonormal matrices, which means that they are square, that the sum of squares of both rows and columns is 1, that both rows and columns represent an orthogonal basis and are uncorrelated, that the matrix may be interpreted as a rotation around the origin, and that their transpose is their inverse and represents the opposite rotation (3 and 4). S is a diagonal matrix which when it multiplies on the left scales the rows, and on the right scales the columns, by the corresponding eigenvalues. The product, sum and inverse of square diagonal matrices are simply the diagonal matrices composed of the product, sum and reciprocal of the respective elements. The pseudoinverse S^P of the unreduced form of S, formed by replacing the eigenvalues on the diagonal by their reciprocals and taking the transpose, becomes the true inverse S^-1 in the reduced form.
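A few of these identities can be spot-checked numerically. A NumPy sketch (our variable names; NumPy's pinv computes the Moore-Penrose pseudoinverse, which coincides with the V S^-1 U^T construction):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S_inv = np.diag(1.0 / s)
V = Vt.T

Ap = V @ S_inv @ U.T                       # A^P = V S^-1 U^T   (6)
assert np.allclose(Ap, np.linalg.pinv(A))  # matches Moore-Penrose

assert np.allclose(A @ V, U @ np.diag(s))        # A V = U S       (9)
assert np.allclose(U.T @ A, np.diag(s) @ Vt)     # U^T A = S V^T   (13)
assert np.allclose(A @ V @ S_inv, U)             # A V S^-1 = U    (17)
print(np.allclose(A.T @ U @ S_inv, V))           # A^T U S^-1 = V  (20)
```

Each check follows the derivation pattern in the text: multiply both sides of an earlier identity by the (pseudo)inverse of the term to be cancelled.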
(When a matrix X is not square it does not have a true inverse, but it does have pseudoinverses satisfying X X^P X = X and X^P X X^P = X^P, with X^P X and X X^P acting as identities on the appropriate subspaces.) Thus we have (pseudo)inverses for all three matrices in the decomposition, and this allows us to determine pseudoinverses of A and A^T (6 & 7) as well as the identities that are the basis of LSI. In particular, identity 8 shows that UL maps the same latent values to documents that VL maps to terms. Identities 6 to 24 are all derived by the process of multiplying both sides of a previous identity by the (pseudo)inverse of a term that is to be cancelled from one side.
Representations 17 to 20 reveal various additional interpretations of U and V and their transposes. In particular U and V are not only rotations, but are also normalized representations of the vectors in their own right: U is a representation of the rows of A rotated by V and normalized to the unit hypersphere by S^-1, and V is similarly a rotated and normalized representation of the rows of A^T. In their unreduced form U and V are actually quite uninteresting, as their rows simply represent unit vectors along orthogonal axes. In particular, the unreduced U and V matrices are not useful for classification or clustering, as each vector is equidistant from all the others. However, in reduced form they are very useful for visualizing two dimensions at a time, as the deviation from the unit circle represents the variance in the hidden dimensions: points will lie on the circle if they can be described fully in the depicted latent dimensions alone, whilst if there is little relevance to those dimensions they will be located near the origin. Note that unless A is square and unnormalized, U and V will always have been reduced (to s-1 columns, the rank minus the degree of freedom normalized out).
Representations 9 to 16 are those most appropriate for purposes of compression. Where the reduction is lossless, the results using L2 and Cosine metrics, and the clusters formed by clustering using these metrics, will be identical to those obtained with the original matrix (or, in the case of visualizations, rotations of them), but they should be obtained more cheaply due to the reduction in the matrix sizes.
Representations 21 to 24 are preferred by LSI in the belief that only they will allow words and documents to be depicted or accounted as near each other when they relate to the same topic. In fact, all of representations 8 to 24 have good properties in this respect with representations 9 to 16 giving the original results, 17 to 20 giving results that completely normalize out the strengths of the different latent variables (LSI topics or senses) and 21 to 24 representing a compromise that does neither. This can be seen by observing that 9 to 16 are simple rotations of the original data and all the others are derived by application of a simple scaling by either L-1 for the LSI representations 21 to 24 or S-1 for the normalized representations.
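The three families of document representations can be computed side by side. An illustrative NumPy sketch (our variable names; under the paper's convention S = L^2, so NumPy's singular values s give L = diag(sqrt(s))):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))            # documents x words

U, s, Vt = np.linalg.svd(A, full_matrices=False)

docs_original = U @ np.diag(s)             # U S : pure rotation of A's rows (9-16)
docs_normalized = U                        # U   : latent strengths divided out (17-20)
docs_lsi = U @ np.diag(np.sqrt(s))         # U L : the LSI compromise (21-24)

# The rotation U S preserves all inter-document distances exactly,
# so metric results match those on the original matrix.
d_orig = np.linalg.norm(A[0] - A[1])
d_rot = np.linalg.norm(docs_original[0] - docs_original[1])
print(np.isclose(d_orig, d_rot))           # True
```

Cosine and L2 comparisons on docs_original therefore reproduce the original results, while docs_normalized and docs_lsi progressively discount the strengths of the latent variables.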
Identity 8 merely shows that the same scaling L applied to U and V produces a pair of dual transformations and representations UL and VL, each of which can transform A (or A^T) into the other, and when both are applied the distortions cancel. On the other hand identity 5 shows that the bilateral application exposes the variances that apply in both contexts.
Noise, Artefact and Error

Data compression and efficiency are not the only reasons for reducing the matrix A to a more compact representation. In fact, often the reason is that rather than increasing error, the reduction can actually reduce error of various kinds.
The first kind of error is noise. In LSI, noise means the apparently arbitrary choice amongst different synonyms or other alternate expressions, idioms or metaphors. The variance that results from this is often modelled using some variant of the Poisson distribution (and we will discuss normalizations appropriate to these models below). The theory is that the low order eigenvalues correspond to this irrelevant variation.