FREE ELECTRONIC LIBRARY - Thesis, documentation, books

Pages:   || 2 | 3 | 4 | 5 |   ...   | 8 |

«Smoothing, Normalization, Correction and Reduction David M.W. Powers School of Informatics and Engineering Flinders University of South Australia ...»

-- [ Page 1 ] --

Smoothing, Normalization, Correction and


David M.W. Powers

School of Informatics and Engineering

Flinders University of South Australia


In this paper we analyze a variety of standard techniques used in corpus

linguistics and examine a number of issues relating to incorrect usage,

the computational infeasibility of various approaches, and the

inevitable sampling errors. We motivate much of the paper with

applications from information retrieval but the techniques discussed and introduced are far more widely applicable, and some are indeed motivated from other applications of corpus linguistics.

The Singular-Valued Decomposition We summarize first some of the well known properties and applications of singular-valued decomposition (SVD). This has been popularized recently in the context of information retrieval (IR) as latent semantic indexing or analysis (LSI/LSA) and we will use this application area as the primary example for making the algorithms and theory concrete.

We will however refer to other example domains including automatic speech recognition (ASR) and electroencephalographic analysis (EEG).

We will assume that data is provided in matrix A = aij and make varying assumptions about its normalization or distortion. In general both A and AT have interpretations as representations of one domain in terms of another, where the superscript T represent taking the transpose, viz. AT = Aji. Here A represents the relationship between two different aspects of our data, e.g. documents versus words, or sensors (electrodes, microphones, pixels) versus time. We will try to avoid the word ‘dimension’ as much as possible because it can be used to describe both the horizontal and vertical size and indexing of a matrix as well as the different attributes or features of a vector (e.g. a column or row of a matrix) as well as totally different modalities or sources or domains (e.g. electrode, epoch, space, time).

There are different conventions as to whether the rows or columns are interpreted as vectors, and indeed due to the duality of the analysis we can flip between the conventions at will by matrix transposition.

However, strictly speaking the data should be demeaned, that is subtract off the means of the vectors, and normalized, that is divide by the standard deviations of the vectors, before applying SVD. The demeaning leads to different representations depending on whether it is done for the rows or the columns, and should be done for the vectors that are being analysed.

For example, in IR we represent the corpus as a set of document signatures based on word frequency, and it is convenient to regard the set of features, in this case words, as fixed whilst allowing the possibility of meeting new documents and queries which we want to locate in terms of the signature space. It is common to represent these in a file with one document signature per line, so the corresponding matrix matrix has one document signature per row. It is obvious that this file representation is more easily extended with additional data points in respect of new documents or queries (pseudo-documents). It is thus the rows that are demeaned and normalized, however transformations on the columns may also be carried out and TFIDF (see below) is a kind of column normalization that compensates for the information capacity of the corresponding word (more frequent words carry less information).

Note that row normalization removes any direct sensitivity to the size of a document (though some indirect sensitivity should remain as longer documents tend to be more encyclopaedic in nature whilst small documents tend of necessity to be on a single specific topic). Note too that vector normalization removes a degree of freedom as the last value is predictable from the others given the constraint induced by normalization (the squares add to 1).

The standard notation for the SVD of A is A = U S VT where U and V are orthonormal matrices representing rotations and S (sometimes L) is a matrix representing a scaling which has the eigenvalues of A, sii, represented in descending order along the diagonal, and zeros elsewhere.

The standard LSI notation varies from this including using L (sometimes S) to represent a diagonal matrix with the squareroots of the eigenvalues, called singular values, and using D (for document) and

T (for term) as the names of the rotations:

A = D L2 TT The original SVD formulation derives from the application of the Eigenvalue decomposition known as Principal Component Analysis (PCA) to the covariance matrix AAT or ATA (which ever is the smaller,

this defining the rank of A):

AAT = U S2 VT ; ATA = V S2 UT This follows by observing that as U and V are orthonormal their transposes represent the inverse rotations. More efficient calculation methods will be discussed later.

Reduction and Compression If A is a matrix with u rows and v columns (u x v), then in the unreduced formulation, U has dimensions u x u, V has dimensions v x v, and S has dimensions u x v (viz. it is not square and thus not diagonal in the sense that implies being square). However the rank of A is s = min(u,v) and S may be reduced to Ss with dimensions s x s, with a corresponding reduction in U or V by removing rows or columns that are multiplied by columns or rows of zeros in the unreduced S. This compression is completely lossless (apart from rounding errors already inherent in the unreduced version) and thus the subscript is usually dropped so that the same equations used for both the reduced and unreduced versions of SVD, though in reality there is no practical use for the unreduced versions.

Since the eigenvalues of S are represented in descending order, as is the usual convention, a further potentially lossy compression may be achieved by reducing S to a k x k matrix Sk by dropping the rows and columns with greater indices and the smallest eigenvalues, along with the corresponding columns resp. in U and V. If there are any 0 eigenvalues, there is no loss – and at least one 0 eigenvalue will be present (apart from rounding errors) given the vectors have been normalized as described above. If non-zero eigenvalues are removed, then the compression is lossy and the error introduced may be characterized in terms of their values.

The interpretation of the SVD will be discussed in more detail later, but to understand the errors introduced in compression it should be noted that the principal eigenvector represents the line of best fit in the sense of minimizing the sum of squares error (SSE) and that this is rotated to become an axis and then scaled to become a unit vector. Once this source of variance has been removed, the remaining eigenvectors can be understood recursively as discovering the other tendencies of the data. An important corollary of this is that the error (SSE) introduced by compression is the sum of squares of the eigenvalues removed, and represents the variance that is not accounted for by the model (assuming A has been correctly demeaned as above). If this SSE is divided by the sum of squares of all the eigenvalues, it may be interpreted as an error rate or probability (depending on the application).

There are some papers in the LSI/IR literature that incorrectly refer to SVD as discovering independent features, but in fact it discovers only uncorrelated features and then only if the demeaning has been done as discussed above (and in the LSI literature it is normally not done). SVD minimizes the sum of squares error and is thus second order. This makes sense when dealing with data from a distribution that approaches the normal distribution, but in general independence requires dealing with higher order errors as well, and this class of algorithm is known as Independent Component Analysis (ICA). ICA algorithms usually perform SVD first and then minimize some or all higher order errors (often only fourth order).

Interpretation We will focus on interpreting in the IR domain so that we contrast with the approach taken in LSI, but before we look at the interpretation of the SVD matrices it is useful to note the following identities and derived matrices. We use a superscript T to indicate the transpose, P to indicate the pseudoinverse, -1 to indicate the inverse, PT to indicate the transpose of the pseudoinverse or equivalently the pseudoinverse of the transpose, and -T to indicate the transpose of the inverse or equivalently the inverse of the transpose, noting that only square matrices of full rank have a true inverse. The original identities (1 to 4) are in unreduced form but the derived identities (5 to 24) are in reduced form, with SSE equal to the sum of squared eigenvalues omitted.

A = USVT AT = VSUT where U is orthonormal, UT = U-1 and V is orthonormal, VT = V-1 and S = L2 = UTAV is diagonal AP = VS-1UT APT = US-1VT I = L-1 UTAV L-1 = (UL)-1 A (VL)-T


AVS-1 = APTVS = U S-1VTAT = SVTAP = UT S-1UTA = SUTAPT = VT ATU = SUTAPT = V AV LP = UL LPVTAT = LUT LPUTA = LVT ATU LP = VL Equations 1 and 2 are repeated here as the definition of SVD and its application to the dual problem, transposing rows and columns. The unreduced U and V matrices are orthonormal matrices which means that they are square, that the sum of squares of both rows and columns is 1, that both rows and columns represent an orthogonal basis and are uncorrelated, that the matrix may be interpreted as a rotation around the origin, and that their transpose is their inverse and represents the opposite rotation (3 and 4). S is a diagonal matrix which when it multiplies on the left scales the rows, and on the right scales the columns, by the corresponding eigenvalues. The product, sum and inverse of square diagonal matrices are simply the diagonal matrices composed of the product, sum and reciprocal of the respective elements. The pseudoinverse SP of the unreduced form of S formed by replacing the eigenvalues on the diagonal by their reciprocals and taking the transpose becomes the true inverse S-1 in the reduced form.

(When a matrix X is not square it does not have a unique inverse but will in general have multiple pseudoinverses that satisfy XPX = XXP = I.) Thus we have (pseudo)inverses for all three matrices in the decomposition and this allows us to determine pseudoinverses of A & AT (6 & 7) as well as the identities that are the basis of LSI. In particular, identity 8 shows that UL maps the same latent values to documents that VL maps to rows. Identities 6 to 24 are all derived by the process of multiplying both sides of a previous identity by the (pseudo)inverse of a term that is to be cancelled from one side.

Representations 17 to 20 reveal various additional interpretations of U and V and their transposes. In particular U and V are not only rotations, but are also normalized representations of the vectors in their own right: U is a representation of the rows of A rotated by V and normalized to the unit hypersphere by S-1, and V is a similarly a rotated and normalized representation of the rows of AT. In their unreduced form U and V are actually quite uninteresting as they simply represent unit vectors along orthogonal axes. In particular, the unreduced U and V matrices are not useful for classification or clustering as each is equidistant from all the others. However in reduced form they are very useful for visualizing two dimensions at a time as the deviation from the unit circle represents the variance in the hidden dimensions – they will be on the circle if they can be described fully in the depicted latent dimensions alone, whilst if there is little relevance to those dimensions they will be located near the origin. Note that unless A is square and unnormalized, U and V will always have been reduced (to s-1 columns, the rank minus the degree of freedom normalized out).

Representations 9 to 16 are those that are most appropriate for purposes of compression, and where the reduction is lossless the results using L2 and Cosine metrics, and the clusters formed by clustering using these metrics, will be identical to (or in the case of visualizations, rotations of) those obtained with the original matrix, but they should be obtained more cheaply due to the reduction in the matrix sizes.

Representations 21 to 24 are preferred by LSI in the belief that only they will allow words and documents to be depicted or accounted as near each other when they relate to the same topic. In fact, all of representations 8 to 24 have good properties in this respect with representations 9 to 16 giving the original results, 17 to 20 giving results that completely normalize out the strengths of the different latent variables (LSI topics or senses) and 21 to 24 representing a compromise that does neither. This can be seen by observing that 9 to 16 are simple rotations of the original data and all the others are derived by application of a simple scaling by either L-1 for the LSI representations 21 to 24 or S-1 for the normalized representations.

Identity 8 merely shows that the same scaling L applied to U and V produces a pair of dual transformations and representations UL and VL, each of which can transform A (or AT) into the other and when both are applied the distortions cancel. On the other hand identity 7 shows that the bilateral application exposes the variances that apply in both contexts.

Noise, Artefact and Error Data compression and efficiency is not the only reason for reducing the matrix A to a more compact representation. In fact, often the reason is that rather than increasing error the reduction can actually reduce error of various kinds.

The first kind of error is noise. In LSI, noise means the apparently arbitrary choice amongst different synonyms or other alternate expressions, idioms or metaphors. The variance that results from this is often modelled using some variant of the Poisson distribution (and we will discuss normalizations appropriate to these models below). The theory is that the low order eigenvalues correspond to this irrelevant variation.

Pages:   || 2 | 3 | 4 | 5 |   ...   | 8 |

Similar works:

«Chapter 1 STRATEGIES AND CHALLENGES OF INTERNET GROCERY RETAILING LOGISTICS∗ Tom Hays United States Army Huntsville, Alabama tomhays90@hotmail.com Pınar Keskinocak School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, Georgia pinar@isye.gatech.edu Virginia Malcome de L´pez o Pegasus Group Inc. San Juan, Puerto Rico virginia@pegasuspr.net Abstract One of the most challenging sectors of the retail market today is the grocery segment, specifically e-grocers....»

«FS700TS (728TS, 752TS, 752TPS) Series Hardware Installation Guide NETGEAR, Inc. 4500 Great America Parkway Santa Clara, CA 95054 USA 202-10333-01 October 2007 © 2007 by NETGEAR, Inc. All rights reserved. FullManual. Technical Support Please refer to the support information card that shipped with your product. By registering your product at http://www.netgear.com/register, we can provide you with faster expert technical support and timely notices of product and software upgrades. NETGEAR, INC....»

«Ye Olde Medieval English Terms A 1. Acclumsid: numbed, clumsy.2. Aceasecomic: One whose hair was never cut.3. Acopon: A southing salve, poultice, or plaster to relieve pain.4. Adam’s Ale: A humorous term for water, the only drink for Adam and Eve.5. Afterling: An inferior 6. Agrum: A swelling of the cheeks or mouth.7. Agauw: To horrify, to cause shuddering.8. Aimcrier: An applauder or encourager; the person who cried “Aim!: to encourage an archer; the one who stood near the target to report...»

«United States Office of Air Quality EPA 453/R-94-054 June 1994 Environmental Protection Planning and Standards Agency Research Triangle Park NC 27711 Air E PA Alternative Control Techniques Document: Offset Lithographic Printing Supplemental Information Based on Public Comment on Draft Control Techniques Guideline Announced in Federal Register on November 8, 1993 EPA 453/R-94-054 Alternative Control Techniques Document: Offset Lithographic Printing Supplemental Information Based on Public...»

«Human Potential Throughout time, people have explored the ways in which they can improve some aspect of their performance. Such attempts are more visible today, with many working to gain an ‘edge’ on their performance, whether it is to learn a new language, improve memory or increase golf handicaps. This book examines a range of techniques that are intended to help improve some aspect of performance, and examines how well they are able to achieve this. The various performance-enhancing...»

«MT-013 TUTORIAL Evaluating High Speed DAC Performance by Walt Kester INTRODUCTION Unlike an ADC which requires an FFT processor to evaluate spectral purity, a DAC produces an analog output which can be examined directly using a traditional analog spectrum analyzer. A challenge in DAC evaluation is generating the digital input that can range from a single-tone sinewave to a complex wideband CDMA signal. Direct digital synthesis techniques can be used to generate digital sinewaves, but more...»

«HANDBOOKS IN ECONOMICS Series Editors KENNETH J. ARROW MICHAEL D. INTRILIGATOR North-Holland is an imprint of Elsevier Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands Linacre House, Jordan Hill, Oxford OX2 8DP, UK First edition 2010 #2010 Elsevier B.V. All rights reserved No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written...»

«Training Donkeys By Meredith Hodges Lucky Three Ranch – Loveland, CO Part I Training your donkey is really not much different from training horses and mules, though there are differences in instinct and attitude that will determine your approach in given situations. The mechanics and techniques, however, remain the same. The donkey foal needs to begin his life of training with imprinting. Imprinting is simply getting your donkey accustomed to your touch, your voice, your smell, the way you...»

«Transactions on Engineering Sciences vol 24, © 1999 WIT Press, www.witpress.com, ISSN 1743-3533 Some application and new schemes of two-dimensional BEM for contact problem Zhenhan Yac/% Jiaqing Chen^ and Junping Pu^ ^Department of Engineering Mechanics, Tsinghua University, Beijing, 1 0 8 China, demyzh@mail.tsinghua.edu.cn ^Institute of Petroleum Chemical Engineering, Dazing, Abstract After a brief formulation of the sub-domain approach of the boundary element method for 2-D elastic contact...»

«NOTICE: this is the author’s version of a work that was accepted for publication in Journal of Voice. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Intelligibility of sung vowels: the effect of consonantal context and the...»

«CFD ANALYSIS OF 3-D THERMALHYDRAULICS FLOW EFFECTS ON WALL CONCENTRATION GRADIENT PROFILES FOR LBE LOOP FITTINGS by Narain Armbya Bachelor of Engineering PES Institute of Technology, India A thesis submitted in partial fulfillment of the requirements for the Master of Science Degree in Mechanical Engineering Department of Mechanical Engineering Howard R. Hughes College of Engineering Graduate College University of Nevada, Las Vegas December 2004 Thesis Approval Form ii 1. ABSTRACT CFD Analysis...»

«Evaluating Sparse Data Storage Techniques for MPI Groups and Communicators Mohamad Chaarawi and Edgar Gabriel Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston, {mschaara,gabriel}@cs.uh.edu Abstract. In this paper we explore various sparse data storage techniques in order to reduce the amount of memory required for MPI groups and communicators. The idea behind the approach is to exploit similarities between the objects and thus store only the...»

<<  HOME   |    CONTACTS
2016 www.thesis.xlibx.info - Thesis, documentation, books

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.