
Smoothing, Normalization, Correction and Reduction
David M.W. Powers
School of Informatics and Engineering, Flinders University of South Australia


The second kind of error is artefact. In IR, artefact may refer to some systematic bias in the data, for example the register and content bias introduced by using the Wall Street Journal as a source of information about English. In EEG processing, artefact refers to signal that originates from sources other than the brain, in particular ocular and other muscular contamination. Artefact can produce very strong effects that correspond to large eigenvalues, so such reductions must be made by identifying the corresponding components rather than by discarding the smallest ones. In the case of EEG this is achieved more effectively by using blind signal separation (BSS) techniques such as Independent Component Analysis (ICA).
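As a concrete illustration of removing artefact by identifying its components rather than by eigenvalue size, here is a minimal sketch assuming scikit-learn's FastICA; the EEG array, the number of components and the indices of the artefactual components are all hypothetical and would in practice come from inspection or automated criteria.

```python
# Sketch: component-based artefact removal with ICA (hypothetical data and indices).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
eeg = rng.standard_normal((5000, 32))        # placeholder EEG: samples x channels

ica = FastICA(n_components=32, random_state=0)
sources = ica.fit_transform(eeg)             # unmixed source estimates

artefact_idx = [0, 3]                        # hypothetical ocular/muscular components
sources[:, artefact_idx] = 0.0               # zero the artefactual sources

eeg_clean = ica.inverse_transform(sources)   # remix the remaining sources
```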

Other kinds of error produce very small artefacts, including those that result from the finite precision used in the matrix calculations; the corresponding components should always be removed, as they really represent zero variance.

Where eigenvectors are removed that correspond to noise, artefact or error, whether the eigenvalues are big or small, the sum of squared eigenvalues actually represents the net reduction in error, and this is routinely reflected in improved results in subsequent processing. Factor analysis can be understood in terms of subtracting an estimate of the noise distribution from the covariance at an intermediate point in the SVD process and will be discussed below.
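A small numerical check on arbitrary synthetic data makes this error accounting concrete: singular values at the level of machine precision carry no real variance and can be zeroed, and the squared Frobenius norm of the error introduced by discarding components equals the sum of their squared singular values.

```python
# Sketch: truncated SVD error equals the sum of squared discarded singular values.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50)) @ rng.standard_normal((50, 50))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Zero singular values at the level of numerical precision (zero real variance).
tol = max(X.shape) * np.finfo(X.dtype).eps * s[0]
s = np.where(s > tol, s, 0.0)

k = 20                                        # keep the k largest components
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

err = np.linalg.norm(X - X_k, 'fro') ** 2
print(err, np.sum(s[k:] ** 2))                # the two quantities agree
```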

Unfortunately, normalization is often neglected or incorrectly performed, which subverts the correct application of the theory, so that dimension reduction may retain or introduce unnecessary error. This is particularly true in the IR application and the commonly espoused use of SVD in LSI, so we treat this example in detail.

Normalization in our usage includes demeaning as well as dividing by column or row sums or other normalization factors, but should sometimes also include undoing a non-linear distortion of the original data by applying the inverse of the distorting process.

Where different transformations are introduced prior to application of SVD, the interpretation of the noise that is eliminated will change correspondingly. We will discuss several different transformations below from other perspectives, but at this point we will characterize the effect of subtractive filtering for several methods.

Simple demeaning has the effect of removing an additive bias and is optimal in the second-order sense of minimizing the sum of squared error, provided the distribution is sufficiently normal. The effect and advantages of subtracting other measures of central tendency are discussed later, but when applying SVD the arithmetic mean must be zero in order for the squared eigenvalues to represent the sum-of-squares error. We will also consider below the effect of subtracting the covariance matrix of a noise distribution from the covariance matrix of our original data during Factor Analysis.
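A minimal numpy sketch on arbitrary data shows why the zero mean matters: once the column means are subtracted, the squared singular values partition the total sum of squares exactly, so any truncation is accounted for in those terms.

```python
# Sketch: after demeaning, squared singular values sum to the total sum of squares.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 40)) + 5.0      # data carrying an additive bias

Xc = X - X.mean(axis=0)                       # subtract the column (arithmetic) means
s = np.linalg.svd(Xc, compute_uv=False)

print(np.sum(s ** 2), np.sum(Xc ** 2))        # equal (up to rounding)
```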

In the frequency domain, for example after applying the Fast Fourier Transform (FFT) to a speech signal, subtracting a measure of central tendency or a noise distribution corresponds to eliminating background noise or an uninformative carrier by removing its characteristic noise signature. For example, if samples are taken of background air-conditioning and computer noise, this has a characteristic frequency distribution that can be subtracted directly. For speech, centring on the fundamental and normalizing the variance cancels out basic frequency characteristics due to sex. If in the frequency domain we take the log and similarly subtract a measure of central tendency or a noise distribution, the effect is quite different: for an acoustic signal, this approach can cancel out basic convolutive effects due to multiple paths and echo or reverberation. These are standard steps in signal and speech processing.
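The two operations can be sketched schematically with synthetic arrays standing in for speech frames and sampled background noise (frame length, noise level and the spectral floor are arbitrary choices, not a tuned pipeline): subtraction in the magnitude domain removes an additive noise signature, while mean subtraction in the log-magnitude domain divides out a fixed convolutive (channel or reverberation) effect.

```python
# Sketch: additive noise handled by spectral subtraction; convolutive effects by
# subtracting the mean log-spectrum (synthetic stand-ins for speech and noise).
import numpy as np

rng = np.random.default_rng(3)
n = 1024
speech = rng.standard_normal(n)                      # placeholder speech frame
noise = rng.standard_normal((20, n)) * 0.3           # sampled background noise frames

S = np.abs(np.fft.rfft(speech))
N = np.abs(np.fft.rfft(noise, axis=1)).mean(axis=0)  # characteristic noise signature
S_denoised = np.maximum(S - N, 0.0)                  # direct subtraction, floored at 0

frames = rng.standard_normal((50, n))                # placeholder frame sequence
logS = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-12)
logS_centred = logS - logS.mean(axis=0)              # cancels a fixed channel filter
```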

These last examples illustrate model-based elimination of noise prior to or during the SVD process, whereas the previously discussed reduction of the decomposition is hypothesized to eliminate certain kinds of noise subsequent to the decomposition and may be model-based (a priori), significance-based (eliminating the least significant eigenvectors), accuracy-based (eliminating eigenvectors whose contribution is less than the accuracy of calculation), or empirically based (a posteriori elimination of eigenvectors that are associated with identifiable sources of error or artefact).

IDF, TFIDF and Entropy

We now consider the most common preprocessing technique used in IR, TFIDF (Term Frequency * Inverse Document Frequency). There are a number of common variations of TFIDF that relate to smoothing (e.g. adding one to all values or to zero frequencies) and normalization (e.g. normalizing TF by the frequency of the most common stem in a document rather than by the length of the document). We will ignore these considerations, along with stemming, stopping and other processes unique to IR.

The two basic rates that underlie TFIDF are Term Frequency (TF) and Document Frequency (DF), expressed relative to a corpus of N documents of average size D, individual size Dd and lexicon size L. If TCw represents the number of occurrences of a specific term w in the corpus, then TFw = TCw/N is its expected relative term frequency, while TCwd and TFwd = TCwd·D/Dd represent its count and its normalized relative term frequency, respectively, in the specific document d, with the expected value of TFwd under the assumption Dd = D being TFw = ED(TFwd).

Similarly if DCw represents the number of documents w occurs in, DFw = DCw/N is its Document Frequency.
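These definitions translate directly into counts; the following sketch uses a hypothetical toy corpus and a trivial whitespace tokenizer purely for illustration.

```python
# Sketch: term and document frequencies per the definitions above (toy corpus).
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat",
        "a cat and a dog"]                           # hypothetical corpus
tokenized = [d.split() for d in docs]
N = len(tokenized)

TC = Counter(w for doc in tokenized for w in doc)        # corpus term counts TCw
DC = Counter(w for doc in tokenized for w in set(doc))   # document counts DCw

TF = {w: TC[w] / N for w in TC}                      # TFw: expected occurrences per document
DF = {w: DC[w] / N for w in DC}                      # DFw: fraction of documents containing w
```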





Note that TFw and TFwd are both assumed to be rates relative to a typical document of size D and we have made explicit an assumption that is both ubiquitous and iniquitous, but typically only implicit, in the IR literature – namely that all documents are roughly the same size D.

This is clearly not true in most domains, but it can be made truer by dividing documents up into segments or contexts of approximately the same size, choosing D to represent the typical size for topicality of a term. This may be regarded as an assumption that all terms tend to have bursts according to the same probability distribution (again not true), or used to help define what we mean by topicality and thus assist in parameterizing a more accurate model of the distribution.
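One way to realize this, sketched below with an arbitrary choice of D, is simply to slice each document into consecutive contexts of roughly D tokens before computing the statistics.

```python
# Sketch: slice documents into roughly fixed-size contexts of D tokens.
def contexts(tokens, D=200):
    """Yield consecutive slices of about D tokens from one tokenized document."""
    for start in range(0, len(tokens), D):
        yield tokens[start:start + D]

# Usage: replace each document by its contexts before computing TF and DF,
# e.g. segmented = [ctx for doc in tokenized for ctx in contexts(doc, D=200)]
```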

The information conveyed by DFw is IDFw = –log2(DFw), the misnamed Inverse Document Frequency, being the number of bits required to efficiently represent the fact that term w occurs in a document. This is a useful measure in its own right (Sparck-Jones, 1972), has a Bayesian justification as well as the information-theoretic motivation given here (Robertson, 2004), and captures the far greater importance of the fact of occurrence in a document versus the precise count or (relative) frequency. Note that much of the work using these measures assumes Dd ≈ D is (approximately) constant, and several corpora (e.g. Brown, BNC) use extracts of (approximately) constant size rather than full documents. It is also misleading to think of documents as being about various topics, as different parts of a document may touch on, or even spend considerable time on, a topic and then move on to another; thus it can be useful to talk about contexts rather than documents, and these cannot in general be assumed to be of constant size either. We can, however, select extracts of constant size that relate to a specific topic.
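Continuing the toy-corpus sketch above, IDFw follows directly from DFw:

```python
# Sketch: IDFw = -log2(DFw), the bits needed to encode that w occurs in a document.
import math

IDF = {w: -math.log2(DF[w]) for w in DF}
```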

We define TFIDFwd as TFwd*IDFw. As an entropy estimate this represents the contribution of word w to the size of the document when each occurrence is represented in IDFw bits, but this is not an efficient representation given words actually tend to occur much more than once in the documents they occur in. Thus it is somewhat surprising that TFIDF should for so long have been regarded (and empirically demonstrated) as superior to all other tested weighting schemes (Aizawa, 2003; Robertson, 2004; R&Y text, XXX).
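Continuing the same sketch, the TFIDF weight of a term in a document can be read as the bits it contributes under this (inefficient) per-occurrence code; here D is taken to be the average document length, matching the constant-size assumption above.

```python
# Sketch: TFIDFwd = TFwd * IDFw for one document of the toy corpus.
from collections import Counter

D = sum(len(doc) for doc in tokenized) / N           # average document size
d = tokenized[0]
TFd = {w: c * D / len(d) for w, c in Counter(d).items()}   # TFwd = TCwd * D / Dd
TFIDFd = {w: TFd[w] * IDF[w] for w in TFd}           # bits contributed by each w to d
```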

The K-mixture and an improved entropy estimate

Church & Gale (1995), pre-empting Katz (1996), proposed a model dealing with both the probability a document was a relevant context for a word (α) and the rate of occurrence of a term in a relevant context (β).

This formulation of the Katz model, the K-mixture, represents the probability of a term occurring k times in a document as

Pw(k) = (1–α)·0^k + (α/[β+1])·(β/[β+1])^k        (25)

where 0^k is 1 if k = 0 and 0 otherwise.
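A direct transcription of (25) with hypothetical parameter values; summing over k confirms that the mixture is a proper distribution.

```python
# Sketch: the K-mixture probability of k occurrences of a term in a document.
def k_mixture(k, alpha, beta):
    spike = (1.0 - alpha) if k == 0 else 0.0          # the (1 - alpha) * 0^k term
    tail = (alpha / (beta + 1.0)) * (beta / (beta + 1.0)) ** k
    return spike + tail

print(sum(k_mixture(k, alpha=0.1, beta=2.0) for k in range(200)))   # ~1.0
```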

Setting the parameters α and β for a term w may be done empirically, but it is usual to relate them to TF and DF. The rate of occurrence of a term w in a relevant context (β) may be estimated from the rate of occurrence in documents in which it is known to occur (γ = TFw/DFw), but γ would be a significant overestimate for β since these contexts are already known to contain one occurrence. Thus it is usual to set β = γ – 1, although this now probably underestimates β, as the context is reduced by one word and there is usually a minimum distance before a term recurs. Under this definition, β is set from the rate of additional occurrences in documents the term is known to occur in (β = [TFw – DFw]/DFw). We may now note that Pw(0) = 1 – DFw is the probability of a null context, being the sum of the independent cases of irrelevant contexts and null relevant contexts, from which we can show that α = TFw/β.
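Continuing that sketch with hypothetical corpus statistics TFw and DFw, the parameters follow exactly as described, and the fitted model reproduces the observed document frequency through Pw(0) = 1 – DFw.

```python
# Sketch: setting alpha and beta from TFw and DFw (hypothetical values).
TF_w, DF_w = 0.5, 0.2
gamma = TF_w / DF_w                 # occurrences per document known to contain w
beta = gamma - 1.0                  # additional occurrences per such document
alpha = TF_w / beta                 # rate of relevant contexts

print(1.0 - k_mixture(0, alpha, beta), DF_w)   # both ~0.2
```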

Note that Equation 25 uses γ as a denominator discounting both α and β, and the corresponding terms η = (α/[β+1]) and δ = (β/[β+1]) = [TFw – DFw]/TFw both have intuitive interpretations: η is the rate of relevant contexts with no occurrences of the term (seen by setting k = 0), and δ is the probability of an (additional) occurrence in a relevant document (seen by stepping from k to k+1). The function of γ here can also be understood as a change from average rates over documents to average rates over individual words.

Note that the rate of occurrence of a relevant context (α) may be estimated from the rate of occurrence of documents containing the term (DFw), although this again would be an underestimate as there are in general relevant documents in which the term could be expected to occur but it does not (perhaps a synonym is used). The correction factor is the ratio of all relevant contexts to non-null contexts which is equivalent to the ratio of all occurrences to extra occurrences, 1/δ – as the rate of extra occurrences is assumed to be the same in a relevant context irrespective of whether it is non-null or null (viz. whether or not w actually occurs in that context).

We now re-express the K-mixture (25) as the more compact P-mixture (26) which is parameterized directly by two probabilities, showing the interrelationships compactly in terms of the derived rates and their interpretations as probabilities and expectations. Note that α is expressed as the sum of the geometric progression formed by the second term of the mixture summed over all k. Similarly β and β+1=γ, which are expectations rather than probabilities, are eschewed in this formulation in favour of (conditional and joint) probabilities δ, ε and η, allowing the P-mixture formula and other identities to be verified directly by multiplication of probabilities and conditional probabilities or expectations.

Identities 27 to 35 relate to the usual parameterization of the K-mixture known as the Katz model. We emphasize the pivotal role in the mixtures of the ratio γ (the expected number of occurrences of w in a D-term context that is known to contain w). The ratio β (the expected number of occurrences of w in a D-term context that is known to be relevant to w) and the normalized ratio δ (the estimated probability of an occurrence of w in a D-term context that is known to be relevant to w) depend solely on γ. The reciprocal of γ, the ratio ε, is directly interpretable as the probability of non-occurrence in a relevant context and is thus preferred. ζ is introduced (32) as the solution of the quadratic that shows how we can derive γ from TFw and η, and takes on the pivotal role in an improved parameterization we introduce subsequently. λ and µ (33 & 34) are conventionally used in Poisson models based on term frequency and document frequency, but we discount these by β to derive probabilities α and η resp.

Assumptions. We make explicit the assumption of the K-mixture that documents may be approximated as being of a fixed size D that represents the expected scope of relevance of a term. D is typically regarded as being of the order of 200 to 500 topical terms, or some 1000 to 2000 tokens. According to Zipf's law the top 150 words account for 50% of the tokens in a corpus, and these are largely grammatical function words (closed-class words). In addition, at least 50% of the so-called content words (open-class words) are also very general (generic), which leaves 20 to 25% of tokens as topical terms. Thus we prefer to refer to D-term contexts, rather than documents, to emphasize that relevance (topicality) is local, that the probability PD depends on D, and that the large-D assumption behind a Poisson distribution does not apply in a D-term context.

We also make explicit the assumption that the fact that a document d contains a term w implies that w is relevant to d (but not vice versa); we express this as d ∋ w → d Я w, and generally omit d Я w (relevant for) when it is implied by d ∋ w (contains).

Pw(k) ≡ (1–α)·0^k + α[1–δ]·δ^k = (1–α)·0^k + η·δ^k = (1–η/ε)·0^k + η·(1–ε)^k        (26)

α ≡ λ/β = µ/δ = η/ε = γη = µ + η = PD(d Я w)
β ≡ γ – 1 = ζ – ½ = [λ–µ]/µ = µ/η = δ/[1–δ] = ED(TFwd | d Я w)
γ ≡ β + 1 = ζ + ½ = λ/µ = α/η = 1/[1–δ] = ED(TFwd | d ∋ w)
δ ≡ [ζ–½]/[ζ+½] = µ/α = β/γ = βε = 1 – ε = [λ–µ]/λ = PD(d ∋ w | d Я w)
ε ≡ 1/γ = 1 – δ = µ/λ = PD(d ∌ w | d Я w)
ζ ≡ [λ/η + ¼]^½ = γ – ½ = β + ½
η ≡ µ/β = α/γ = αε = α[1–δ] = PD(d ∌ w & d Я w)
λ ≡ TFw = αβ = γµ = αγδ = βγη = ED(TFwd)
µ ≡ DFw = αδ = βη = αβ/γ = αβ[1–δ] = αβε = PD(d ∋ w) = PD(d ∋ w & d Я w)

An improved term entropy estimate based on the above assumptions and model may thus be defined (36), noting that the model is defined by any two of the mutually independent parameters (viz. at most one of the pairwise dependent β, γ, δ, ε and ζ).
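The identities can be spot-checked numerically from a hypothetical pair (TFw, DFw) by deriving the remaining quantities in the order suggested above, for example:

```python
# Sketch: numerical spot-check of the identities from lambda = TFw and mu = DFw.
lam, mu = 0.5, 0.2                  # hypothetical TFw and DFw
gamma = lam / mu
beta = gamma - 1.0
delta = beta / gamma
eps = 1.0 / gamma
alpha = lam / beta
eta = alpha * eps
zeta = gamma - 0.5

assert abs(alpha - (mu + eta)) < 1e-12                  # alpha = mu + eta
assert abs(mu - alpha * delta) < 1e-12                  # mu = alpha * delta
assert abs(zeta - (lam / eta + 0.25) ** 0.5) < 1e-12    # zeta = [lambda/eta + 1/4]^(1/2)
```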


