Smoothing, Normalization, Correction and Reduction
David M.W. Powers
School of Informatics and Engineering, Flinders University of South Australia


λ = TF* = µ + φ, DF = µ; α = µ·(µ + φ)/φ = µ·(µ/φ + 1)

Solving the quadratic inequality (55) for positive µ and φ and parameterizing with f, we obtain φ ≥ µ²/(1-µ), equivalently µ ≤ [(φ² + 4φ)^½ − φ]/2, and

φ = µ²/[f·(1-µ)], α = f·(1-µ) + µ, 0 < f < 1 (59)

Hence a minimal variant of ‘add φ’ smoothing would be to set documents with such expected levels of occurrence and one actual occurrence to 1+φ/µ. In this model, setting f = 1 violates the model, and even f = 0.5 is not conservative enough as it states that half the null contexts are actually relevant. However, a good search term has µ < 0.5 as the appropriate estimate of the probability of occurrence in a context whose relevance is not known, and f = µ leads to the definition of φ as the odds ratio ρ:

φ = ρ = µ/(1-µ), α = µ + µ·(1-µ) = 2µ − µ²; γ = (2-µ)/(1-µ); β = 1/(1-µ); δ = 1/(2-µ)

where the increased estimate β becomes our increased estimate of frequency when TF = DF and all non-zero frequencies are 1. Note that β > 1 and γ > 2.
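These closed forms are easy to check numerically. The following sketch (the function name is my own, not from the paper) computes the f = µ parameters and verifies that the two expressions for α agree:

```python
# Illustrative sketch (not the paper's code): the f = mu parameterization,
# where phi becomes the odds ratio rho = mu/(1-mu).

def odds_ratio_params(mu):
    """Given expected rate mu = DF (0 < mu < 1), return the smoothing
    parameters under f = mu, i.e. phi = rho = mu/(1-mu)."""
    assert 0.0 < mu < 1.0
    phi = mu / (1.0 - mu)            # phi = rho, the odds ratio
    alpha = 2.0 * mu - mu * mu       # alpha = mu + mu*(1-mu)
    beta = 1.0 / (1.0 - mu)          # increased frequency estimate when TF = DF
    gamma = (2.0 - mu) / (1.0 - mu)  # expected occurrences in a non-null context
    delta = 1.0 / (2.0 - mu)
    return phi, alpha, beta, gamma, delta

mu = 0.1
phi, alpha, beta, gamma, delta = odds_ratio_params(mu)
# alpha agrees with the general form alpha = mu*(mu/phi + 1)
assert abs(alpha - mu * (mu / phi + 1.0)) < 1e-12
assert beta > 1.0 and gamma > 2.0
```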

When TF = DF = µ, there is no distinction between relevant and non-relevant contexts and our definitions are useful only to describe a zeroth order model Gº in which the probability of occurrence in any context, P0 = µ, is equated with the probability of occurrence in a relevant context, δ. To bootstrap a first order model that takes into account the relevance of a context, we estimate the true probability of null contexts that are relevant with no occurrences as f = µ, the observed proportion of contexts that are non-null and hence relevant but have no recurrences, giving us the parameterization in (59) and estimates of α and δ that satisfy 0 < µ ≤ α, δ < 1 for all values of TF and DF.

Thus one approach to ensuring accurate demeaning of the non-null contexts is to reestimate β and γ as above when TF=DF and then, for all cases, subtract β from the actual frequencies, β being our estimate of the expected number of occurrences in a relevant context (modification β).

The error introduced by modification of each non-null context is thus β − µ > 0, while an equal error of opposite sign is introduced by the modification of each null context. An argument could also be made for subtracting γ, as our estimate of the expected number of occurrences in a non-null context (modification γ); however, this unfairly uses the single occurrence to define a relevant document and then discounts it as an occurrence – the question of relevance must be decided independently for γ to be a valid estimate.

Note that applying Good-Turing smoothing or Katz backoff across the different term types will tend to give a decreased estimate of the frequency of such rare terms drawn from an open class as assumed for search terms (Gale, AAAA), and its underlying exponential form is not necessarily appropriate for smoothing across types that obey Zipf’s Law (Samuelsson, 1996; Gale, AAAA), although Samuelsson’s ‘deceptively similar’ modification that conforms to the Zipfian distribution may be. However, the strict version of Zipf’s Law implies a finite lexicon, in contradiction of our notion of open class, and needs to be understood as a sum of distinct distributions with different characteristics (Powers, 1998); Zipf (1956) also noted that it was dependent on document size.

The implication of point 3 is that we need to go beyond the occurrence count to determine whether a context is relevant – there are approaches based on word similarity using techniques such as LSI or WordNet that give an independent estimate of the probability of a specific context being relevant. If we are using SVD to calculate the LSI, then we have a recurrence problem: we could come back with a revised estimate and recalculate everything.

Note that log(1+x) ≈ x for small x. Hence it is sufficient to replace the splog value for unit occurrences by -γ to implement the modification for TF=DF, representing the lower than expected occurrence counts.

More generally, log(k+x) ≈ log(k) + x/k, so that to achieve modification µ resp. α, β or γ it is sufficient to decrement the splog value by µ/k resp. α/k, β/k or γ/k. We will refer to the original model as splog0, and to the models implementing the respective modifications as splogµ, splogβ, etc. Model splogµ is both sparsity- and occurrence-preserving, whilst splogβ may introduce additional zeros when β is integral, and the other two are not usable in the absence of independent information about relevance.
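The log-space decrement can be sketched numerically. The names splog0 and splog_mod below are my own, and taking k as the additive constant of the splog transform is an assumption about its definition:

```python
import math

# Illustrative sketch of the splog modifications (my naming, based on the text):
# splog0(x) = log(k + x); modification by m (m in {mu, alpha, beta, gamma}) is
# approximated in log space by decrementing the splog value by m/k, using the
# first-order expansion log(k + x - m) ~ log(k + x) - m/k for small x and m.

def splog0(x, k=1.0):
    return math.log(k + x)

def splog_mod(x, m, k=1.0):
    """splog with modification m, implemented as a decrement of m/k."""
    return splog0(x, k) - m / k

# The decrement tracks the exact shifted value to first order for small x, m:
mu = 0.05
exact = math.log(1.0 + 0.1 - mu)   # subtracting mu from a fractional count
approx = splog_mod(0.1, mu)
assert abs(exact - approx) < 0.01
```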

The best choice is arguably splogµ in terms of guaranteeing preservation of both sparsity and occurrence information. Moreover, splogµ introduces considerably fewer errors, and hence a much lower sum-of-squares error, in the high-sparsity expected and desirable distributions for useful search terms, in which the number of null contexts (splogβ errors) vastly exceeds the number of non-null contexts (splogµ errors). In addition, splogµ restricts the effective measure of central tendency to the range between median or mode and geometric mean (see Eqn 53).

Note that the negative values that occur when splogging a frequency that is less than expected could actually be useful as a non-positive condition for implementing the Boolean NOT operator – it is arguably better to say that the document has a lower than expected occurrence rate than to force it to have zero occurrences.

Parameterization of the P-mixture

We perform a similar analysis to see how TE handles the pathological cases. As noted above, the K-mixture and P-mixture are not well defined in this case, as γ=1 and β=δ=0, violating our constraint that 0 < δ < 1 (and α and η are undefined as a result).

Analogous to the corrections in the previous section, we can propose two simple corrections to the K-mixture: 1. setting γω=1.5 in the specific case where TFw=DFw; or 2. incrementing γ in all cases using γω=γ+0.5. In both cases identities 28 and 30 to 32 can be used to define βω, δω, εω and ζω from γω alone. These are clearly no longer consistent with the empirical values for TFw and DFw, so we give priority to DFω ≡ DFw and treat TFw as a truncated estimate of the true value TFω ≡ γωDFw according to identity 34. Identities 27 and 33 may now be used to define αω and ηω. Note that whereas previously we considered an ‘add φ to µ’ model, here we are parameterizing an ‘add 1/κ to β and γ’ model, recalling that γ estimates TFwd and TCwd.
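The two corrections can be written as a sketch; the function name is my own and the use of raw integer counts is purely illustrative:

```python
# Illustrative sketch of the two simple K-mixture corrections described above:
# correction 1 sets gamma_omega = 1.5 only in the degenerate TFw == DFw case;
# correction 2 increments gamma by 0.5 in all cases.

def corrected_gamma(tf_w, df_w, always=False):
    gamma = tf_w / df_w
    if always:
        return gamma + 0.5                    # correction 2: increment everywhere
    return 1.5 if tf_w == df_w else gamma     # correction 1: degenerate case only

assert corrected_gamma(2, 2) == 1.5           # TFw = DFw: corrected to 1.5
assert corrected_gamma(4, 2) == 2.0           # otherwise the empirical ratio
assert corrected_gamma(4, 2, always=True) == 2.5
```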

Elaborating the two simple cases and asserting condition 55 as a constraint, we get

λ = TFω = 1.5 · DFω, µ = DFω = DFw = 2φ ≤ ⅓   if DFw=TFw (60)
λ = TFω = [TFw/DFw + 0.5] · DFω, µ = DFω = DFw = 2φ ≤ ⅓ (61)

More generally, we can reduce the increment in line with the specificity of terms κ:

λ = TFω = (1 + 1/κ) · DFω, µ = DFω = DFw = φκ ≤ 1/(κ+1)   if DFw=TFw (62)
λ = TFω ≡ TFw + φ = [TFw/DFw + 1/κ] · DFω, µ = DFω ≡ DFw = φκ ≤ 1/(κ+1) (63)

Now equations 26 to 33 define the κP-mixture for these smoothed models. Note that in the DFw=TFw case, we have β = λ/µ − 1 = 1/κ (62 or 63). Furthermore, with κ = 2 (Eqn 61 ≡ 63) ζ corresponds in value to γ in the original K-mixture, and the κ=2 parameterization of the κP-mixture leads to a ζ-parameterization in terms of ζ = TFw/DFw, µ = DFw, referred to as the ζP-mixture, whereas the K-mixture is parameterized in terms of γ = TFw/DFw, µ = DFw, representing the limit as κ approaches infinity, and can also be identified as the γP-mixture. The special case κ = 1 gives φ = µ, β = TFw/DFw and a standard ‘add one’ model, referred to as the βP-mixture.
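A minimal sketch of the κ-parameterization of Eqn 63, with my own function name and treating TFw and DFw as empirical relative frequencies:

```python
# Illustrative sketch of the kappa-P-mixture parameterization (Eqn 63):
# lambda = (TFw/DFw + 1/kappa) * DFw, phi = mu/kappa, mu = DFw.

def kappa_p_mixture(tf_w, df_w, kappa):
    mu = df_w
    phi = mu / kappa                            # fixed 'add phi' increment
    lam = (tf_w / df_w + 1.0 / kappa) * df_w    # smoothed total frequency
    return lam, mu, phi

# In the DFw = TFw case, beta = lambda/mu - 1 reduces to 1/kappa:
lam, mu, phi = kappa_p_mixture(0.02, 0.02, 2.0)
assert abs((lam / mu - 1.0) - 0.5) < 1e-12

# kappa = 1 gives phi = mu, a standard 'add one' model on this scale,
# and lambda = TFw + phi as in Eqn 63:
lam, mu, phi = kappa_p_mixture(0.05, 0.02, 1.0)
assert abs(phi - mu) < 1e-12 and abs(lam - (0.05 + phi)) < 1e-12
```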

With these models we are setting a fixed φ = µ/κ subject to µ ≤ 1/(κ+1), or equivalently κ ≤ (1-µ)/µ (both forms following from substitution of φ in Eqn 56). This condition must hold for µ=DFw for all terms w, so we must set κ on the basis of the greatest value of DFw observed in the corpus. We may use this parameter either for just the DF=TF case (Eqn 60, or more generally 62), or for all cases (61 and 63).

Alternatively, if we use f = µ, φ = µ/(1-µ) to define the parameterization (59, or 63 with κ = µ/φ = 1-µ ≤ 1), the mixture is self-adjusting and automatically satisfies κ ≤ (1-µ)/µ = 1/φ = κ/µ and thus µ ≤ 1/(κ+1). Observe that this also forces κ ≤ 1, whereas in the constant κ-parameterizations we have normally tried to set κ ≥ 1. In the limit this auto-parameterization approaches the φ = µ, κ=1 (‘add one’) case as µ approaches 0 (corresponding precisely to those rarer terms where TFw=DFw is more likely, and in which limit we find TFω=2DFω). In this case of small µ we can also approximate 1/κ = 1/(1-µ) ≈ 1+µ and hence φ ≈ µ·(1+µ) = µ + µ², being early terms of the GP that defines the Gº model.
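The self-adjusting property is easy to verify numerically; the sketch below (my own naming) checks that κ = 1-µ satisfies the constraint µ ≤ 1/(κ+1) for any µ in (0,1):

```python
# Sketch of the self-adjusting parameterization f = mu, kappa = 1 - mu:
# the constraint mu <= 1/(kappa+1) then holds automatically, since
# mu <= 1/(2 - mu) is equivalent to (1 - mu)^2 >= 0.

def auto_params(mu):
    assert 0.0 < mu < 1.0
    kappa = 1.0 - mu
    phi = mu / kappa                  # = mu/(1-mu), the odds ratio
    return kappa, phi

for mu in (0.001, 0.1, 0.5, 0.9):
    kappa, phi = auto_params(mu)
    assert mu <= 1.0 / (kappa + 1.0) + 1e-12   # constraint auto-satisfied
    assert kappa <= 1.0                        # the parameterization forces kappa <= 1
# As mu -> 0, phi ~ mu + mu^2, approaching the 'add one' (kappa = 1) case.
```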

The question is whether (63) is a theoretically reasonable generalization of (62) for either the constant or the self-adjusting parameterization using κ = 1-µ. In particular, unit TFwd counts will occur for conditions other than DFw=TFw, and must occur for TFw < 2DFw. In these cases the TFwd also represent underestimates of the actual probabilities according to our assumptions about content words recurring in relevant contexts (and so defining bursts or clusters).

Assume the validity of the TFw=DFw reestimate (62) and now consider a case where the expected count for term w is TFw=DFw+k. According to our argument in defining this estimate, the probability of unexpressed occurrence or recurrence has decreased, and there will be exactly one document containing two occurrences in the case k=1, so we still find f=µ. For k > 1 it is possible to have only documents with one or two occurrences, for which f=µ again follows, but it is also possible to find larger clusters, which we would however argue is consistent with the same distribution but simply further from the mode.

We now analyze how the probability of a document being relevant given there are zero occurrences varies with k, defining this as pk satisfying

α = PD(d Я w) = PD(d ∋ w) + PD(d Я w | d ∌ w) · PD(d ∌ w) = µ + pk·(1-µ)

By some simple arithmetic manipulations we see

α = [(TF+φ) · µ] / [(TF+φ) − µ] = µ + µ²/[k+φ] = µ + (1-µ)/[k·(1-µ)/µ² + 1/f]

This gives us a number of interesting and insightful relationships involving the odds ratio ρ = µ/(1-µ), and in particular ρµ (which is constant for a given w or µ = DFw):

pk = ρµ/[φ+k]
φ = µ²/[f·(1-µ)] = ρµ/f
1/pk = 1/f + 1/fk, where fk = µ²/[k·(1-µ)] = ρµ/k and f = fφ = µ²/[φ·(1-µ)] = ρµ/φ

Recalling that interval scales are often more appropriate and amenable to statistical manipulation than the reciprocal frequency and probability scales, we see that f = fφ defines the k=0 proportion of the null documents, and k specifies increments in terms of the interval φ, with the smoothing technique being clearly seen as of ‘add one’ character on this interval scale. Moreover, although (67) is apparently only well defined for k ≥ 0, because of being defined in terms of probabilities, this really only illustrates again that the fundamental events are occurrences defining intervals, and that probabilities are not always the most convenient averages to use as the basis for a model. Thus the φ and 1/f terms go to 0 in the K-mixture limit, and the k and 1/fk terms go to 0 in the TF=DF limit, and when both limiting conditions hold we confirm that the K-mixture is not well defined for TF=DF.
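These identities can be checked numerically. The sketch below (my own naming, illustrative values) verifies pk = ρµ/(φ+k), the reciprocal decomposition 1/pk = 1/f + 1/fk, and the agreement of the two expressions for α:

```python
# Numerical check (illustrative, my naming) of the interval-scale identities:
# p_k = rho*mu/(phi + k), with phi = rho*mu/f and f_k = rho*mu/k.

def p_k(mu, f, k):
    rho = mu / (1.0 - mu)            # odds ratio
    phi = rho * mu / f               # phi = mu^2 / (f*(1-mu))
    return rho * mu / (phi + k)

mu, f, k = 0.1, 0.1, 3.0
rho = mu / (1.0 - mu)
phi = rho * mu / f
fk = rho * mu / k
pk = p_k(mu, f, k)

# reciprocal decomposition: 1/p_k = 1/f + 1/f_k
assert abs(1.0 / pk - (1.0 / f + 1.0 / fk)) < 1e-9

# alpha = mu + pk*(1-mu) agrees with alpha = mu + mu^2/(k + phi)
assert abs((mu + pk * (1.0 - mu)) - (mu + mu * mu / (k + phi))) < 1e-12
```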

What this shows is that whatever justification we have for the k=0 case (TF=DF), it carries over sensibly to k > 0 cases in a manner highly reminiscent of Zipfian smoothing methods rather than Good-Turing or the original Katz (1987) backoff scheme: Good-Turing/Katz backoff smooths towards an infinite geometric distribution over types that is similar to Gº, but Zipf’s Law implies a finite inverse linear distribution, and both of these appear to define bounds on the true distribution rather than reflecting the distributions directly (Samuelsson, 1996; Powers, 1998). Thus we conclude that the generalization of (62) to (63) is appropriate for any reasonable definition of κ, and predict that f=µ, κ=1-µ will be better than the constant κ-parameterizations.

Note that Katz (1996) both defined and tested models using the same corpus and aggregations of terms with similar statistics, whereas the correct way to test the model is to parameterize it using empirical probabilities drawn from a training set and to validate the model and compare models using an independent validation set and then publish results for the selected optimum model for a third independent test set.

The partitioning may be done on the basis of documents or terms, or preferably both.

We called the κ=1 and κ=2 parameterizations according to equations 61 and 63 the β- and ζ-parameterizations, defining the β- and ζP-mixtures as discussed above. We define the κ=1-µ parameterization according to equation 63 as the µP-mixture, and refer to other parameterizations with a constant κ as κP-mixtures. Equation 62 does not define a true parameterization but rather a correction of the one specific degenerate case of the K-mixture, and we thus refer to these as κP-corrections, with the κ=1-µ µP-correction being the recommended correction. Note that the constant κ parameterizations and corrections are subject to the constraint µ ≤ 1/(κ+1) and introduce an additional parameter, whilst the κ=1-µ constraint of the canonical P-mixture or P-correction does not introduce an additional parameter. It is now a matter for empirical investigation as to which of these four models is best and, in the case of the κ-parameterization or κ-correction, which value of κ, as well as demonstrating that they are better than the degenerate zeroth order Gº model that arises when TF=DF (this must be demonstrated in application to unseen text, where this may or may not hold). The auto-parameterized P-mixture may be expressed directly as

λ = TFω = [TFw/DFw + 1/(1-µ)] · DFω, µ = DFω = DFw = φ·(1-µ) ≤ 1/(κ+1)

The discussed variants of the κP-mixture defined by equations 26 to 33 and 63 (replacing 34 and 35) may be summarized as follows, noting that ρ is the odds ratio µ/(1-µ) = f/κ, that φ = µ/κ = µρ/f, and that 0 < f < 1 is required for a valid model, so that the γP-mixture is the ill-defined unsmoothed K-mixture defined by µ = DFw and one of γ as the empirical ratio γo = TFw/DFw, δ as δo = 1 − 1/γo, or ε as εo = 1 − δo = 1/γo:
