Finally, users are overwhelmed by the multitude of ways a query can be structured or modified, because of the combinatorial explosion of feasible queries as the number of concepts increases. In particular, users have difficulty identifying and applying the different strategies that are available for narrowing or broadening a Boolean query [Marcus , Lancaster and Warner ]. On the one hand, the AND operator is too severe because it does not distinguish between the case when none of the concepts are satisfied and the case where all except one are satisfied.
Hence, no or very few documents are retrieved when more than three or four criteria are combined with the Boolean operator AND, which is referred to as the Null Output problem. On the other hand, the OR operator does not reflect how many concepts have been satisfied. Hence, often too many documents are retrieved, the Output Overload problem. Users are often faced with the null-output or the output-overload problem, and they are at a loss as to how to modify the query to retrieve a reasonable number of documents.
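The two failure modes can be seen on a toy corpus. In the following sketch, the documents and query terms are invented for illustration; conjoining four terms retrieves almost nothing, while disjoining them retrieves everything:

```python
# Illustrative only: a toy corpus showing the Null Output and Output
# Overload problems. Document texts and query terms are invented.
docs = {
    1: "query languages for boolean retrieval systems",
    2: "boolean logic in digital circuits",
    3: "retrieval of documents from large systems",
    4: "query optimization in database systems",
}

def matches_and(terms, text):
    # Strict conjunction: every term must occur in the document.
    return all(t in text.split() for t in terms)

def matches_or(terms, text):
    # Disjunction: any single term suffices.
    return any(t in text.split() for t in terms)

terms = ["query", "boolean", "retrieval", "systems"]

and_hits = [d for d, text in docs.items() if matches_and(terms, text)]
or_hits = [d for d, text in docs.items() if matches_or(terms, text)]

print(len(and_hits))  # 1 -> near-null output: only doc 1 satisfies all four terms
print(len(or_hits))   # 4 -> output overload: every document matches at least one term
```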
If users want to reformulate a Boolean query, then they need to make informed choices along these four dimensions to create a query that is sufficiently broad or narrow, depending on their information needs.
Most narrowing techniques raise precision but also lower recall, and most broadening techniques raise recall but also lower precision. Any query can be reformulated to achieve the desired precision or recall characteristics, but it is generally difficult to achieve both.
Each of the four kinds of operations in the query formulation has particular operators, some of which tend to have a narrowing or broadening effect.
For each operator with a narrowing effect, there is one or more inverse operators with a broadening effect [Marcus ]. Hence, users require help to gain an understanding of how changes along these four dimensions will affect the broadness or narrowness of a query. By moving in the direction in which the wedges are expanding, the query is broadened.
The most stringent proximity constraint requires the two terms to be adjacent. By reducing a term to its morphological stem and using it as a prefix, users can retrieve many terms that are conceptually related to the original term [Marcus ]. Using Figure 2, we can formulate a very broad query: we just need to move in the direction in which the wedges are expanding, so we use the OR operator rather than AND, impose no proximity constraints, search over all fields, and apply a great deal of stemming.
Similarly, we can formulate a very narrow query by moving in the direction in which the wedges are contracting: we use the AND operator rather than OR, impose proximity constraints, restrict the search to the title field, and perform exact rather than truncated word matches. In Chapter 4 we will show how Figure 2 can be used to visualize such query modifications. We will now describe such a method, called Smart Boolean, developed by Marcus [, ], that tries to help users construct and modify a Boolean query as well as make better choices along the four dimensions that characterize a Boolean query.
We are not attempting to provide an in-depth description of the Smart Boolean method, but to use it as a good example that illustrates some of the possible ways to make Boolean retrieval more user-friendly and effective. Users start by specifying a natural language statement that is automatically translated into a Boolean Topic representation that consists of a list of factors or concepts, which are automatically coordinated using the AND operator.
If the user at the initial stage can or wants to include synonyms, then they are coordinated using the OR operator. Hence, the Boolean Topic representation connects the different factors using the AND operator, where the factors can consist of single terms or several synonyms connected by the OR operator. One of the goals of the Smart Boolean approach is to make use of the structural knowledge contained in the text surrogates, where the different fields represent contexts of useful information.
Further, the Smart Boolean approach wants to use the fact that related concepts can share a common stem. The initial strategy of the Smart Boolean approach is to start out with the broadest possible query within the constraints of how the factors and their synonyms have been coordinated.
Hence, it modifies the Boolean Topic representation into the query surrogate by using only the stems of the concepts and searching for them over all the fields. Once the query surrogate has been executed, users are guided through the process of evaluating the retrieved document surrogates. They choose from a list of reasons to indicate why they consider certain documents relevant.
Similarly, they can indicate why other documents are not relevant by interacting with a list of possible reasons. This user feedback is used by the Smart Boolean system to automatically modify the Boolean Topic representation or the query surrogate, whichever is more appropriate.
The Smart Boolean approach offers a rich set of strategies for modifying a query based on the received relevance feedback or the expressed need to narrow or broaden the query. The Smart Boolean retrieval paradigm has been implemented in the form of a system called CONIT, one of the earliest expert retrieval systems, which was able to demonstrate that ordinary users, assisted by such a system, could perform as well as experienced search intermediaries [Marcus ].
However, users have to navigate through a series of menus listing different choices, where it might be hard for them to appreciate the implications of some of these choices. A key limitation of the previous versions of the CONIT system has been that they lacked a visual interface. The most recent version has a graphical interface, and it uses the tiling metaphor suggested by Anick et al.
This visualization approach suffers from the limitation that it only enables users to visualize specific queries, whereas we will propose a visual interface that represents a whole range of related Boolean queries in a single display, making changes in Boolean coordination more user-friendly. Further, the different strategies for modifying a query in CONIT require a better visualization metaphor to enable users to make use of these search heuristics.
In Chapter 4 we show how some of these modification techniques can be visualized. The Smart Boolean approach and the methods described in this section do not provide users with relevance ranking [Fox and Koll , Marcus ].
We will briefly discuss the P-norm and the Fuzzy Logic approaches that extend the Boolean model to address the above issues. The P-norm method developed by Fox allows query and document terms to have weights, which are computed by using term frequency statistics with the proper normalization procedures. These normalized weights can be used to rank the documents in order of decreasing distance from the point (0, 0, ..., 0). Further, the Boolean operators have a coefficient P associated with them to indicate the degree of strictness of the operator, from 1 for least strict to infinity for most strict, i.e. the strict Boolean interpretation.
The P-norm uses a distance-based measure and the coefficient P determines the degree of exponentiation to be used. The exponentiation is an expensive computation, especially for P-values greater than one.
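For the two-term case, the standard extended Boolean (P-norm) formulas can be sketched as follows; the weights used in the example are invented, and w1, w2 are assumed to be normalized to [0, 1]:

```python
# A minimal sketch of the two-term P-norm similarity formulas.
# p controls operator strictness: 1 is least strict, infinity is strict Boolean.

def pnorm_or(w1, w2, p):
    # Normalized distance from the all-zero point: higher is better for OR.
    return ((w1 ** p + w2 ** p) / 2) ** (1 / p)

def pnorm_and(w1, w2, p):
    # One minus the normalized distance from the all-one point.
    return 1 - (((1 - w1) ** p + (1 - w2) ** p) / 2) ** (1 / p)

# At p = 1 both operators reduce to a simple average of the weights;
# as p grows, OR approaches max(w1, w2) and AND approaches min(w1, w2).
print(pnorm_or(0.2, 0.8, 1))   # 0.5
print(pnorm_and(0.2, 0.8, 1))  # 0.5
```

The exponentiation for large p is exactly the cost mentioned above: each term weight must be raised to the power p before the scores can be combined.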
In Fuzzy Set theory, an element has a varying degree of membership to a set instead of the traditional binary membership choice. The weight of an index term for a given document reflects the degree to which this term describes the content of a document.
Hence, this weight reflects the degree of membership of the document in the fuzzy set associated with the term in question. The degree of membership for union and intersection of two fuzzy sets is equal to the maximum and minimum, respectively, of the degrees of membership of the elements of the two sets.
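The fuzzy interpretation of the Boolean operators is then just min and max over membership degrees. A minimal sketch, with invented membership weights:

```python
# Fuzzy Boolean retrieval: each index term has a degree-of-membership
# weight in [0, 1] for a document; the weights below are invented.
doc_weights = {"information": 0.9, "retrieval": 0.4}

def fuzzy_and(a, b):
    return min(a, b)   # intersection of fuzzy sets

def fuzzy_or(a, b):
    return max(a, b)   # union of fuzzy sets

print(fuzzy_and(doc_weights["information"], doc_weights["retrieval"]))  # 0.4
print(fuzzy_or(doc_weights["information"], doc_weights["retrieval"]))   # 0.9
```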
In the "Mixed Min and Max" model developed by Fox and Sharat the Boolean operators are softened by considering the query-document similarity to be a linear combination of the min and max weights of the documents. Both models use statistical information in the form of term frequencies to determine the relevance of documents with respect to a query.
Although they differ in the way they use the term frequencies, both produce as their output a list of documents ranked by their estimated relevance.
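The "Mixed Min and Max" softening described above can be sketched as a linear combination of the min and max term weights; the coefficient values here are arbitrary illustrative choices, not the tuned values from the original work:

```python
# Mixed Min and Max (MMM): soften strict min/max Boolean operators by
# mixing both extremes. c_and and c_or are tunable; 0.7 is arbitrary.

def mmm_and(weights, c_and=0.7):
    # AND leans toward the minimum weight but is no longer all-or-nothing.
    return c_and * min(weights) + (1 - c_and) * max(weights)

def mmm_or(weights, c_or=0.7):
    # OR leans toward the maximum weight but still rewards full matches.
    return c_or * max(weights) + (1 - c_or) * min(weights)

w = [0.2, 0.8]
print(mmm_and(w))  # 0.7*0.2 + 0.3*0.8 = 0.38
print(mmm_or(w))   # 0.7*0.8 + 0.3*0.2 = 0.62
```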
The statistical retrieval models address some of the problems of Boolean retrieval methods, but they have disadvantages of their own. We will also describe Latent Semantic Indexing and clustering approaches, which are based on statistical retrieval approaches, but whose objective is to respond to what the user's query did not say, could not say, but somehow made manifest [Furnas et al. ].
The creation of an index involves lexical scanning to identify the significant terms, where morphological analysis reduces different word forms to common "stems", and the occurrences of those stems are counted.
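This indexing step can be sketched as follows. The crude suffix-stripping rule stands in for real morphological analysis, and the stopword list is a minimal invented one:

```python
# A minimal sketch of index construction: tokenize, reduce words to
# crude "stems", and count stem occurrences per document.
from collections import Counter

STOPWORDS = {"the", "of", "a", "in", "to", "and"}

def crude_stem(word):
    # Toy suffix stripping; a stand-in for real morphological analysis.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text):
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(crude_stem(t) for t in tokens)

print(index_document("Indexing and retrieving indexed documents"))
# Counter({'index': 2, 'retriev': 1, 'document': 1})
```

Note how "indexing" and "indexed" collapse onto the same stem, which is exactly what later allows truncated (prefix) matching.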
Query and document surrogates are compared by comparing their vectors, using, for example, the cosine similarity measure. In this model, the terms of a query surrogate can be weighted to take into account their importance, and they are computed by using the statistical distributions of the terms in the collection and in the documents [Salton ].
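The comparison step can be sketched directly; the vectors and counts below are invented, and the tf-idf formula is one standard choice among many:

```python
# Comparing query and document surrogates with cosine similarity,
# using simple tf-idf weights. All numbers here are illustrative.
import math

def tfidf(tf, df, n_docs):
    # term frequency times inverse document frequency
    return tf * math.log(n_docs / df)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = [1.0, 0.0, 2.0]   # weights for three index terms
doc_vec = [0.5, 1.0, 1.0]

print(round(cosine(query_vec, doc_vec), 3))  # 0.745
```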
The vector space model can assign a high ranking score to a document that contains only a few of the query terms if these terms occur infrequently in the collection but frequently in the document. The vector space model makes the following assumptions: (1) the more similar a document vector is to a query vector, the more likely it is that the document is relevant to that query; (2) the words used as the dimensions of the space are pairwise independent.
While it is a reasonable first approximation, the assumption that words are pairwise independent is not realistic.
The principle takes into account that there is uncertainty in the representation of the information need and the documents. There can be a variety of sources of evidence that are used by the probabilistic retrieval methods, and the most common one is the statistical distribution of the terms in both the relevant and non-relevant documents.
We will now describe the state-of-the-art system developed by Turtle and Croft that uses Bayesian inference networks to rank documents by using multiple sources of evidence to compute the conditional probability P(information need | document) that an information need is satisfied by a given document. An inference network consists of a directed acyclic dependency graph, where edges represent conditional dependency or causal relations between propositions represented by the nodes. The inference network consists of a document network, a concept representation network that represents indexing vocabulary, and a query network representing the information need.
The concept representation network is the interface between documents and queries. To compute the rank of a document, the inference network is instantiated and the resulting probabilities are propagated through the network to derive a probability associated with the node representing the information need. These probabilities are used to rank documents. The statistical approaches have the following strengths: (1) They provide users with a relevance ranking of the retrieved documents. Hence, they enable users to control the output by setting a relevance threshold or by specifying a certain number of documents to display.
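The flavor of this propagation can be illustrated with a toy two-layer computation. This is not the actual Turtle and Croft network or its combination operators, just one common way (a noisy-OR) of combining independent pieces of evidence into a single belief:

```python
# An illustrative toy, not the Turtle/Croft network itself: a document
# activates concept nodes with given probabilities, and the
# information-need node combines the evidence with a noisy-OR.

def noisy_or(probs):
    # The need is unsatisfied only if every evidence source fails.
    result = 1.0
    for p in probs:
        result *= 1.0 - p
    return 1.0 - result

# Invented P(concept active | document) values for the concepts
# linked to the information-need node.
concept_probs = [0.6, 0.3]
print(noisy_or(concept_probs))  # 1 - 0.4*0.7 ≈ 0.72
```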
However, the statistical approaches have the following shortcomings: (1) They have limited expressive power. For example, the NOT operation cannot be represented, because only positive weights are used, and the very common and important Boolean query (A AND B) OR (C AND D) cannot be represented by a vector space query (see Section 5). Hence, the statistical approaches do not have the expressive power of the Boolean approach.
Proximity constraints, a feature of great use to experienced searchers, are also difficult to express. As is the case for the Boolean approach, users are faced with the problem of having to choose the appropriate words, that is, words that are also used in the relevant documents.
This table also shows the formulas that are commonly used to compute the term weights. The two central quantities used are the inverse document frequency of a term in the collection, idf, and the frequency freq(i,j) of a term i in a document j. In the probabilistic model, the weight computation also considers how often a term appears in the relevant and irrelevant documents, but this presupposes that the relevant documents are known or that these frequencies can be reliably estimated.
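One standard instantiation of these two quantities (a common tf-idf choice, not necessarily the exact formulas of the table referred to above) is

\[
w_{i,j} = \mathrm{freq}_{i,j} \cdot \mathrm{idf}_i,
\qquad
\mathrm{idf}_i = \log \frac{N}{n_i},
\]

where \(N\) is the number of documents in the collection and \(n_i\) is the number of documents that contain term \(i\). Rare terms thus receive high weights, and frequent terms low ones.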
If users provide the retrieval system with relevance feedback, then this information is used by the statistical approaches to recompute the weights as follows: the weights of the query terms in the relevant documents are increased, whereas the weights of the query terms that do not appear in the relevant documents are decreased [Salton and Buckley ]. There are multiple ways of computing and updating the weights, where each has its advantages and disadvantages.
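One classical update rule of this family (Rocchio) can be sketched as follows; the coefficient values and the example vectors are conventional illustrative choices, not a prescription:

```python
# A sketch of the Rocchio relevance-feedback update: the query vector
# moves toward relevant documents and away from non-relevant ones.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    dims = len(query)

    def centroid(vecs):
        if not vecs:
            return [0.0] * dims
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dims)]

    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    # Clip at zero: the statistical models above use only positive weights.
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, rel_c, non_c)]

q = [1.0, 0.0, 0.5]
updated = rocchio(q, relevant=[[0.0, 1.0, 1.0]], nonrelevant=[[1.0, 0.0, 0.0]])
print(updated)  # [0.85, 0.75, 1.25]: term 2 rises, term 1 falls
```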
We do not discuss these formulas in more detail, because research on relevance feedback has shown that significant effectiveness improvements can be gained by using quite simple feedback techniques [Salton and Buckley ]. Furthermore, what matters for this thesis is that the statistical retrieval approach generates a ranked list; how this ranking is computed in detail is immaterial for our purposes.
In LSI the associations among terms and documents are calculated and exploited in the retrieval process. The assumption is that there is some "latent" structure in the pattern of word usage across documents and that statistical techniques can be used to estimate this latent structure.
An advantage of this approach is that queries can retrieve documents even if they have no words in common. The LSI technique captures deeper associative structure than simple term-to-term correlations and is completely automatic. The only difference between LSI and vector space methods is that LSI represents terms and documents in a reduced dimensional space of the derived indexing dimensions.
As with the vector space method, differential term weighting and relevance feedback can improve LSI performance substantially.
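The reduced-dimensional representation can be sketched with a truncated SVD. The tiny term-document matrix below is invented; note how a query for one term matches a document that does not contain it, because the two terms co-occur elsewhere:

```python
# A minimal LSI sketch: truncated SVD projects terms and documents into
# a k-dimensional "latent" space.
import numpy as np

# rows = terms, columns = documents (invented toy counts)
A = np.array([
    [1.0, 1.0, 0.0],   # "car"
    [0.0, 1.0, 1.0],   # "automobile"
    [0.0, 0.0, 1.0],   # "engine"
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :]

# Fold a one-word query ("car") into the latent space.
q = np.array([1.0, 0.0, 0.0])
q_latent = q @ Uk / sk
doc_latent = Vk.T  # each row is a document in the latent space

sims = doc_latent @ q_latent / (
    np.linalg.norm(doc_latent, axis=1) * np.linalg.norm(q_latent)
)
print(sims)  # document 3 gets a nonzero score despite not containing "car"
```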
Foltz and Dumais compared four retrieval methods that are based on the vector space model. The four methods were the result of crossing two factors: whether the retrieval method used Latent Semantic Indexing or keyword matching, and whether the profile was based on words or phrases provided by the user (Word profile) or on documents that the user had previously rated as relevant (Document profile).
The LSI match-document profile method proved to be the most successful of the four methods.
This method combines the advantages of both LSI and the document profile. The document profile provides a simple, but effective, representation of the user's interests.
Traditional evaluation metrics, designed for Boolean retrieval or top-k retrieval, include precision and recall. All measures assume a ground truth notion of relevance: every document is known to be either relevant or non-relevant to a particular query. In practice, queries may be ill-posed and there may be different shades of relevance.
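The set-based versions of these two measures can be sketched directly, assuming binary relevance judgments for a single query; the document identifiers below are invented:

```python
# Set-based precision and recall for a single query with binary
# ground-truth relevance judgments.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # 0.5 and 2/3: half the retrieved set is relevant,
             # and two of the three relevant documents were found
```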
Timeline: Before the 1900s, Joseph Marie Jacquard invented the Jacquard loom, the first machine to use punched cards to control a sequence of operations. In the 1950s, Kent and colleagues published a paper in American Documentation describing the precision and recall measures, as well as detailing a proposed "framework" for evaluating an IR system, which included statistical sampling methods for determining the number of relevant documents not retrieved.
Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation (see Cyril W. Cleverdon, Cranfield Collection of Aeronautics, Cranfield, England). Kent published Information Analysis and Retrieval.
Alvin Weinberg. Joseph Becker and Robert M. Hayes published a text on information retrieval: Becker, Joseph; Hayes, Robert Mayo. Information Storage and Retrieval: Tools, Elements, Theories. New York: Wiley. Project Intrex at MIT. Licklider published Libraries of the Future. John W.