Search and Ranking in CDI
This page discusses some of the search (standard and multilingual) and ranking features available in the Ex Libris Central Discovery Index (CDI).
For Primo Central (PC) records, the Search section in the PNX includes the data (including metadata and full text) that is indexed for searching. This is not the case in CDI. The fields listed in the Search section are searchable in CDI. However, they are not necessarily the only indexed fields that are searched (CDI may search additional fields that were not searchable in PC), and they may not be searched in the same way.
Standard Search Features
Phrase Search
Enclosing multiple query terms in double quotes ("…") limits results to phrase matches. For example, a search for "computational linguistics" (in double quotes) will return phrase matches with computational linguistics, but not linguistics and computational chemistry or computational chemistry and linguistics. However, a phrase search will match other variations of a word or character in the phrase. For example, a search for "neural network" in its singular form will also match neural networks in its plural form. A search for "street facade" would match both "street facade" and "street façade". Phrase searches can also be used for languages that do not use whitespace between words, such as Chinese, Japanese and Thai. For example, a search for "東京の歴史" (in double quotes) will match the exact phrase 東京の歴史 but not 東京の文化と歴史.
There are two exceptions to phrase matching, both related to stop words. The first exception is the handling of stop words in the full text field. Unlike metadata fields, the full text field is indexed without stop words. As a result, matches with stop words within a phrase are not guaranteed for the full text field. For example, a search for "research for motion" can match "research in motion" since "for" and "in" are English stop words. Because full text matches are ranked far lower than metadata matches, material with the exact phrase in the metadata will almost always outrank them in the result list. However, full text matches can become important if there are no or very few results with the exact phrase in the metadata, and they can lead to other relevant findings. On the downside, they contribute to a longer tail of results that may be less relevant or not relevant to the user's intentions. The second exception is that stop words placed at the end of a phrase are currently dropped from the phrase search. For example, a search for "there she was" drops the word "was" and matches phrases such as "there she is", since the word "was" is defined as a stop word for English.
Phrase searching also increases the effect of the verbatim match boost feature. The verbatim match boost feature is part of CDI's relevance ranking algorithm, which boosts the relevance scores of verbatim matches – namely, the matches that do not match via character normalization, stemming or other multilingual search features. For example, searching for "heavy metals" (in double quotes) will emphasize that phrase over heavy metal to a much larger degree than the non-phrase search for heavy metals (without double quotes).
Using this property, double quotes can be used even on a single term when it is important to emphasize verbatim matches. For example, if a search for résumé (without double quotes) is returning undesirable matches for resume as top results, enclosing the search term in double quotes (for example, "résumé") will further emphasize the verbatim match over the non-verbatim matches.
Clarification about "Exact" Matching
There are a few levels of "exact" matches between a search query and indexed text.
  1. Phrase matching - In this documentation, phrase matching or exact phrase matching refers to matches where the words in a phrase are in the same order between the search query and the indexed text. For example, the following is a phrase match: query = computational linguistics and indexed text = computational linguistics. But the following is not an exact phrase match: query = "computational linguistics" and indexed text = linguistics and computational chemistry.
  2. Verbatim matching - In this documentation, we refer to verbatim matching as word-level matching, where words did not match via stemming, character normalization, synonym mapping, or any other processes. For example, "English book" vs. "English books" is an exact phrase match, but it is not a verbatim match. 
  3. Exhaustive matching (or exact title matching, exact subject matching, and so forth) - This type of matching refers to phrase matches that also completely match field values. For example, query="American history" is an exact phrase match for the title "19th Century American History", but it is not an exhaustive match because the title contains additional words.
  4. Exact string matching - While all the above allow variations in casing (for example, "Book" vs. "book"), the number of spaces between words (for example, "computational linguistics" with one space vs. "computational  linguistics" with two spaces), and the use of punctuation symbols (for example, "Paris, Texas" vs. "Paris Texas"), exact string matching requires exact matches at the character-level. This type of matching is typically required for Identifier fields.
CDI's phrase searching supports "exact phrase matching" for the default search fields, such as the title, author, and abstract fields. 
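The four levels can be contrasted with a small Python sketch. It is only an illustration of the definitions above, not CDI's implementation: the function names are made up for this example, and toy_stem is a deliberately crude stand-in for real language-specific stemming.

```python
import re

def tokens(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"\w+", text.lower())

def toy_stem(token):
    """Hypothetical stemmer (assumption): only strips a trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def contains_sequence(haystack, needle):
    """True if needle occurs in haystack as a contiguous run of tokens."""
    n = len(needle)
    return n > 0 and any(haystack[i:i + n] == needle
                         for i in range(len(haystack) - n + 1))

def is_phrase_match(query, text):
    # Same word order, but word variants (here: plurals) are allowed.
    return contains_sequence([toy_stem(t) for t in tokens(text)],
                             [toy_stem(t) for t in tokens(query)])

def is_verbatim_match(query, text):
    # Same word order and no stemming; casing and punctuation still ignored.
    return contains_sequence(tokens(text), tokens(query))

def is_exhaustive_match(query, text):
    # The phrase must cover the whole field value.
    return ([toy_stem(t) for t in tokens(query)]
            == [toy_stem(t) for t in tokens(text)])

def is_exact_string_match(query, text):
    # Character-level equality, as typically needed for identifiers.
    return query == text
```

With these definitions, "English book" against "English books" is a phrase match but not a verbatim match, and "Paris, Texas" against "Paris Texas" is an exhaustive match but not an exact string match.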
Boolean Operators
CDI supports the following Boolean operators: AND, OR, and NOT. They must be written in all capital letters to ensure that they are interpreted as Boolean operators by the system.
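The capitalization rule can be pictured with the following Python sketch. It assumes a simple whitespace split and uses a hypothetical classify_tokens helper; it does not reflect how CDI actually parses queries.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}

def classify_tokens(query):
    """Label each whitespace-separated token as an operator or a search term.

    Only the all-capitals forms count as operators; "and" stays a term.
    """
    return [(tok, "operator" if tok in BOOLEAN_OPERATORS else "term")
            for tok in query.split()]
```

For example, classify_tokens("cats AND dogs") labels AND as an operator, while the lowercase and in "cats and dogs" is treated as an ordinary word.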
Wildcards
Searches in CDI can be performed using two wildcards: the question mark (?) and the asterisk (*). Wildcards cannot be used as the first character of a search, nor should a wildcard be used within double quotes (phrase search).
The question mark (?) will match any one character. For example, it can be used to find Olsen or Olson by searching for Ols?n, but it will not find Olsson because there are two characters between Ols and n in that name.
The question mark (?) does not work as a wildcard character at the end of a word. This avoids confusion when a question mark is used as a punctuation character. For example, the question mark in a search for who's afraid of virginia woolf? (with or without double quotes) will be interpreted as a punctuation mark, not as a wildcard. In this case, the final term will match woolf as most users would expect.
The asterisk (*) will match zero or more characters within a word or at the end of a word. A search for Ch*ter will match Charter, Character, and Chapter.
When used at the end of a word, the asterisk will allow all possible characters to be included so Temp* will match Temptation, Temple, and Temporary.
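The wildcard rules above can be sketched in Python by translating a search term into a regular expression. This is a minimal illustration under assumptions: the helper names are hypothetical, a wildcard is taken to stand for word characters only, and CDI's own implementation is not described in this document.

```python
import re

def wildcard_to_regex(term):
    """Translate one search term containing ? and * into a compiled regex."""
    if term[:1] in ("?", "*"):
        raise ValueError("a wildcard cannot be the first character of a search term")
    # A question mark at the end of a word is read as punctuation, so drop it.
    if term.endswith("?"):
        term = term.rstrip("?")
    parts = []
    for ch in term:
        if ch == "?":
            parts.append(r"\w")       # exactly one character
        elif ch == "*":
            parts.append(r"\w*")      # zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term, word):
    """True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None
```

Here matches("Ols?n", "Olsen") is true and matches("Ols?n", "Olsson") is false, while matches("woolf?", "woolf") is true because the trailing question mark is ignored.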
Query Expansion (Based on Control Vocabulary)
CDI's Query Expansion feature helps patrons find relevant literature by adding preferred terms from controlled vocabularies to their queries. For example, if a patron issues a search for heart attack, the query expansion feature will expand the search query to heart attack OR myocardial infarction, because myocardial infarction is the preferred term for heart attack in some of the controlled vocabularies, such as LCSH (Library of Congress Subject Headings) and MeSH (Medical Subject Headings).
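As a rough illustration, the expansion step can be sketched in Python with a lookup table. The PREFERRED_TERMS table and the expand_query function are hypothetical; the single entry is the example from the text, and a real system would load its mappings from the controlled vocabularies.

```python
# Hypothetical mapping of entry terms to preferred terms (assumption).
PREFERRED_TERMS = {"heart attack": "myocardial infarction"}

def expand_query(query):
    """Append the preferred term with OR when the query has a known mapping."""
    preferred = PREFERRED_TERMS.get(query.strip().lower())
    if preferred is None:
        return query
    return f"{query} OR {preferred}"
```

A query without a mapping, such as linguistics, is returned unchanged.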
Field Truncations
CDI provides protection against very large field values that could cause various search and display issues. Such large field values may be due to accidental bad metadata mapping. For example, if a Table of Contents field is accidentally mapped to the Title field in a record, it could cause slow response times, display issues, and ranking issues. Large field values are truncated, either by the number of entries or the number of characters, or both, depending on the field. For example, the title and subtitle fields have a limit of 500 characters. The reference field has a limit of 1,000 entries. The author and editor fields have a limit of 100 entries. We periodically review the limits and will adjust them as needed.
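The two kinds of limits can be sketched in Python as follows. The numbers are the ones quoted above; the dictionary keys, the function name, and the order in which the limits are applied are assumptions made for this example.

```python
# Limits quoted in the text; the field names used as keys are illustrative.
CHAR_LIMITS = {"title": 500, "subtitle": 500}
ENTRY_LIMITS = {"reference": 1000, "author": 100, "editor": 100}

def truncate_field(name, values):
    """Truncate a multi-valued field by entry count, then each value by length."""
    max_entries = ENTRY_LIMITS.get(name)
    if max_entries is not None:
        values = values[:max_entries]
    max_chars = CHAR_LIMITS.get(name)
    if max_chars is not None:
        values = [v[:max_chars] for v in values]
    return list(values)
```

A title of 800 characters would be cut to 500, and a record with 250 authors would keep the first 100.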
Multilingual Search Features
The Ex Libris Central Discovery Index (CDI) uses the Unicode standard, and allows searching in various languages whose writing systems are supported by the Unicode standard. In addition, it provides enhanced language-specific search features in many languages, including the following languages:
CDI uses several techniques to provide enhanced search capabilities in these languages. Some of the most important processes are listed below. These processes are applied to search results based on the language of each CDI record. For example, English search features (tokenization, stemming, and so forth) are applied to English records and German search features are applied to German records.
These techniques are described in detail in the following sections. In addition, the following section describes how these features play a role in CDI’s relevance ranking algorithm.
Verbatim Match Boost (all languages)
Multilingual Search Architecture
CDI indexes the "analyzed" or "normalized" forms of words instead of the "surface" forms of the words. For example, the word books is indexed as its dictionary form book, instead of its surface form books. At search time, books used in a search query is also normalized as book. This makes the two forms book and books cross-searchable. Please note that the analyzed/normalized forms are internal data representations, and not what users see in the UI display. Users still see the original field values (in this example, books) in the UI display.
For example, book vs. books:
  1. Index Time: 
    1. books → book (normalized according to the language of the record)
    2. book → book (normalized according to the language of the record)
  2. Search Time: 
    1. books → book (normalized according to the language of the record)
    2. book → book (normalized according to the language of the record)
This approach has several advantages:
CDI is a dynamic index (i.e., updated frequently), and that allows the Ex Libris development team to update the text analysis (normalization) algorithms for both index time and search time to improve CDI's search and ranking features.
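The index-time and search-time symmetry described above can be sketched with a tiny inverted index in Python. Everything here is an assumption for illustration: the TinyIndex class, the crude normalize function, and the whitespace tokenization are not CDI's actual components.

```python
from collections import defaultdict

def normalize(token):
    """Toy English normalizer (assumption): lowercase and strip a plural 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # normalized token -> record ids
        self.records = {}                  # record id -> original field value

    def add(self, record_id, text):
        self.records[record_id] = text     # the original value is kept for display
        for token in text.split():
            self.postings[normalize(token)].add(record_id)

    def search(self, query):
        # The query goes through the same normalizer as the indexed text.
        ids = [self.postings.get(normalize(t), set()) for t in query.split()]
        hits = set.intersection(*ids) if ids else set()
        return [self.records[i] for i in sorted(hits)]
```

Because both sides use the same normalize function, searching for book or books returns the same records, and the returned values are the original, unnormalized strings.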
 
Tokenization
Tokenization is the process of breaking a stream of letters or text into words, phrases, or meaningful elements. Tokenization is part of CDI's language-specific text analysis, which is performed at both index time and search time, and resulting tokens constitute the smallest searchable unit in CDI.
In most languages, words are separated by white space or punctuation, so tokenization is a simple process for those languages. However, in languages such as Chinese, Japanese and Thai, words are not separated by white space. For these languages, CDI's text analysis uses sophisticated techniques to identify word boundaries, and uses that information to perform tokenization.
Examples of Tokenization:
“Black cat” becomes the two searchable units “black” and “cat”; “梵文基础读本” becomes the three searchable units “梵文”, “基础” and “读本”; and “東京タワー” becomes the two searchable units “東京” and “タワー”.
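The examples above can be reproduced with a short Python sketch. Whitespace splitting handles the English case; for text without spaces, the sketch uses greedy longest-match segmentation against a toy lexicon. This is a stand-in technique chosen for brevity, not the method CDI uses, and the LEXICON contents are an assumption.

```python
# Toy lexicon (assumption); real segmenters use far larger dictionaries
# and statistical models.
LEXICON = {"梵文", "基础", "读本", "東京", "タワー"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def tokenize_whitespace(text):
    """Tokenize languages that separate words with white space."""
    return text.lower().split()

def tokenize_longest_match(text):
    """Greedy longest-match segmentation for text without word separators."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon word fits.
            if candidate in LEXICON or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens
```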
Decompounding
Compound words are words that consist of multiple components that can stand as individual words on their own. In languages such as German, Swedish, and Danish, compound words are spelled without white space, and as a result, they can be very long.
Decompounding is the process of finding constituent parts in a compound word. CDI performs this process for languages such as German, Swedish, Danish and Korean. This process allows the patron to search for those constituent parts and get matches on the compound word.
Example:
Searching for the German words abwasser anlagen (wastewater plants in English) returns results matching the compound word abwasserbehandlungsanlage (wastewater treatment plant in English).
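A dictionary-based decompounder for this example might look like the Python sketch below. The lexicon, the handling of the German linking element s, and the crude plural handling in query_matches_compound are all assumptions for illustration; CDI's decompounding logic is not published here.

```python
LEXICON = {"abwasser", "behandlung", "anlage"}   # toy lexicon (assumption)
LINKING = ("", "s")                               # optional linking element

def decompound(word):
    """Return the constituent words of a compound, or [word] if no split is found."""
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        for end in range(len(rest), 0, -1):       # prefer the longest constituent
            head = rest[:end]
            if head not in LEXICON:
                continue
            for link in LINKING:
                if rest[end:].startswith(link):
                    tail = split(rest[end + len(link):])
                    if tail is not None:
                        return [head] + tail
        return None

    parts = split(word)
    return parts if parts else [word]

def query_matches_compound(query, compound):
    """True if every query term is a constituent of the compound."""
    parts = set(decompound(compound))
    # Toy plural handling (assumption): drop a final "n" if that gives a lexicon word.
    terms = [t[:-1] if t.endswith("n") and t[:-1] in LEXICON else t
             for t in query.lower().split()]
    return all(t in parts for t in terms)
```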
Stemming/Lemmatization
Stemming is the process of reducing inflected (or sometimes derived) words to their stems, or the root forms. Lemmatization is the process of converting various forms of a word to its dictionary form. Despite the slight differences, these processes have the same goals, and these terms are often used interchangeably. CDI performs language-specific stemming or lemmatization to allow the patron to search for a form of a word and get matches on other forms of the same word.
Examples:
In the first example above, searches for the word book will return results for both book and books. Searches for grande maison will return results for both grande maison and grandes maisons in French records.
Character Normalization
Character normalization is the process of normalizing variants of a character to its basic version. Characters with diacritics are, for example, normalized to the characters without diacritics. CDI also provides character normalization for variants of Chinese characters.
Character normalization allows the patron to search for a word containing a diacritic and get results on the word without the diacritic, and vice versa. Similarly, it allows the patron to search for a Chinese word using the traditional characters, and get hits on the word spelled with the simplified characters, and vice versa. The character normalization mappings are mostly the same across all languages, but in some cases, language specific character normalization mappings are defined.
Examples:
The Chinese search for 大學 will return results for 大学, and the Spanish search for Mexico will return results for México.
In some cases, CDI allows for multiple ways to represent a character with a diacritic. For example, the German umlauts ä, ö, and ü can be spelled without the diacritic as ae, oe and ue, or a, o, and u. CDI allows both variations. This allows the patron to search for schoen or schon and get results matching on schön. Another example is the Spanish ñ, which can be searched for by using ñ, n, or ni. This allows the query terms Espanol and Espaniol to return results matching on Español.
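Diacritic folding of this kind can be sketched in Python with the standard unicodedata module. The function names are hypothetical, the sketch covers only diacritics and the German umlaut digraphs, and it does not attempt the Chinese traditional/simplified mapping or the Spanish ni variant.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters and drop the combining marks (é -> e, ñ -> n)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

UMLAUT_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue"}

def german_variants(word):
    """Return the spellings under which a German word could be found."""
    word = unicodedata.normalize("NFC", word.lower())
    digraph_form = "".join(UMLAUT_DIGRAPHS.get(ch, ch) for ch in word)
    return {word, strip_diacritics(word), digraph_form}
```

Under this sketch, german_variants("schön") contains both schon and schoen, so either query spelling could be matched against the indexed word.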
Transliteration
Transliteration is the conversion of text from one script to another. This process allows the patron to search in one script and get hits on the same words written in another script.
CDI currently provides transliteration search features for Chinese (Hanzi-Pinyin), Japanese (Kanji/Katakana-Hiragana) and Korean (Hanja-Hangul) for titles and author names. Chinese Pinyin transliterations can be written with spaces between words (for example, beijing daxue), or with spaces at the Hanzi-character boundaries or syllable boundaries (for example, bei jing da xue).
Examples:
The Chinese query beijingdaxue ("Peking University" in Pinyin transliteration) would return results containing the string 北京大学 ("Peking University" in Hanzi script).
The same Chinese query written as beijing daxue or bei jing da xue (use double quotes for better results) would also return results containing the string 北京大学 ("Peking University" in Hanzi script).
The Japanese query なつめそうせき ("Natsume Souseki" in Hiragana script) would return results containing the string 夏目漱石 ("Natsume Souseki" in Kanji script).
The Korean query 경제 ("economy" in Hangul script) would return results containing the string 經濟 ("economy" in Hanja script).
If a search is performed using transliteration, then transliterated search results are not necessarily the first results to display.
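One way to picture how the different Pinyin spacings can reach the same Hanzi string is to reduce both sides to a common key, as in the Python sketch below. The four-character PINYIN table and the pinyin_key function are assumptions for illustration only, and the sketch says nothing about how CDI actually stores or matches transliterations.

```python
# Toy Hanzi-to-Pinyin table (assumption); a real table covers thousands of characters.
PINYIN = {"北": "bei", "京": "jing", "大": "da", "学": "xue"}

def pinyin_key(text):
    """Map Hanzi to Pinyin syllables and drop spaces so spacing variants collide."""
    return "".join(PINYIN.get(ch, ch) for ch in text if not ch.isspace()).lower()
```

With this key, 北京大学, beijingdaxue, beijing daxue, and bei jing da xue all reduce to the same string.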
Elision Handling
Elision in this case refers to the omission of a final vowel of a word when the following word begins with a vowel, and is observed in languages such as French and Italian.
For example, in French, the word sequence le + arbre becomes l'arbre. In Italian, lo + amico becomes l’amico.
CDI’s elision handling allows the patron to search for amico and get hits on l’amico.
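Elision handling of this kind can be sketched in Python as stripping a short elided prefix before an apostrophe. The list of prefixes is an illustrative subset chosen for this example, not CDI's actual list, and strip_elision is a hypothetical name.

```python
import re

# Elided forms before a straight or typographic apostrophe (illustrative subset).
ELISION = re.compile(r"^(?:l|d|j|qu|un|dell|all)['’]", re.IGNORECASE)

def strip_elision(token):
    """Remove a leading elided article or preposition from a token."""
    return ELISION.sub("", token)
```

If both indexed text and queries pass through strip_elision, a search for amico and the indexed l’amico reduce to the same token.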
Synonym Mapping and Spelling Normalization
CDI provides language-specific simple synonym mappings and spelling normalization. For example, in English, the words theater and theatre are two spellings of the same word. These are normalized during CDI's English text analysis, and as a result, the patron can search using one of these spellings and get hits on both spellings. Language-specific synonyms are also defined for cases where two words have the same meaning.
Examples:
In addition, the ampersand (&) is equated with the appropriate word for the word and in each language.
Handling of Ampersand ("&") Character
The ampersand character ("&") is a synonym of and, et, und, or other equivalent words in CDI's supported languages. For example, this allows cats and dogs and cats & dogs to be cross-searched in English documents.
The synonym mapping is performed according to the language of each record. For example, & is mapped to and in English records, and & is mapped to et in French records. As a result, the number of results between the search queries cats and dogs and cats & dogs may not be the same because cats and dogs may appear in non-English records. Similarly, the number of results between chats et chiens and chats & chiens may not be the same because chats et chiens may appear in non-French records.
Currently, these mappings apply in all fields except the author field. 
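The per-language mapping and the author-field exception can be sketched in Python as follows. The language codes, the table contents beyond the three words named above, and the function signature are assumptions for this example.

```python
# Word that "&" is equated with, per record language (illustrative subset).
AMPERSAND_WORD = {"en": "and", "fr": "et", "de": "und"}

def normalize_ampersand(tokens, language, field):
    """Replace "&" with the conjunction of the record's language, except in author fields."""
    if field == "author":
        return list(tokens)
    word = AMPERSAND_WORD.get(language)
    if word is None:
        return list(tokens)
    return [word if t == "&" else t for t in tokens]
```

Because the replacement depends on the record language, the same query token & is read as and in an English record and as et in a French record.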
Stop Words
A stop word is a word that acts as a function word (such as a definite/indefinite article, preposition, pronoun, conjunction, or auxiliary verb), occurs very frequently in CDI, and does not have a common secondary meaning as a content word.
CDI maintains language-specific lists of stop words, which are filtered out in the execution of searches except when they are part of formal phrase searches as described below. Stop words are chosen according to the following basic criteria.
CDI's current English stop words include a, an, the, and, but, or, it, of, on, with, in, is, and are, but they do not include the word will since it has a common secondary meaning as a noun.
In general, CDI ignores stop words in queries in order to improve the accuracy and efficiency of the search. However, in a phrase search (with search terms in double quotes), all stop words become required words, except those appearing at the end of the phrase. For example, the query man of the year includes two English stop words of and the. If this query is issued without double quotes (i.e., man of the year), it returns results containing the words man and year, and CDI's relevance algorithm boosts the ranking of results that contain the phrase man of the year.
If the query is issued as a phrase search with double quotes (for example, "man of the year"), CDI returns results containing the exact phrase and does not exclude the stop words. However, there are some limitations for phrase matches in the full text field, and for phrases that end with stop words. For more details, see Phrase Search.
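The difference between the two query forms can be sketched in Python. The stop word set below is the subset listed in this document plus for and was, which are named as stop words in the Phrase Search section; the required_terms function and its simple quote detection are assumptions for illustration.

```python
# English stop words named in this document (not CDI's complete list).
ENGLISH_STOP_WORDS = {"a", "an", "the", "and", "but", "or", "it", "of", "on",
                      "with", "in", "is", "are", "for", "was"}

def required_terms(query):
    """Return the terms a result must contain for the given query."""
    query = query.strip()
    is_phrase = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    tokens = query.strip('"').lower().split()
    if not is_phrase:
        # Non-phrase search: stop words are ignored.
        return [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    # Phrase search: stop words are required, except at the end of the phrase.
    while tokens and tokens[-1] in ENGLISH_STOP_WORDS:
        tokens.pop()
    return tokens
```

The unquoted query man of the year reduces to man and year, the quoted form keeps all four words, and the quoted phrase "there she was" loses only its final word.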
Verbatim Match Boost
This is one of the most important features of CDI's native language search support.  Many of the features described in this document allow the patron to get results when the query terms and the indexed terms are equivalent, but not exactly the same word for word (verbatim).  These features have an effect of increasing the number of results, or in other words, increasing the search recall.  While these features will provide a better user experience to the patrons, there is a risk of including less relevant or non-relevant results, and reducing the search precision.
The Verbatim Match Boost feature addresses this concern by boosting the relevance score of a result when the matching of the query terms and the indexed terms is verbatim or near verbatim.
Example:
For the English search query theatres, the results for theatres get higher relevance scores (and are ranked higher) than the results for theaters or theatre if all other factors that contribute to the relevance scoring calculation are equal.
The Verbatim Match Boost feature is applied to almost all processes mentioned in this support document. The actual implementation of this feature works by penalizing non-verbatim matches, thus effectively boosting the verbatim matches. Penalties are computed for each term/token, and the penalty amount is defined for each process in each language, such that major differences, such as synonyms, are penalized more than minor differences, such as spelling differences and singular vs. plural forms of nouns.
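The penalty idea can be sketched in Python as below. The penalty values, the subtraction formula, and the match-kind labels are entirely hypothetical; the document states only that penalties are computed per token and that larger differences are penalized more than smaller ones.

```python
# Hypothetical penalty per kind of non-verbatim match (assumption).
PENALTY = {"verbatim": 0.0, "plural": 0.05, "spelling": 0.05, "synonym": 0.30}

def adjusted_score(base_score, match_kinds):
    """Subtract one penalty per query token, according to how that token matched."""
    return base_score - sum(PENALTY[kind] for kind in match_kinds)
```

For the query theatres, a record matching theatres would be scored with ["verbatim"] and keep its full score, while a record matching theaters would be scored with ["spelling"] and rank slightly lower, all else being equal.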
Relevance Ranking
When a patron issues a search query in Primo backed by the Ex Libris Central Discovery Index (CDI), the query is issued to both the local index and CDI. Search results from each of these indexes are ranked according to their relevance ranking algorithms, and they are blended to form the final search results presented to the patron. This document discusses the relevance ranking algorithm used by CDI.
Relevance ranking in CDI is determined according to a continuously tuned, proprietary algorithm, and is built on a foundation of two building blocks: the Dynamic Rank and the Static Rank. The Dynamic Rank is a collection of relevance factors that represent how well a search query matches each record, and the Static Rank is a collection of relevance factors that represent the value or importance of each record. Both of these are important in determining the ranking, and top results need to have good scores from both Dynamic Rank and Static Rank.
Dynamic Rank
The Dynamic Rank represents how well the user's query matches each record.  Dynamic Rank factors include the following:
Static Rank
The Static Rank represents the value of each item, and does not pertain to the user's query terms.  Static Rank factors include the following:
Each record's Static Rank score is determined as a combination of scores calculated from these factors, using carefully designed mathematical functions. For example, a journal article published 5 years ago with 100 citations would probably have a higher Static Rank score than a journal article published 6 months ago with 0 citations. In this case, the benefit of the high citation counts of the first record outweighs the benefit of the recency of the second record.
The scores from Dynamic Rank and Static Rank are then combined to determine the relevance score of each record for the given query. The ranking of a search result set is determined by the final relevance scores of the records in the result set.
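Since the algorithm is proprietary, the following Python sketch is purely hypothetical. It only shows one plausible shape for combining a citation factor and a recency factor into a Static Rank and then combining that with a Dynamic Rank; the functions, weights, and formulas are invented for this example and are not CDI's.

```python
import math

def static_rank(citations, age_years):
    """Hypothetical Static Rank: citation count plus recency (assumption)."""
    citation_part = math.log1p(citations)        # diminishing returns on citations
    recency_part = math.exp(-age_years / 5.0)    # newer records score closer to 1
    return citation_part + recency_part

def relevance(dynamic_rank, static):
    """Hypothetical combination: a record needs a query match to rank at all."""
    return dynamic_rank * (1.0 + static)
```

Under these made-up functions, the five-year-old article with 100 citations gets a higher Static Rank than the six-month-old article with none, mirroring the example above, and a record with a Dynamic Rank of zero stays at zero regardless of its Static Rank.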
CDI's relevance ranking algorithm is tuned to provide the best search experience for both known item searching and other types of searching (for example, subject searching, exploratory searching, topical searching, existence searching, unknown item searching, and so forth). Additionally, there are aspects of CDI relevance that assist a user community ranging from the novice researcher to the professional researcher and all user types in between. For example, short and general topical queries (for example, linguistics or global warming) tend to return more books, eBooks, references and journals among the top results, and long and specific topical queries (for example, linguistics universal grammar or global warming Kyoto protocol) tend to return more articles among the top results.
CDI overlays this foundation with a regimen of judgments to ensure that relevance as a whole remains strong as individual pieces of the system are improved. The relevance ranking system in CDI is shared by all customers, and is not customizable for individual institutions.
2021 Ex Libris. All rights reserved