This page discusses some of the search (standard and multilingual) and ranking features available in the Ex Libris Central Discovery Index (CDI).
For Primo Central records, the Search section in the PNX includes the data (including metadata and full text) that is indexed for searching. This is not the case in CDI. The fields listed in the search section are searchable. However, this does not mean that these fields are the only indexed fields that will be searched (there may be more fields not searchable in PC that are searched in CDI) and they may not be searched in the same way.
Standard Search Features
Enclosing multiple query terms in double quotes (“…") limits results to phrase matches. For example, a search for "computational linguistics" (in double quotes) will return phrase matches with computational linguistics, but not linguistics and computational chemistry or computational chemistry and linguistics. However, a phrase search will match other variations of a word or character in the phrase. For example, a search for “neural network” in its singular form will also match neural networks in its plural form. A search for "street facade" would match both "street facade" and "street façade". Phrase searches can be used for languages that do not use whitespace between words, such as Chinese, Japanese and Thai. For example, a search for "東京の歴史" (in double quotes) will match the exact phrase 東京の歴史 but not 東京の文化と歴史.
There are two exceptions to phrase matching, both related to stop words. The first exception is the handling of stop words in the full text field. Unlike metadata fields, the full text field is indexed without stop words. As a result, matches with stop words within a phrase are not guaranteed for the full text field. For example, a search for "research for motion" can match "research in motion" since "for" and "in" are English stop words. As full text matches are ranked far lower than metadata matches, material with the exact phrase in the metadata will almost always outrank them in the result list. However, full text matches can become important if there are no or very few results with the exact phrase in the metadata, and they can lead to other relevant findings. On the downside, they contribute to a longer tail of results that may be less relevant or not relevant to the users’ intentions. The second exception is that stop words placed at the end of a phrase are currently dropped from the phrase search. For example, a search for "there she was" drops the word "was" and matches phrases such as "there she is", since the word "was" is defined as a stop word for English.
Phrase searching also increases the effect of the verbatim match boost feature. The verbatim match boost feature is part of CDI's relevance ranking algorithm, which boosts the relevance scores of verbatim matches – namely, the matches that do not match via character normalization, stemming or other multilingual search features. For example, searching for "heavy metals" (in double quotes) will emphasize that phrase over heavy metal to a much larger degree than the non-phrase search for heavy metals (without double quotes).
Using this property, double quotes can be used even on a single term when it is important to emphasize verbatim matches. For example, if a search for résumé (without double quotes) is returning undesirable matches for resume as top results, enclosing the search term in double quotes (for example, "résumé") will further emphasize the verbatim match over the non-verbatim matches.
Clarification about "Exact" Matching
There are a few levels of "exact" matches between a search query and indexed text.
Phrase matching - In this documentation, phrase matching or exact phrase matching refers to matches where the words in a phrase are in the same order between the search query and the indexed text. For example, the following is a phrase match: query = computational linguistics and indexed text = computational linguistics. But the following is not an exact phrase match: query = "computational linguistics" and indexed text = linguistics and computational chemistry.
Verbatim matching - In this documentation, we refer to verbatim matching as word-level matching, where words did not match via stemming, character normalization, synonym mapping, or any other processes. For example, "English book" vs. "English books" is an exact phrase match, but it is not a verbatim match.
Exhaustive matching (or exact title matching, exact subject matching, and so forth) - This type of matching refers to phrase matches that also completely match field values. For example, query="American history" is an exact phrase match for the title "19th Century American History", but it is not an exhaustive match because the title contains additional words.
Exact string matching - While all the above allow variations in casing (for example, "Book" vs. "book"), the number of spaces between words (for example, "computational linguistics" with one space vs. "computational  linguistics" with two spaces), and the use of punctuation symbols (for example, "Paris, Texas" vs. "Paris Texas"), exact string matching requires exact matches at the character level. This type of matching is typically required for Identifier fields.
CDI's phrase searching supports "exact phrase matching" for the default search fields, such as the title, author, and abstract fields.
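The four levels of "exact" matching described above can be illustrated with a small sketch. This is not CDI's implementation: the helper names and the toy normalizer (which only strips diacritics and a trailing "s") are assumptions made for illustration.

```python
import re
import unicodedata

def tokens(text):
    """Lowercase the text and split it into word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())

def toy_normalize(token):
    """Toy normalizer (assumption): strip diacritics and a trailing plural 's'."""
    decomposed = unicodedata.normalize("NFD", token)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base[:-1] if base.endswith("s") and len(base) > 3 else base

def contains_sequence(haystack, needle):
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )

def match_levels(query, field_value):
    """Report which of the four match levels hold for a query and a field value."""
    q, f = tokens(query), tokens(field_value)
    qn = [toy_normalize(t) for t in q]
    fn = [toy_normalize(t) for t in f]
    return {
        "phrase": contains_sequence(fn, qn),    # same word order, after normalization
        "verbatim": contains_sequence(f, q),    # same words without normalization
        "exhaustive": qn == fn,                 # phrase match covering the whole field
        "exact_string": query == field_value,   # character-level identity
    }
```

For instance, `match_levels("English book", "English books")` reports a phrase match that is not verbatim, and `match_levels("American history", "19th Century American History")` reports a phrase match that is not exhaustive.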
CDI supports the following Boolean operators: AND, OR, and NOT. They must be written in all capital letters to ensure that they are interpreted as Boolean operators by the system.
The AND operator – When there is no explicit Boolean operator between two terms, the AND operator is assumed. For example, if you search for earthquake fault, you will get the same result set as when you search for earthquake AND fault. Note that the relevance ranking of the result set may be different since the first search applies higher relevance scores to phrase matches.
The OR operator – This operator can be used when only one of multiple search terms needs to match. Example: cats OR felines
The NOT operator – This operator is always applied to the term or Boolean expression that is immediately following the operator. The NOT operator is normally used with another term or expression to exclude certain matches. It can be used in the following ways:
dogs NOT cats
dogs NOT (cats)
dogs AND NOT cats
dogs AND NOT (cats)
Defining precedence of Boolean expressions – Parentheses are used to group Boolean expressions, and they define the precedence of Boolean expressions. A general rule of thumb is to always use parentheses when there is any ambiguity in a Boolean expression. Examples: cats AND (dogs OR raccoons), (cats AND dogs) OR raccoons.
Boolean searches in German language UI – Conforming to a standard practice in German-language search engines, when the German UI is selected, the Boolean operator words UND, ODER, and NICHT act as alternatives to AND, OR, and NOT. The English operators continue to work in the German UI.
Boolean search and CDI relevance ranking algorithm – Boolean queries get processed by the same relevance ranking algorithm as any other query.
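The operator rules above can be sketched as a small query normalizer that makes implicit AND operators explicit and, optionally, maps the German operator words. This is an illustrative sketch only; the function name and the token handling are assumptions, and treating dogs NOT cats as dogs AND NOT cats follows the equivalent forms listed above rather than any documented internal behavior.

```python
import re

GERMAN_OPERATORS = {"UND": "AND", "ODER": "OR", "NICHT": "NOT"}
OPERATORS = {"AND", "OR", "NOT"}

def normalize_boolean(query, german_ui=False):
    """Return the query with implicit ANDs written out.

    Only all-capital operator words are treated as operators; lowercase
    'and', 'or', and 'not' are ordinary search terms.
    """
    raw = re.findall(r"\(|\)|[^\s()]+", query)
    toks = [GERMAN_OPERATORS.get(t, t) if german_ui else t for t in raw]
    out = []
    for tok in toks:
        if out:
            prev = out[-1]
            # The previous token ends an operand if it is a term or ")".
            left_is_operand = prev == ")" or (prev not in OPERATORS and prev != "(")
            # The current token starts an operand if it is a term, "(", or NOT.
            right_starts_operand = tok in ("(", "NOT") or (
                tok not in OPERATORS and tok != ")"
            )
            if left_is_operand and right_starts_operand:
                out.append("AND")
        out.append(tok)
    return " ".join(out)
```

With this sketch, `normalize_boolean("earthquake fault")` yields `earthquake AND fault`, and `normalize_boolean("katzen ODER hunde", german_ui=True)` yields `katzen OR hunde`.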
Searches in CDI can be performed using two wildcards: the question mark (?) and the asterisk (*). Wildcards cannot be used as the first character of a search, nor should a wildcard be used within double quotes (phrase search).
The question mark (?) will match any one character. For example, it can be used to find Olsen or Olson by searching for Ols?n, but it will not find Olsson because there are two characters between Ols and the final n in that name.
The question mark (?) does not work as a wildcard character at the end of a word. This is to avoid confusion when a question mark is used as a punctuation character. For example, the question mark in a search for who's afraid of virginia woolf? (with or without double quotes) will be interpreted as a punctuation mark, not as a wildcard. In this case, the final term will match woolf as most users would expect.
The asterisk (*) will match zero or more characters within a word or at the end of a word. A search for Ch*ter will match Charter, Character, and Chapter.
When used at the end of a word, the asterisk will allow all possible characters to be included so Temp* will match Temptation, Temple, and Temporary.
The use of wildcards within a phrase search is not supported.
A wildcard search does not necessarily return more results than the same search without the wildcard. This is because CDI’s multilingual search features, such as stemming/lemmatization, synonym mapping and spelling normalization, do not apply to the wildcard search. For example, a keyword search for archaeology may return more results than the wildcard search for archaeolog*, since the former matches both archaeology and archeology via CDI's English spelling normalization feature, but the latter matches only archaeology and not archeology.
The use of a wildcard does not necessarily improve relevance ranking. In some cases, it could hurt relevance ranking as some relevance factors, such as the phrase match boosting and term weighting, do not apply to wildcard searches.
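The wildcard rules above can be approximated by translating a wildcard term into a regular expression. The sketch below is an assumption-laden illustration (case-insensitive matching against single words, hypothetical function names), not a description of how CDI evaluates wildcards.

```python
import re

def wildcard_to_regex(term):
    """Translate a wildcard term into a compiled, case-insensitive regex."""
    if term and term[0] in "?*":
        raise ValueError("a wildcard cannot be the first character of a search")
    if term.endswith("?"):
        # A trailing question mark is treated as punctuation, not as a wildcard.
        term = term[:-1]
    parts = []
    for ch in term:
        if ch == "?":
            parts.append(r"\w")    # exactly one character
        elif ch == "*":
            parts.append(r"\w*")   # zero or more characters within the word
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_match(term, word):
    """True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None
```

Here `wildcard_match("Ols?n", "Olson")` succeeds while `wildcard_match("Ols?n", "Olsson")` does not, and `wildcard_match("woolf?", "woolf")` succeeds because the trailing question mark is dropped.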
Query Expansion (Based on Controlled Vocabulary)
CDI's Query Expansion feature helps patrons find relevant literature by adding preferred terms from controlled vocabularies to patrons’ queries. For example, if a patron issues a search for heart attack, the query expansion feature will expand the search query to heart attack OR myocardial infarction, because myocardial infarction is the preferred term for heart attack in some of the controlled vocabularies, such as LCSH (Library of Congress Subject Headings) and MeSH (Medical Subject Headings).
The Query Expansion feature will not expand phrase searches (in double quotes).
The Query Expansion feature will not expand terms that are very commonly used. For example, it will not expand AIDS to acquired immunodeficiency syndrome since the term AIDS is commonly used in literature.
The Query Expansion feature will not expand terms in long queries.
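The expansion rules above can be sketched as follows. The vocabulary table, the list of commonly used terms, and the length threshold (`MAX_QUERY_TERMS`) are all hypothetical values chosen for illustration; the actual vocabularies and limits used by CDI are not stated here.

```python
# Hypothetical data, for illustration only.
PREFERRED_TERMS = {
    "heart attack": "myocardial infarction",
    "aids": "acquired immunodeficiency syndrome",
}
COMMON_TERMS = {"aids"}   # commonly used terms are left unexpanded
MAX_QUERY_TERMS = 5       # assumed cutoff for a "long" query

def expand_query(query):
    """Add a preferred term to the query unless one of the exclusions applies."""
    text = query.strip()
    if len(text) > 1 and text.startswith('"') and text.endswith('"'):
        return query                      # phrase searches are not expanded
    words = text.split()
    if len(words) > MAX_QUERY_TERMS:
        return query                      # long queries are not expanded
    lowered = [w.lower() for w in words]
    for term, preferred in PREFERRED_TERMS.items():
        if term in COMMON_TERMS:
            continue
        parts = term.split()
        for i in range(len(lowered) - len(parts) + 1):
            if lowered[i:i + len(parts)] == parts:
                matched = " ".join(words[i:i + len(parts)])
                group = f"{matched} OR {preferred}"
                if len(words) > len(parts):
                    group = f"({group})"  # keep the OR scoped to the matched term
                return " ".join(words[:i] + [group] + words[i + len(parts):])
    return query
```

For example, `expand_query("heart attack")` returns `heart attack OR myocardial infarction`, while the quoted query `"heart attack"` and the query `AIDS` are returned unchanged.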
CDI provides protection against very large field values that could cause various search and display issues. Such large field values may be due to accidental bad metadata mapping. For example, if a Table of Contents field is accidentally mapped to the Title field in a record, it could cause slow response times, display issues, and ranking issues. Large field values are truncated, either by the number of entries or the number of characters, or both, depending on the field. For example, the title and subtitle fields have a limit of 500 characters. The reference field has a limit of 1,000 entries. The author and editor fields have a limit of 100 entries. We periodically review the limits and will adjust them as needed.
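The truncation described above might look like the following sketch. The limits are the ones quoted in the paragraph; the record layout (strings for character-limited fields, lists for entry-limited fields) and the function name are assumptions.

```python
# Limits quoted above; these may change as the limits are reviewed.
CHAR_LIMITS = {"title": 500, "subtitle": 500}
ENTRY_LIMITS = {"reference": 1000, "author": 100, "editor": 100}

def truncate_record(record):
    """Return a copy of the record with oversized field values truncated."""
    result = {}
    for field, value in record.items():
        if isinstance(value, list):
            limit = ENTRY_LIMITS.get(field)      # limit on the number of entries
            result[field] = value[:limit] if limit is not None else list(value)
        else:
            limit = CHAR_LIMITS.get(field)       # limit on the number of characters
            result[field] = value[:limit] if limit is not None else value
    return result
```

A record whose title accidentally contains a full table of contents would come back with a 500-character title, while fields without a configured limit are returned unchanged.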
Multilingual Search Features
The Ex Libris Central Discovery Index (CDI) uses the Unicode standard, and allows searching in various languages whose writing systems are supported by the Unicode standard. In addition, it provides enhanced language-specific search features in many languages, including the following languages:
Chinese (Simplified and Traditional)
CDI uses several techniques to provide enhanced search capabilities in these languages. Some of the most important processes are listed below. These processes are applied to search results based on the language of each CDI record. For example, English search features (tokenization, stemming, and so forth) are applied to English records and German search features are applied to German records.
Synonym Mapping and Spelling Normalization
These techniques are described in detail in the following sections. In addition, the following section describes how these features play a role in CDI’s relevance ranking algorithm.
Verbatim Match Boost (all languages)
Multilingual Search Architecture
CDI indexes the "analyzed" or "normalized" forms of words instead of the "surface" forms of the words. For example, the word books is indexed as its dictionary form book, instead of its surface form books. At search time, books used in a search query is also normalized as book. This makes the two forms book and books cross-searchable. Please note that the analyzed/normalized forms are internal data representations, and not what users see in the UI display. Users still see the original field values (in this example, books) in the UI display.
For example, book vs. books:
books → book (normalized according to the language of the record)
book → book (normalized according to the language of the record)
books → book (normalized according to the language of the record)
book → book (normalized according to the language of the record)
This approach has several advantages:
It is suitable for supporting morphologically rich languages, such as French, German, Japanese, Arabic and Hebrew, as well as languages with extensive writing systems, such as Chinese, Korean, and Japanese. For example, each Chinese character may have several variations and that can be easily supported by this approach.
It allows for true multilingual search capabilities. Search queries are normalized according to the language of each record, and every query is compared against all records using their language-specific search features. For example, the word "kind" in a search query can match "kinds" in English documents and "Kinder" (children) in German documents. This benefits bilingual or multilingual users regardless of the primary language or location of the library.
It is scalable, and it allows for the addition of support for new languages without affecting the relevance ranking of records in the languages that are already supported by CDI.
CDI is a dynamic index (i.e., updated frequently), and that allows the Ex Libris development team to update the text analysis (normalization) algorithms for both index time and search time to improve CDI's search and ranking features.
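The architecture described above, in which both indexed text and queries are reduced to normalized forms according to the language of each record, can be sketched with a toy inverted index. The lemma tables, class name, and record layout below are hypothetical and stand in for real language-specific analyzers.

```python
from collections import defaultdict

# Toy lemma tables (assumption); real analyzers are far more complete.
LEMMAS = {
    "en": {"books": "book", "kinds": "kind"},
    "de": {"kinder": "kind", "bücher": "buch"},
}

def normalize(word, lang):
    """Reduce a word to its normalized form for the given language."""
    w = word.lower()
    return LEMMAS.get(lang, {}).get(w, w)

class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # (lang, normalized form) -> record ids
        self.records = {}                  # record id -> original text for display

    def add(self, record_id, lang, text):
        """Index time: store normalized forms, keep the original text."""
        self.records[record_id] = text
        for word in text.split():
            self.postings[(lang, normalize(word, lang))].add(record_id)

    def search(self, term):
        """Search time: normalize the term once per language, then look it up."""
        hits = set()
        for lang in LEMMAS:
            hits |= self.postings.get((lang, normalize(term, lang)), set())
        return sorted(hits)
```

With an English record "Kinds of books" and a German record "Kinder lesen" indexed, a search for kind returns both records, while the stored display text of each record is unchanged.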
Tokenization is the process of breaking a stream of letters or text into words, phrases, or meaningful elements. Tokenization is part of CDI's language-specific text analysis, which is performed at both index time and search time, and resulting tokens constitute the smallest searchable unit in CDI.
In most languages, words are separated by white space or punctuation, so tokenization is a simple process for those languages. However, in languages such as Chinese, Japanese and Thai, words are not separated by white space. For these languages, CDI's text analysis uses sophisticated techniques to identify word boundaries, and uses that information to perform tokenization.
Examples of Tokenization:
black cat => black + cat (English)
梵文基础读本 => 梵文+基础+读本 (Chinese)
東京タワー => 東京 + タワー (Japanese)
“Black cat” becomes the two searchable units “black” and “cat”; “梵文基础读本” becomes the three searchable units “梵文”, “基础” and “读本”; and “東京タワー” becomes the two searchable units “東京” and “タワー”.
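The tokenization examples above can be reproduced with a short sketch. The techniques CDI uses to find word boundaries are not described here, so the sketch substitutes a simple dictionary-based forward maximum matching approach with a toy lexicon; both the lexicon and the function names are assumptions.

```python
# Toy lexicon (assumption) covering only the examples above.
TOY_LEXICON = {"梵文", "基础", "读本", "東京", "タワー"}

def tokenize_whitespace(text):
    """Tokenize languages that separate words with white space."""
    return text.lower().split()

def tokenize_max_match(text, lexicon=TOY_LEXICON):
    """Forward maximum matching: take the longest lexicon word at each position."""
    longest = max(len(w) for w in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Production segmenters typically combine dictionaries with statistical models; maximum matching is used here only because it is short enough to read at a glance.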
Compound words are words that consist of multiple components that can stand as individual words on their own. In languages such as German, Swedish, and Danish, compound words are spelled without white space, and as a result, they can be very long.
Decompounding is the process of finding constituent parts in a compound word. CDI performs this process for languages such as German, Swedish, Danish and Korean. This process allows the patron to search for those constituent parts and get matches on the compound word.
For example, searching for the German words abwasser anlagen (wastewater plants in English) returns results matching the compound word abwasserbehandlungsanlage (wastewater treatment plant in English).
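A minimal decompounding sketch for the German example above is shown here. It uses a toy lexicon and allows an optional linking "s" between components; the lexicon, the linking rule, and the function name are assumptions, and real decompounders handle many more cases.

```python
# Toy lexicon (assumption) containing only the parts needed for the example.
TOY_LEXICON = {"abwasser", "behandlung", "anlage", "wasser"}
LINKING_ELEMENTS = ("", "s")   # components may be joined directly or by a linking "s"

def decompound(word, lexicon=TOY_LEXICON):
    """Split a compound into lexicon words, preferring the longest leading part.

    Returns an empty list when the word cannot be fully split.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        if not rest:
            return [head]
        for link in LINKING_ELEMENTS:
            if rest.startswith(link):
                tail = decompound(rest[len(link):], lexicon)
                if tail:
                    return [head] + tail
    return []
```

Once the compound is split into abwasser, behandlung, and anlage, a query for abwasser anlagen can match it, provided anlagen is also reduced to anlage by stemming.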
Stemming is the process of reducing inflected (or sometimes derived) words to their stems, or the root forms. Lemmatization is the process of converting various forms of a word to its dictionary form. Despite the slight differences, these processes have the same goals, and these terms are often used interchangeably. CDI performs language-specific stemming or lemmatization to allow the patron to search for a form of a word and get matches on other forms of the same word.
books vs. book (English)
ponies vs. pony (English)
theses vs. thesis (English)
maisons vs. maison (French)
grandes vs. grande (French)
Kinder vs. Kind (German)
In the first example above, searches for the word book will return results for both book and books. Searches for grande maison will return results for both grande maison and grandes maisons in French records.
Character normalization is the process of normalizing variants of a character to its basic version. Characters with diacritics are, for example, normalized to the characters without diacritics. CDI also provides character normalization for variants of Chinese characters.
Character normalization allows the patron to search for a word containing a diacritic and get results on the word without the diacritic, and vice versa. Similarly, it allows the patron to search for a Chinese word using the traditional characters, and get hits on the word spelled with the simplified characters, and vice versa. The character normalization mappings are mostly the same across all languages, but in some cases, language specific character normalization mappings are defined.
大学 vs. 大學 (Chinese)
México vs. Mexico (Spanish)
The Chinese search for 大學 will return results for 大学, and the Spanish search for Mexico will return results for México.
In some cases, CDI allows for multiple ways to represent a character with a diacritic. For example, the German umlauts ä, ö, and ü can be spelled without the diacritic as ae, oe and ue, or a, o, and u. CDI allows both variations. This allows the patron to search for schoen or schon and get results matching on schön. Another example is the Spanish ñ, which can be searched for by using ñ, n, or ni. This allows the query terms Espanol and Espaniol to return results matching on Español.
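The diacritic handling described above can be sketched by generating more than one normalized key per word. This is an illustrative sketch: the expansion and variant tables are hypothetical, and they are applied to every word here, whereas the text above notes that some mappings are language specific.

```python
import unicodedata

# Hypothetical tables, for illustration only.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ñ": "ni"}
CHARACTER_VARIANTS = {"學": "学"}   # traditional -> simplified

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_keys(word):
    """Return the set of normalized keys under which a word is searchable."""
    lowered = unicodedata.normalize("NFC", word.lower())
    lowered = "".join(CHARACTER_VARIANTS.get(ch, ch) for ch in lowered)
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in lowered)
    return {strip_diacritics(lowered), strip_diacritics(expanded)}

def matches(query_word, indexed_word):
    """True if the query and the indexed word share at least one key."""
    return bool(index_keys(query_word) & index_keys(indexed_word))
```

For example, `index_keys("schön")` produces the two keys schon and schoen, so both spellings without the umlaut find the word.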
Transliteration is the conversion of text from one script to another. This process allows the patron to search in one script and get hits on the same words written in another script.
CDI currently provides transliteration search features for Chinese (Hanzi-Pinyin), Japanese (Kanji/Katakana-Hiragana) and Korean (Hanja-Hangul) for titles and author names. Chinese Pinyin transliterations can be written with spaces between words (for example, beijing daxue), or with spaces at the Hanzi-character boundaries or syllable boundaries (for example, bei jing da xue).
The Chinese query beijingdaxue ("Peking University" in Pinyin transliteration) would return results containing the string 北京大学 ("Peking University" in Hanzi script).
The same Chinese query written as beijing daxue or bei jing da xue (use double quotes for better results) would also return results containing the string 北京大学 ("Peking University" in Hanzi script).
The Japanese query なつめそうせき ("Natsume Souseki" in Hiragana script) would return results containing the string 夏目漱石 ("Natsume Souseki" in Kanji script).
The Korean query 경제 (“economy" in Hangul script) would return results containing the string 經濟 (“economy" in Hanja script).
If a search is performed using transliteration, then transliterated search results are not necessarily the first results to display.
Elision in this case refers to the omission of a final vowel of a word when the following word begins with a vowel, and is observed in languages such as French and Italian.
For example, in French, the word sequence le + arbre becomes l'arbre. In Italian, lo + amico becomes l’amico.
CDI’s elision handling allows the patron to search for amico and get hits on l’amico.
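Elision handling of this kind can be sketched by stripping an elided prefix before a token is indexed or searched. The list of prefixes below is a hypothetical sample for French and Italian, not the list CDI uses.

```python
import re

# Hypothetical sample of elided forms followed by a straight or curly apostrophe.
ELISION = re.compile(
    r"^(?:l|d|j|m|n|s|t|c|qu|un|dell|all|nell|sull)['’]",
    re.IGNORECASE,
)

def strip_elision(token):
    """Remove a leading elided article or preposition, if present."""
    return ELISION.sub("", token)
```

With this sketch, `strip_elision("l’amico")` gives `amico`, while a name such as `O'Brien` is left untouched because "o" is not in the prefix list.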
Synonym Mapping and Spelling Normalization
CDI provides language-specific simple synonym mappings and spelling normalization. For example, in English, the words theater and theatre are two spellings of the same word. These are normalized during CDI's English text analysis, and as a result, the patron can search using one of these spellings and get hits on both spellings. Language-specific synonyms are also defined for cases where two words have the same meaning.
theater vs. theatre (English)
accessorize vs. accessorise (English)
analog vs. analogue (English)
ordenador vs. computadora (Spanish)
In addition, the ampersand (&) is equated with the appropriate word for the word and in each language.
Handling of Ampersand ("&") Character
The ampersand character ("&") is a synonym of and, et, und, or other equivalent words in CDI's supported languages. This allows the cross-searching of cats and dogs and cats & dogs in English documents, for example.
The synonym mapping is performed according to the language of each record. For example, & is mapped to and in English records, and & is mapped to et in French records. As a result, the number of results between the search queries cats and dogs and cats & dogs may not be the same because cats and dogs may appear in non-English records. Similarly, the number of results between chats et chiens and chats & chiens may not be the same because chats et chiens may appear in non-French records.
Currently, these mappings apply in all fields except the author field.
A stop word is a word that acts as a function word (such as a definite/indefinite article, preposition, pronoun, conjunction, or auxiliary verb), occurs very frequently in CDI, and does not have a common secondary meaning as a content word.
CDI maintains language-specific lists of stop words, which are filtered out in the execution of searches except when they are part of formal phrase searches as described below. Stop words are chosen according to the basic criteria given above.
CDI's current English stop words include a, an, the, and, but, or, it, of, on, with, in, is, and are, but they do not include the word will since it has a common secondary meaning as a noun.
In general, CDI ignores stop words in queries in order to improve the accuracy and efficiency of the search. However, in a phrase search (with search terms in double quotes), all stop words become required words, except those appearing at the end of the phrase. For example, the query man of the year includes two English stop words of and the. If this query is issued without double quotes (i.e., man of the year), it returns results containing the words man and year, and CDI's relevance algorithm boosts the ranking of results that contain the phrase man of the year.
If the query is issued as a phrase search with double quotes (for example, "man of the year"), CDI returns results containing the exact phrase and does not exclude the stop words. However, there are some limitations for phrase matches in the full text field, and phrases that end with stop words. For more details, see Phrase Search.
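The stop word behavior for phrase and non-phrase queries can be sketched as follows. The stop word set contains only the English words named in this document, and the function name is an assumption; the full text exception is not modeled.

```python
# English stop words named in this document (not the complete list).
ENGLISH_STOP_WORDS = {
    "a", "an", "the", "and", "but", "or", "it", "of", "on",
    "with", "in", "is", "are", "was", "for",
}

def required_terms(query):
    """Return the terms a result must contain for the given query."""
    text = query.strip()
    is_phrase = len(text) >= 2 and text[0] == '"' and text[-1] == '"'
    words = text.strip('"').lower().split()
    if is_phrase:
        # In a phrase search, stop words are required except at the end.
        while words and words[-1] in ENGLISH_STOP_WORDS:
            words.pop()
        return words
    # Outside a phrase search, stop words are ignored.
    return [w for w in words if w not in ENGLISH_STOP_WORDS]
```

The unquoted query man of the year requires only man and year, the quoted query requires all four words in order, and the quoted query "there she was" loses its trailing stop word.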
Verbatim Match Boost
This is one of the most important features of CDI's native language search support. Many of the features described in this document allow the patron to get results when the query terms and the indexed terms are equivalent, but not exactly the same word for word (verbatim). These features have an effect of increasing the number of results, or in other words, increasing the search recall. While these features will provide a better user experience to the patrons, there is a risk of including less relevant or non-relevant results, and reducing the search precision.
The Verbatim Match Boost feature addresses this concern by boosting the relevance score of a result when the matching of the query terms and the indexed terms is verbatim or near verbatim.
For the English search query theatres, the results for theatres get higher relevance scores (and are ranked higher) than the results for theaters or theatre if all other factors that contribute to the relevance scoring calculation are equal.
The Verbatim Match Boost feature is applied to almost all processes mentioned in this support document. The actual implementation of this feature works by penalizing non-verbatim matches, thus effectively boosting the verbatim matches. Penalties are computed for each term/token, and the penalty amount is defined for each process in each language, such that major differences, such as synonyms, are penalized more than minor differences, such as spelling differences and singular vs. plural forms of nouns.
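The penalty-based implementation described above can be illustrated with a toy scoring function. The penalty values are invented for the sketch; the only property taken from the text is that synonyms are penalized more than spelling differences or singular/plural differences, and that verbatim matches are not penalized.

```python
# Hypothetical per-process penalties for one language (fractions of the base score).
PENALTIES = {
    "verbatim": 0.0,   # query term and indexed term are identical
    "plural": 0.1,     # singular vs. plural form
    "spelling": 0.1,   # spelling variant, such as theatres vs. theaters
    "synonym": 0.4,    # synonym mapping
}

def term_score(base_score, match_type):
    """Score for one matched term after applying its non-verbatim penalty."""
    return base_score * (1.0 - PENALTIES[match_type])

def query_score(matches):
    """Sum the penalized scores of (base_score, match_type) pairs."""
    return sum(term_score(base, kind) for base, kind in matches)
```

Under these assumed values, a record matching the query theatres verbatim outscores an otherwise identical record that matches only through the spelling variant theaters.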
When a patron issues a search query in Primo backed by the Ex Libris Central Discovery Index (CDI), the query is issued to both the local index and CDI. Search results from each of these indexes are ranked according to their relevance ranking algorithms, and they are blended to form the final search results presented to the patron. This document discusses the relevance ranking algorithm used by CDI.
Relevance ranking in CDI is determined according to a continuously tuned, proprietary algorithm, and is built on a foundation of two building blocks: the Dynamic Rank and the Static Rank. The Dynamic Rank is a collection of relevance factors that represent how well a search query matches each record, and the Static Rank is a collection of relevance factors that represent the value or importance of each record. Both of these are important in determining the ranking, and top results need to have good scores from both Dynamic Rank and Static Rank.
The Dynamic Rank represents how well the user's query matches each record. Dynamic Rank factors include the following:
Field weighting – When a query term or phrase matches in a field of a record, a score is generated according to the importance of the field. For example, Title, Subtitle and Subject are the highest weighted fields. The Creator and Abstract fields are weighted lower than these, but higher than other metadata fields. The Full Text field is weighted the lowest.
Term weighting – Matches on rare terms are weighted higher than matches on common terms. For example, if a given query is yoruba books, the less common term "yoruba" has a higher influence than the common term "books".
Term frequency and field length – The number of times a matching term is repeated within a field is also considered. For example, if a given query is nanobiotechnology, an abstract that contains five occurrences of the term would score higher than an abstract of the same length that contains the term only once. Similarly, the length of the field where a match occurs is considered in determining the weight of the match.
Verbatim match boost – A given query term could match an indexed term via multilingual search features, such as stemming, synonym mapping, and character normalization. Such non-verbatim matches are weighted less than verbatim matches where the query term is exactly the same as the indexed term. For example, if a given query is cliché, matches on cliché are scored higher than matches on cliches.
Phrase and proximity match boost – If a given query contains multiple terms and double quotes are not used, matches on the exact phrases (phrase match) and close phrase matches (proximity matches) are given a boost in the score. For example, if a given query is American history (without double quotes), the exact phrase match "American history" scores higher than the non-exact phrase match (proximity match) "American automobile history”, which in turn scores higher than a match on "American" and "history" appearing in different fields.
Exact title and title+subtitle match boost – The exact title match boost feature boosts scores for cases where a given query matches the title or title+subtitle. This helps known item searches consisting of a title or title+subtitle.
Known item search boost – In addition to the exact title match boost feature above, the known item search boost feature emphasizes matches where a given query contains a combination of common elements of known item searches, such as title, subtitle, author, and publication title. For example, a query an inconvenient truth global warming al gore (without double quotes) boosts matches on the books titled "An Inconvenient Truth: The Planetary Emergency of Global Warming and What We Can Do About It" and "An Inconvenient Truth: The Crisis of Global Warming" authored by Al Gore.
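The first three Dynamic Rank factors (field weighting, term weighting, and term frequency with field length) can be combined in a toy scoring function. The CDI algorithm is proprietary, so every number and formula below is an assumption: the field weights only preserve the ordering described above, term weighting is approximated with an inverse document frequency, and field length is handled with a square-root normalization.

```python
import math

# Hypothetical weights that preserve the ordering described above.
FIELD_WEIGHTS = {
    "title": 3.0, "subtitle": 3.0, "subject": 3.0,
    "creator": 2.0, "abstract": 2.0,
    "fulltext": 0.5,
}
DEFAULT_FIELD_WEIGHT = 1.0   # other metadata fields

def idf(term, doc_freq, total_docs):
    """Rare terms get a higher weight than common terms."""
    return math.log(1 + total_docs / (1 + doc_freq.get(term, 0)))

def dynamic_score(query_terms, record, doc_freq, total_docs):
    """Toy dynamic score for a record given as a dict of field name -> text."""
    score = 0.0
    for field, text in record.items():
        words = text.lower().split()
        if not words:
            continue
        weight = FIELD_WEIGHTS.get(field, DEFAULT_FIELD_WEIGHT)
        for term in query_terms:
            tf = words.count(term)
            if tf:
                # More occurrences raise the score; longer fields dilute it.
                score += weight * idf(term, doc_freq, total_docs) * tf / math.sqrt(len(words))
    return score
```

The sketch omits the verbatim, phrase, proximity, exact title, and known item boosts, which would be layered on top of a base score like this one.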
The Static Rank represents the value of each item, and does not pertain to the user's query terms. Static Rank factors include the following:
Resource type – Items are weighted according to their resource types. For example, books are weighted higher than book reviews; articles (journal articles) are weighted higher than newspaper articles, and so on.
Publication date – Recent items are weighted higher than older items. CDI uses carefully designed mathematical functions specific to each content type to maximize the effectiveness of this factor. For example, the penalty for having an old publication date is higher for articles than for books.
Scholarly/Peer review – Articles from "scholarly" or "peer reviewed" journals are boosted.
Citation counts – Citation counts are used to reward publications with high citation counts.
Journal rank – Journal scores in academic journal rankings are also considered, and articles published in highly respected academic journals are boosted according to their journal scores.
Anonymous author – Anonymous author items are demoted. Anonymous items may include editor's notes, letters to the editor, obituaries, and other non-primary articles in journals.
Each record's Static Rank score is determined as a combination of scores calculated from these factors, using carefully designed mathematical functions. For example, a journal article published 5 years ago with 100 citations would probably have a higher Static Rank score than a journal article published 6 months ago with 0 citations. In this case, the benefit of the high citation counts of the first record outweighs the benefit of the recency of the second record.
The scores from Dynamic Rank and Static Rank are then combined to determine the relevance score of each record for the given query. The ranking of a search result set is determined by the final relevance scores of the records in the result set.
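The way Static Rank factors might combine, and how the result might be merged with a Dynamic Rank score, can be sketched as below. The actual functions are proprietary and not described in this document; the half-lives, weights, and the multiplicative combination are all assumptions chosen so that the sketch reproduces the examples given above.

```python
import math

# Hypothetical parameters, for illustration only.
HALF_LIFE_YEARS = {"article": 5.0, "book": 15.0}   # articles age faster than books
RESOURCE_WEIGHT = {"book": 1.0, "article": 1.0, "newspaper_article": 0.6, "book_review": 0.5}

def static_score(resource_type, age_years, citations, peer_reviewed=False, anonymous=False):
    """Toy static score built from resource type, recency, citations, and flags."""
    half_life = HALF_LIFE_YEARS.get(resource_type, 10.0)
    recency = 0.5 ** (age_years / half_life)
    score = RESOURCE_WEIGHT.get(resource_type, 0.8) * recency
    score += 0.2 * math.log1p(citations)   # diminishing returns on citation counts
    if peer_reviewed:
        score += 0.3
    if anonymous:
        score -= 0.3
    return score

def relevance(dynamic, static):
    """Combine the two ranks; a record needs both to reach the top."""
    return dynamic * (1.0 + static)
```

With these assumed parameters, a five-year-old article with 100 citations receives a higher static score than a six-month-old article with no citations, matching the example above.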
CDI's relevance ranking algorithm is tuned to provide the best search experience for both known item searching and other types of searching (for example, subject searching, exploratory searching, topical searching, existence searching, unknown item searching, and so forth). Additionally, there are aspects of CDI relevance that assist the whole user community, from the novice researcher to the professional researcher and all user types in between. For example, short and general topical queries (for example, linguistics, global warming) tend to return more books, eBooks, references and journals among the top results, and long and specific topical queries (for example, linguistics universal grammar, global warming Kyoto protocol) tend to return more articles among the top results.
CDI overlays this foundation with a regimen of judgments to ensure that relevance as a whole remains strong as individual pieces of the system are improved. The relevance ranking system in CDI is shared by all customers, and is not customizable for individual institutions.