Brief guidelines on authority control decision-making - Metadata and Cataloging

What is authority control?
Why is authority control important?
What are controlled vocabularies?
How do I know when I need authority control?
Architectural issues for authority file building & linking
Sources for name headings
Sources for geographic headings
Sources for subject terms
Sources for media types, genre/format, and other resource attributes
How the NCSU Libraries Metadata & Cataloging department can help you

What is authority control?

Authority control is concerned with building and maintaining controlled vocabularies of terms, such as names, subjects, media types, and titles, to be used as headings in bibliographic records.

Why is authority control important?

Authority control is invaluable to search and discovery because it standardizes multiple forms of a given search term, increasing the likelihood that the search will return all relevant items.

On the end-user side, authority control can be used to lead to authorized forms of names or to expose the entire thesaurus of authorized terms so that a selection can be made from that list. This helps the user to better understand the scope of a collection and how it has been described, and also gives them some confidence that the metadata has been created in a consistent and well thought-out manner. In addition, authority control enables direct linking between terms of relevance to the searcher in one resource and other resources with the same subject, author, or format.

Authority control benefits both metadata creators and the users of that metadata. For metadata creators it saves time, since data entry can be set up so that typing a few words of an entry brings up an already established term or phrase. This works against simple lists of names or subject terms, and also with long phrases that contain policy statements, such as rights or copyright declarations. Entering an unauthorized term or phrase will result in no matches against the authority file, letting the metadata creator know either that the term is new to the list or that it has not been entered in the authorized form. If cross-references are built into the authority file, entering an unauthorized term or name could lead you to the authorized form and, depending on how your system is designed, change your entry accordingly.
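The lookup behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two headings, the cross-references, and the function names (normalize, resolve, suggest) are hypothetical and do not describe any particular system.

```python
from __future__ import annotations

# Hypothetical authority file: authorized headings plus cross-references
# from variant (unauthorized) forms. All entries are illustrative.
AUTHORIZED = {"Twain, Mark, 1835-1910", "4-H clubs"}
SEE_REFERENCES = {
    "clemens, samuel langhorne": "Twain, Mark, 1835-1910",
    "four-h clubs": "4-H clubs",
}


def normalize(term: str) -> str:
    """Collapse whitespace and case so matching ignores trivial differences."""
    return " ".join(term.split()).casefold()


def resolve(term: str) -> tuple[str, str | None]:
    """Return (status, heading) for a term typed by a metadata creator."""
    key = normalize(term)
    for heading in AUTHORIZED:
        if normalize(heading) == key:
            return ("authorized", heading)
    if key in SEE_REFERENCES:
        # A cross-reference leads from the variant to the authorized form.
        return ("variant", SEE_REFERENCES[key])
    # No match: the term is new, or was not entered in an established form.
    return ("no match", None)


def suggest(prefix: str) -> list[str]:
    """Type-ahead: authorized headings that start with the typed prefix."""
    key = normalize(prefix)
    return sorted(h for h in AUTHORIZED if normalize(h).startswith(key))
```

A data entry form might call suggest() as the creator types and resolve() when the field is saved; whether a variant is replaced automatically or only flagged is a local design decision.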

What are controlled vocabularies?

Controlled vocabularies are lists of terms used to provide consistency in form and definition within a given field for retrieval purposes. These can be short lists developed internally to guide metadata creators in categorizing materials, such as lists of date ranges, material formats, or even metadata creators, or they can be large external lists, such as structured thesauri, subject and name authority files, or standardized technical term lists. These external lists are often maintained by agencies that are expert in their fields, such as the National Library of Medicine, the Getty Research Institute, the United States Geological Survey, IEEE, W3C, and so on, and they are offered as public services through user-friendly Web interfaces.

How do I know when I need authority control?

Authority control can be expensive and time-consuming work. Conversely, it can also save a lot of time and money, particularly when you can reuse vocabularies already developed and maintained externally. Because of the cost, it should be reserved primarily for fields that affect retrieval or for which you want to be able to generate tailored administrative reports and other outputs. A good example of this would be where you receive data from multiple agencies or individuals and you want to track the source of your data for tax, acknowledgement or other purposes. Another example would be where you have only five possible digital rights or copyright statements and you don’t want to have to reenter this data each time you create a new record. Of course, the most obvious example is where you want to make use of an externally maintained vocabulary (such as Library of Congress Subject Headings, Medical Subject Headings, the Art and Architecture Thesaurus, or GeoNames) so that you can build on the work of others in your field of interest and present terms that are familiar to your end users. In some cases, these external sources also provide useful Web services that may enable your IT staff to automatically extract changes to the vocabulary made by the thesaurus maintenance agency, or to provide additional metadata, such as maps based on coordinate information stored within the parent thesaurus but not locally.

Architectural issues for authority file building & linking

The size and location of the authority file and the structure of your data store will largely determine what your options are for building, maintaining, and linking controlled vocabularies to your metadata creation project.

For small vocabularies, lists might be built into field definitions within a database, programmed as drop-down values on a data entry form, or listed next to the appropriate box on a Web input form. This method should only be used where you don’t need to store additional administrative information about the terms used, e.g., source or creator of data, source vocabulary, scope and description. The ideal would be just the values themselves, or a code plus the term you want displayed. Here are examples where this solution works best:

Date ranges (by decade, century, artistic or cultural period, etc.)
Format, data type, data source
Metadata creators or editors
Rights statements

Initials	Metadata creator
ab	Black, Adrian
sc	Cole, Stephen
tm	Matthews, Teresa
as	Smith, Albert

Code	Date range
3	1931-1940
4	1941-1950
5	1951-1960
6	1961-1970
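As a rough sketch, small lists like the two sample tables above could be held as simple code-to-term mappings that feed a drop-down and validate what is entered. The function names (dropdown_options, display_term) are hypothetical, and a real form would more likely define such lists in the database or form tool itself.

```python
# Code -> display term, copied from the sample tables above.
CREATORS = {
    "ab": "Black, Adrian",
    "sc": "Cole, Stephen",
    "tm": "Matthews, Teresa",
    "as": "Smith, Albert",
}
DATE_RANGES = {
    "3": "1931-1940",
    "4": "1941-1950",
    "5": "1951-1960",
    "6": "1961-1970",
}


def dropdown_options(vocabulary):
    """Return (code, display term) pairs sorted by display term."""
    return sorted(vocabulary.items(), key=lambda item: item[1])


def display_term(vocabulary, code):
    """Look up the display term for a stored code, rejecting unknown codes."""
    if code not in vocabulary:
        raise ValueError(f"{code!r} is not in the controlled list")
    return vocabulary[code]
```

Only the code needs to be stored with each record; the display term is looked up when the record is shown.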

For larger vocabularies that are dynamic or that need to support cross-references from variant forms of a name, you will need easy access to the vocabulary for maintenance purposes. Here, it is better to have the list separated from the main data store itself, either as a separate table within an underlying database structure, or as pointers from your metadata to an external source of terms using a URI, system control number, or other unique identifier. A relational database structure is ideal for this, in that changes to the separate vocabulary can effect global changes to linked records without having to individually edit them all. This is how most integrated library systems handle name, subject and series entries, where a change from one term to another in the authority file can cause thousands, or even millions, of headings to change in related bibliographic records with little or no human intervention. Here are examples of where this solution works best:

Personal, corporate, and geographic names
Subject terms
Standardized data/mime types
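The relational approach described above can be illustrated with a small sketch using Python's built-in sqlite3 module. The table and column names, the record titles, and the misspelled starting heading are all hypothetical; the point is only that records store a pointer to the heading, so a single correction in the authority table shows up on every linked record.

```python
import sqlite3

# Hypothetical schema: headings live in their own table, and records
# point to them by ID instead of storing the heading text.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authority (
        id      INTEGER PRIMARY KEY,
        heading TEXT NOT NULL UNIQUE
    );
    CREATE TABLE record (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE record_heading (
        record_id    INTEGER NOT NULL REFERENCES record(id),
        authority_id INTEGER NOT NULL REFERENCES authority(id)
    );
    """
)
conn.execute("INSERT INTO authority (id, heading) VALUES (1, 'Big Savanna (N.C.)')")
conn.executemany(
    "INSERT INTO record (id, title) VALUES (?, ?)",
    [(1, "Photograph A"), (2, "Photograph B")],
)
conn.executemany(
    "INSERT INTO record_heading (record_id, authority_id) VALUES (?, ?)",
    [(1, 1), (2, 1)],
)


def displayed_headings(connection):
    """Return (title, heading) pairs as a retrieval system would show them."""
    rows = connection.execute(
        """
        SELECT record.title, authority.heading
        FROM record
        JOIN record_heading ON record_heading.record_id = record.id
        JOIN authority ON authority.id = record_heading.authority_id
        ORDER BY record.id
        """
    )
    return rows.fetchall()


before = displayed_headings(conn)
# One edit to the authority table corrects the heading on every linked record.
conn.execute(
    "UPDATE authority SET heading = ? WHERE id = ?",
    ("Big Savannah (N.C.)", 1),
)
after = displayed_headings(conn)
```

Neither bibliographic record is touched by the update; only the single authority row changes.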

When using an external vocabulary, such as Library of Congress Subject Headings (LCSH), GeoNames, or the Getty Research Institute’s Union List of Artist Names (ULAN), you should record not only the form of name heading that you are using from that vocabulary, but also whatever system number applies within the external file. This will enable you to build Web services so that the terms can be maintained automatically as the source vocabulary changes. If you are unable to find a term within the selected vocabulary that matches your needs, mark the term as “local” so that it can be checked against the source vocabulary periodically, in case the term is incorporated there later.

For larger vocabularies, here are some elements you will typically want to capture:

Authorized term or heading
Internal (local) ID number
Source vocabulary (this itself can be linked to its own authority file, where you might store serviceable information such as base URL)
Source ID number
Metadata creator
Date captured/last revised
Type of term
Unauthorized equivalent terms (cross-references)
Broader and/or narrower terms
Scope/description of resource named in heading (historic/descriptive notes)

So, an example of this type of controlled vocabulary could look like this:

Local ID	Subject	Type	Source	Source ID	Creator	DateRev	Scope
57894	4-H clubs	topical	lcsh	sh 85000002	cp	20020718	Use for the 4-H movement in general
08756	African American college athletes	topical	lcsh	sh2004010437	ed	20050620
23111	Albumen prints	genre	lctgm	tgm000227	sh	20090330
00399	Big Savannah (N.C.)	geographic	local		cp	20100823

Links to this table from controlled lists of “type”, “source”, and “creator” could also be made, to ensure consistency in form of these terms.

Source ID	Source Code	Source	Base URI
0093	oclc	WorldCat Identities	http://www.worldcat.org/identities/
0009	lcnaf	Library of Congress Name Authority File	http://id.loc.gov/authorities/
0033	lcsh	Library of Congress Subject Headings	http://id.loc.gov/authorities/
0002	lctgm	Thesaurus for Graphic Materials	http://www.loc.gov/pictures/item/
0001	local	Local use
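One possible way to model these two tables in code is sketched below, using Python dataclasses. The class and field names are assumptions chosen for illustration; the sample values are copied from the tables above. The local_terms helper shows how terms marked “local” might be pulled out for periodic checking against a source vocabulary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    """One row of the source vocabulary list."""
    code: str
    name: str
    base_uri: str | None = None


@dataclass
class AuthorityRecord:
    """One row of the controlled vocabulary, linked to its source."""
    local_id: str            # kept as text to preserve leading zeros
    heading: str
    term_type: str
    source: Source
    source_id: str | None    # ID within the external file, if any
    creator: str
    date_revised: str        # YYYYMMDD, as in the sample table
    scope: str = ""
    see_from: list[str] = field(default_factory=list)  # cross-references

    @property
    def is_local(self) -> bool:
        return self.source.code == "local"


LCSH = Source("lcsh", "Library of Congress Subject Headings", "http://id.loc.gov/authorities/")
LCTGM = Source("lctgm", "Thesaurus for Graphic Materials", "http://www.loc.gov/pictures/item/")
LOCAL = Source("local", "Local use")

RECORDS = [
    AuthorityRecord("57894", "4-H clubs", "topical", LCSH, "sh 85000002", "cp",
                    "20020718", "Use for the 4-H movement in general"),
    AuthorityRecord("08756", "African American college athletes", "topical", LCSH,
                    "sh2004010437", "ed", "20050620"),
    AuthorityRecord("23111", "Albumen prints", "genre", LCTGM, "tgm000227", "sh", "20090330"),
    AuthorityRecord("00399", "Big Savannah (N.C.)", "geographic", LOCAL, None, "cp", "20100823"),
]


def local_terms(records):
    """Terms marked local, to be rechecked against a source vocabulary later."""
    return [record for record in records if record.is_local]
```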

Consultation in building these types of data structures is available from Metadata and Data Quality within Metadata and Cataloging, as well as from the Digital Libraries Initiatives and Information Technology Departments.

Sources for name headings

Name headings include personal, corporate, series, title, and meeting headings. Personal names include the names of individuals and families. Corporate names may be names of government agencies, universities or departments within universities, business names, churches, associations, musical groups, and any other aggregates with a corporate identity. In Library of Congress metadata practices, named boats and buildings are considered corporate names. Meetings include academic conferences, as well as specific sporting events, World’s Fairs, and similar activities.

Sources for name headings	Abbreviation	URI
Library of Congress Name Authorities	lcnaf	http://authorities.loc.gov/, http://id.loc.gov/authorities/
Union List of Artist Names	ulan	http://www.getty.edu/research/conducting_research/vocabularies/ulan/
WorldCat Identities		http://www.worldcat.org/identities/

Sources for geographic headings

Geographic headings include names of planets, continents, countries, states, counties, cities, islands, bodies of water, and other topographic features. Depending on how you intend to use these headings, they can be used directly, as in “Raleigh, N.C.”, or hierarchically, as in “United States—North Carolina—Wake County—Raleigh.” How you structure these headings will depend on how you want to be able to search and/or retrieve them. If you wish to take the hierarchical approach, it can be useful to build that into your authority file so that headings could be searched in either way, but metadata creators only have to enter them directly. This will speed up data entry, without losing the power to narrow the search starting at the highest level and proceeding to the local level.
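A minimal sketch of this idea follows, assuming a hypothetical authority file that stores the hierarchy once for each direct heading. The two entries, the function names, and the use of a double hyphen as the separator are illustrative choices, not a prescribed format.

```python
# Hypothetical geographic authority file: the metadata creator enters only
# the direct form; the hierarchy is stored once, here.
GEO_AUTHORITY = {
    "Raleigh, N.C.": ("United States", "North Carolina", "Wake County", "Raleigh"),
    "Durham, N.C.": ("United States", "North Carolina", "Durham County", "Durham"),
}


def hierarchical(direct_heading, separator="--"):
    """Build the hierarchical form of a heading from its direct form."""
    return separator.join(GEO_AUTHORITY[direct_heading])


def headings_within(place):
    """Direct headings whose hierarchy includes the given place at any level."""
    return sorted(
        direct for direct, path in GEO_AUTHORITY.items() if place in path
    )
```

With this arrangement a search can start at "North Carolina" and narrow to "Wake County", while each record still carries only the direct heading.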

Sources for geographic headings	Abbreviation	URI
GeoNames	geonames	http://www.geonames.org/
Library of Congress Name Authorities	lcnaf	http://authorities.loc.gov/, http://id.loc.gov/authorities/
USGS Geographic Names Information System	gnis	http://geonames.usgs.gov/
Getty Thesaurus of Geographic Names	tgn	http://www.getty.edu/research/conducting_research/vocabularies/tgn/

Sources for subject terms

Subject terms or headings can be at various levels of specificity, depending on the size of your data store and on your retrieval system’s search capabilities. The larger the data store, the more granular the subject system should be, to limit the number of postings on any given term. Obviously, if your entire data store is on the topic of “North Carolina” you don’t need to supply that as a subject heading on each record if the file will only reside within a local retrieval system. However, if the entire collection is going to be added to a more general data store, such as the Library’s catalog or OCLC WorldCat, then you will need that subject to distinguish this collection from others.

Bear in mind that many subject lists that are available to the public have complex rules for formulation of headings. Some allow subdivisions that can be either pre-coordinated (the entire string is specified in the vocabulary) or post-coordinated (you can synthesize the heading by combining a topical heading with a geographic, genre, form or date subdivision). Allowing for and controlling this sort of term combination is one of the more difficult architectural issues for most retrieval systems.
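To make the synthesized case concrete, here is a deliberately simplified sketch in which each part of a heading is checked against its own controlled list before the parts are joined. The lists, the function name, and the double-hyphen separator are assumptions for illustration; real vocabularies also have rules about which subdivisions may follow which headings, and those rules are not modeled here.

```python
# Hypothetical controlled lists; a real vocabulary has far more terms.
TOPICAL = {"4-H clubs", "Winter sports"}
GEOGRAPHIC = {"North Carolina"}
FORM = {"Photographs"}


def build_heading(topic, geographic=None, form=None, separator="--"):
    """Combine a topical heading with optional subdivisions, checking each
    part against its own controlled list before joining them."""
    parts = [("topical", topic, TOPICAL)]
    if geographic is not None:
        parts.append(("geographic", geographic, GEOGRAPHIC))
    if form is not None:
        parts.append(("form", form, FORM))
    for kind, value, allowed in parts:
        if value not in allowed:
            raise ValueError(f"{value!r} is not an authorized {kind} term")
    return separator.join(value for _, value, _ in parts)
```

Storing the parts separately and joining them only for display is one way to keep each component under authority control.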

In assigning subject terms, you first need to establish a framework within which to do this. The cataloging community has created complex guidelines for subject analysis of books and other print materials, and has attempted to extend this to new media over the last few years, with varying degrees of success. Subject analysis for books is based on the “whole book” concept, which says that subject headings should be at the level of specificity of the material in hand. If the book is on “winter sports” and there is a subject heading at that level, then there is no need to also use the subject headings “Hockey”, “Skis and skiing”, “Skating”, “Bobsledding”, “Tobogganing”, and “Snowmobiling”. These are narrower terms on the authority record for “Winter sports”. Additional subject headings may be assigned for portions of the book, but a portion must make up a required percentage of the book before subjects can be assigned for it. Of course, local needs override all of these rules, and this may be the case for your project as well. Metadata and Cataloging can help you work through these issues and achieve a balance between overassignment of subject terms and underassignment.

Sources for subject terms	Abbrev	Discipline	URI
Library of Congress Subject Headings	lcsh	general	http://authorities.loc.gov/, http://id.loc.gov/authorities/
Art & Architecture Thesaurus	aat	art, design, architecture	http://www.getty.edu/research/conducting_research/vocabularies/aat/
CAB Thesaurus	cab	agriculture, life sciences	http://www.cabi.org/cabthesaurus/
INSPEC Thesaurus	inspec	computer science, telecommunications	http://www.theiet.org/publishing/inspec/products/range/thesaurus.cfm, http://www.theiet.org/publishing/inspec/products/range/thesxml.cfm
Medical Subject Headings	mesh	health, life sciences, medicine	http://www.nlm.nih.gov/mesh/
Thesaurus for Graphic Materials I	tgm1	graphics	http://lcweb.loc.gov/rr/print/tgm1/

Sources for additional controlled subject vocabularies	URI
Online Thesauri & Authority Files	http://www.asindexing.org/site/thesonet.shtml
Taxonomy Warehouse	http://www.taxonomywarehouse.com/
Taxonomy ShareSpace	http://www.taxobank.org/
Taxonomies & Controlled Vocabularies SIG, ALA	http://www.taxonomies-sig.org/links.htm

Sources for media types, genre/format, and other resource attributes

Depending on the size of your project and the nature of its contents, you may or may not want to set up lists to control the attributes of your data. If your entire project is describing streaming media, you probably don’t need to identify the format of individual resources. However, if you are attempting to control more than one file type, media type, genre and so on, there are vocabularies available for that.

Sources for media types, genre/format, and other resource attributes	Abbreviation	URI
LC Genre and Form Thesaurus	lcgft	http://authorities.loc.gov/, http://id.loc.gov/authorities/
MARC Genre Term List	marcgt	http://www.loc.gov/standards/valuelist/marcgt.html
Mime media types	mime	http://www.iana.org/assignments/media-types/, http://www.htmlquick.com/reference/mime-types.html
Moving Image Genre List	miggen	http://www.loc.gov/rr/mopic/miggen.html
RDA carrier types	rdacarrier	http://www.loc.gov/standards/valuelist/rdacarrier.html
RDA content types	rdacontent	http://www.loc.gov/standards/valuelist/rdacontent.html
RDA media types	rdamedia	http://www.loc.gov/standards/valuelist/rdamedia.html
Source codes for vocabularies, rules & schemes		http://www.loc.gov/standards/sourcelist/
Thesaurus for Graphic Materials II: Genre & Physical Characteristic Terms	tgm2	http://www.loc.gov/rr/print/tgm2/