Wayback CDX Server API - BETA
Changelist
Table of Contents
Intro and Usage
Advanced Usage
Intro and Usage
The wayback-cdx-server is a standalone HTTP servlet that serves the index that the wayback machine uses to look up captures.
The index format is known as 'cdx' and contains various fields representing the capture, usually sorted by url and date (see http://archive.org/web/researcher/cdx_file_format.php).
The server responds to GET queries and returns either the plain text CDX data, or optionally a JSON array of the CDX.
The CDX server is deployed as part of the web.archive.org Wayback Machine, and the usage examples below reference this deployment.
However, the cdx server is freely available with the rest of the open-source wayback machine software in this repository.
Further documentation will focus on configuration and deployment in other environments.
Please contact us at wwm@archive.org with additional questions.
Basic Usage
The simplest query, and the only required param for the CDX server, is the url param:
http://web.archive.org/cdx/search/cdx?url=archive.org
The above query will return a portion of the index, one row for each 'capture' of the url "archive.org" that is available in the archive.
The columns of each line are the fields of the cdx. At this time, the following cdx fields are publicly available:
["urlkey","timestamp","original","mimetype","statuscode","digest","length"]
It is possible to customize the Field Order as well.
The url= value should be url encoded if the url itself contains a query.
All other params are optional and are explained below.
For doing large/bulk queries, the use of the Pagination API is recommended.
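As a minimal sketch of the basic query above, the following builds query URLs with the Python standard library, which takes care of url encoding the url= value. The helper name build_cdx_query and the example.com url are hypothetical and used only for illustration; nothing is sent over the network.

```python
from urllib.parse import urlencode

# Endpoint taken from the example query above.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, **params):
    """Return a CDX query URL; urlencode() percent-encodes every value."""
    query = {"url": url}
    query.update(params)
    return CDX_ENDPOINT + "?" + urlencode(query)


# The only required param is url.
simple = build_cdx_query("archive.org")

# A url that itself contains a query (hypothetical) must be encoded,
# otherwise its '&' and '=' would be read as CDX server params.
with_query = build_cdx_query("example.com/search?q=cdx&page=2", limit=3)

print(simple)
print(with_query)
```

Passing the optional params as keyword arguments keeps the required url param separate from everything else.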
Url Match Scope
The default behavior is to return matches for an exact url. However, the cdx server can also return results matching a certain prefix, a certain host or all subdomains by using the matchType= param.
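For example, given the url archive.org/about/, matchType=exact (the default) returns only captures of exactly archive.org/about/, matchType=prefix returns captures of every url under the path archive.org/about/, matchType=host returns captures from the host archive.org, and matchType=domain returns captures from archive.org and all of its subdomains.
The matchType may also be set implicitly by using a wildcard '*' at the end or beginning of the url: url=archive.org/* is equivalent to url=archive.org/&matchType=prefix, and url=*.archive.org is equivalent to url=archive.org&matchType=domain.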
(Note: The domain mode is only available if the CDX is in SURT-order format.)
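The implicit wildcard form can be made explicit on the client side. The sketch below assumes that a trailing '*' selects prefix matching and a leading '*.' selects domain matching; the function name resolve_match_type is hypothetical, and the server's own parsing may differ in edge cases.

```python
def resolve_match_type(url, match_type=None):
    """Return (url, matchType) with any wildcard in url made explicit.

    Assumption: a trailing '*' means prefix, a leading '*.' means domain.
    """
    if url.endswith("*"):
        return url[:-1], "prefix"
    if url.startswith("*."):
        return url[2:], "domain"
    # No wildcard: keep the caller's choice, defaulting to an exact match.
    return url, match_type or "exact"


print(resolve_match_type("archive.org/about/"))
print(resolve_match_type("archive.org/about/*"))
print(resolve_match_type("*.archive.org"))
```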
Output Format (JSON)
Adding output=json returns the results as a JSON array. The JSON output currently also includes a first row which lists the cdx field names.
Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=3
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
 ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
 ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
 ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"]]
By default, the CDX server returns gzip-encoded data for all queries. To turn this off, add the gzip=false param.
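Because the first row of the JSON output names the fields, each capture can be turned into a dictionary without hard-coding the column order. This is a small sketch using rows copied from the example above; the function name rows_to_dicts is hypothetical.

```python
import json

# First two captures from the JSON example above.
SAMPLE = """
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
 ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
 ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"]]
"""


def rows_to_dicts(json_text):
    """Turn CDX JSON output into a list of dicts keyed by the header row."""
    rows = json.loads(json_text)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


captures = rows_to_dicts(SAMPLE)
for capture in captures:
    print(capture["timestamp"], capture["statuscode"], capture["original"])
```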
Field Order
It is possible to customize the fields returned from the cdx server using the fl= param. Simply pass in a comma-separated list of fields and only those fields will be returned.
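When fl= is used with the plain text output, the client has to remember which fields it asked for in order to label the columns. The sketch below assumes the plain text fields are separated by single spaces, as in the plain text examples in this document; the sample line and the name parse_cdx_line are hypothetical.

```python
# Fields requested from the server, in the order they should come back.
FIELDS = ["timestamp", "mimetype"]

# Value to pass as the fl= param.
fl_param = ",".join(FIELDS)


def parse_cdx_line(line, fields):
    """Map one space-separated plain text CDX line onto the requested fields."""
    values = line.split(" ")
    if len(values) != len(fields):
        raise ValueError("expected %d fields, got %d" % (len(fields), len(values)))
    return dict(zip(fields, values))


# Hypothetical line, shaped like the response to fl=timestamp,mimetype.
row = parse_cdx_line("19970126045828 text/html", FIELDS)
print(fl_param, row)
```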
Filtering
Collapsing
A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field. Collapsing works on adjacent cdx lines: within each run of lines that share the same value, every capture after the first is filtered out. This is useful for thinning out captures that are 'too dense' or when looking for unique captures.
To use collapsing, add one or more collapse=field or collapse=field:N where N is the first N characters of field to test.
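To make the 'adjacent lines' behavior concrete, here is a local model of collapsing written with itertools.groupby. It is only an illustration of the rule described above, not the server's implementation; the captures are (timestamp, digest) pairs taken from the archive.org examples in this document.

```python
from itertools import groupby

# (timestamp, digest) pairs in index order.
CAPTURES = [
    ("19970126045828", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY"),
    ("19971011050034", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3"),
    ("19971211122953", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3"),
    ("19971211122953", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3"),
    ("19980109140106", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3"),
]


def collapse(lines, field_index, n=None):
    """Keep the first line of each run of adjacent lines whose field matches.

    With n set, only the first n characters of the field are compared,
    which models collapse=field:N.
    """
    def key(line):
        value = line[field_index]
        return value if n is None else value[:n]

    return [next(group) for _, group in groupby(lines, key=key)]


by_year = collapse(CAPTURES, 0, 4)   # models collapse=timestamp:4
by_digest = collapse(CAPTURES, 1)    # models collapse=digest
print([ts for ts, _ in by_year])
print([ts for ts, _ in by_digest])
```

Because only adjacent lines are compared, a value that reappears later in the index after a different value starts a new run and is kept again.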
Query Result Limits
As the CDX server may return millions or billions of records, it is often necessary to set limits on a single query for practical reasons. The CDX server provides several mechanisms, including the ability to return the last N as well as the first N results.
Advanced Usage
The following features are for more specific/advanced usage of the CDX server.
Resumption Key
There is also a new method that allows the CDX server to specify a 'resumption key' that can be used to continue the query from the previous end. This allows breaking up a large query into smaller queries more efficiently. This can be achieved by using the showResumeKey= and resumeKey= params.
org,archive)/ 19970126045828 http://www.archive.org:80/ text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
org,archive)/ 19971011050034 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402
org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402

org%2Carchive%29%2F+19980109140106%21
JSON example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&output=json
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
 ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
 ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
 ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
 ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
 ["org,archive)/", "19980109140106", "http://archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
 [],
 ["org%2Carchive%29%2F+19980109140106%21"]]
In a subsequent query, adding resumeKey= will resume the search from the next result. No other params from the original query (such as from= or url=) need to be altered. To continue from the previous example, the subsequent query would be:
Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&resumeKey=org%2Carchive%29%2F+19980109140106%21
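A client can follow resumption keys in a loop until the server stops returning one. The sketch below works on the plain text output and assumes that the key, when present, is the last line and is preceded by a blank line, as in the example above. The fetch callable, the canned responses and the final timestamp are hypothetical stand-ins for real HTTP requests.

```python
from urllib.parse import unquote_plus


def split_resume_key(text):
    """Split plain text CDX output into (capture_lines, resume_key or None)."""
    lines = text.rstrip("\n").split("\n")
    if len(lines) >= 2 and lines[-2] == "":
        return lines[:-2], lines[-1]
    return [line for line in lines if line], None


def fetch_all(fetch, params):
    """Yield capture lines across successive queries.

    fetch is a hypothetical callable that takes a dict of query params
    and returns the response body as text.
    """
    params = dict(params, showResumeKey="true")
    while True:
        lines, key = split_resume_key(fetch(params))
        yield from lines
        if key is None:
            return
        # The key is printed url encoded; decode it so that encoding the
        # params again for the next request does not encode it twice.
        params["resumeKey"] = unquote_plus(key)


# Canned two-page response standing in for the server.
PAGE_1 = (
    "org,archive)/ 19971211122953\n"
    "org,archive)/ 19980109140106\n"
    "\n"
    "org%2Carchive%29%2F+19980109140106%21\n"
)
PAGE_2 = "org,archive)/ 19980110000000\n"  # hypothetical later capture


def fake_fetch(params):
    return PAGE_2 if "resumeKey" in params else PAGE_1


captures = list(fetch_all(fake_fetch, {"url": "archive.org", "limit": "2"}))
print(captures)
```

Injecting fetch keeps the paging logic testable without network access; a real client would build the request URL from params and return the decoded response body.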
Counters
There is some work on custom counters to enhance the aggregation capabilities of the CDX server. These features are brand new and should be considered experimental.
Duplicate Counter
While collapsing allows for filtering out adjacent results that are duplicates, it is also possible to track duplicates throughout the cdx using a special new extension. Adding showDupeCount=true adds a new dupecount column to the results.
Skip Counter
It is possible to track how many CDX lines were skipped due to Filtering and Collapsing by adding the special skipcount counter with showSkipCount=true. An optional endtimestamp field can also be added, which prints the timestamp of the last capture, by adding lastSkipTimestamp=true.
Ex: Collapse results by year and print number of additional captures skipped and timestamp of last capture:
http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=timestamp:4&output=json&showSkipCount=true&lastSkipTimestamp=true
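The following is a local model of what the two counters report for the query above, using the archive.org timestamps shown earlier in this document. It is only a sketch of the described behavior: in particular, using None for endtimestamp when nothing was skipped is an assumption, and the server may print a different placeholder.

```python
from itertools import groupby

TIMESTAMPS = [
    "19970126045828",
    "19971011050034",
    "19971211122953",
    "19971211122953",
    "19980109140106",
]


def collapse_with_counters(timestamps, n):
    """Collapse on the first n timestamp characters and count skipped lines.

    Returns (kept_timestamp, skipcount, endtimestamp) tuples, where
    endtimestamp is the last skipped capture, or None if none was skipped
    (the None placeholder is an assumption of this sketch).
    """
    results = []
    for _, group in groupby(timestamps, key=lambda ts: ts[:n]):
        run = list(group)
        skipped = run[1:]
        results.append((run[0], len(skipped), skipped[-1] if skipped else None))
    return results


per_year = collapse_with_counters(TIMESTAMPS, 4)  # models collapse=timestamp:4
for kept, skipcount, endtimestamp in per_year:
    print(kept, skipcount, endtimestamp)
```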
Pagination API
The above resume key allows for sequential querying of CDX data. However, in some cases where very large queries are needed (for example, a domain query), it may be useful to perform queries in parallel and also to estimate the total size of the query.
wayback and the cdx-server also support loading from a 'zipnum' CDX index. This index contains CDX lines stored in concatenated GZIP blocks (usually 3,000 lines each) and a secondary index which supports binary search over the 'zipnum' blocks. By using the secondary index, it is possible to estimate the total size of a query and also to break the query up into smaller pieces. The zipnum format, or another secondary index, is needed to support pagination.
However, pagination can only work on a single index at a time; merging input from multiple sources (plain cdx or zipnum) is not possible. As such, the results from a paginated query may be slightly less up-to-date than a default non-paginated query.
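The way a secondary index enables size estimates and paging can be sketched with a binary search over the first key of each block. The in-memory list of block start keys, the sample keys and the function names below are hypothetical; this illustrates the idea and does not describe the actual zipnum file layout or the server's code.

```python
from bisect import bisect_left

LINES_PER_BLOCK = 3000  # "usually 3,000 lines each", per the text above

# Hypothetical secondary index: the first "urlkey timestamp" of each block,
# sorted in the same order as the CDX lines themselves.
BLOCK_START_KEYS = [
    "com,example)/ 20000101000000",
    "org,archive)/ 19970126045828",
    "org,archive)/about 20010101000000",
    "org,wikipedia)/ 20020101000000",
]


def blocks_for_prefix(block_start_keys, prefix):
    """Return (first, end) block indexes that may hold keys starting with prefix."""
    # Matches may begin inside the block that starts just before the prefix.
    first = max(bisect_left(block_start_keys, prefix) - 1, 0)
    # Any block starting after every possible match cannot contain one.
    end = bisect_left(block_start_keys, prefix + "\uffff")
    return first, end


def split_into_pages(first, end, blocks_per_page):
    """Break a block range into (start, stop) pages that can be read in parallel."""
    return [
        (start, min(start + blocks_per_page, end))
        for start in range(first, end, blocks_per_page)
    ]


first, end = blocks_for_prefix(BLOCK_START_KEYS, "org,archive)/")
estimated_lines = (end - first) * LINES_PER_BLOCK  # upper bound, not exact
pages = split_into_pages(first, end, 2)
print(first, end, estimated_lines, pages)
```

The estimate is an upper bound because the first and last blocks in the range usually contain lines outside the query as well.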
Access Control
The cdx server is designed to improve access to archived data to a broad audience, but it may be necessary to restrict certain parts of the cdx.
The cdx server supports granting permission to access restricted data via an API key that is passed in as a cookie.
Currently two restrictions/permission types are supported:
To allow access, the API key cookie must be explicitly set on the client, e.g.:
curl -H "Cookie: cdx-auth-token=API-Key-Secret" "http://mycdxserver/search/cdx?url=..."
The API-Key-Secret can be set in the cdx server configuration.
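The same cookie can be set from Python. This sketch only builds the request object and does not send it; the host mycdxserver and the key API-Key-Secret are the placeholders from the curl example, and the function name authorized_request is hypothetical.

```python
from urllib.request import Request


def authorized_request(query_url, api_key):
    """Build (without sending) a request carrying the cdx-auth-token cookie."""
    return Request(query_url, headers={"Cookie": "cdx-auth-token=" + api_key})


req = authorized_request(
    "http://mycdxserver/search/cdx?url=archive.org", "API-Key-Secret"
)
print(req.get_header("Cookie"))
```

Passing req to urllib.request.urlopen would perform the actual query against a reachable server.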
CDX Server Configuration
TODO
Start by editing the wayback-cdx-server-servlet.xml file in the WEB-INF directory. Add some valid CDX files to the cdxUris list (file names must end with cdx or cdx.gz).