Memento Tracer: High Quality Web Archiving at Scale
What - The Memento Tracer Framework
The Memento Tracer framework introduces a new collaborative approach to capture web publications for archival purposes. It is inspired by existing capture approaches yet aims for a new balance between the scale at which capturing can be conducted and the quality of the snapshots that result.
Like existing web crawler approaches, Memento Tracer uses server-side processes to capture web publications. As is the case with LOCKSS, these processes leverage the insight that web publications in any given portal are typically based on the same template and hence share features such as layout and interactive affordances. As is the case with webrecorder.io, human guidance helps to achieve high quality captures. But with Memento Tracer, heuristics that apply to an entire class of web publications are recorded, not heuristics for individual web publications. Curators can create these heuristics collaboratively and deposit them in a shared community repository. When the server-side capture processes come across a web publication of a class for which heuristics are available, they can apply them and hence capture faithful snapshots at scale.
Why - Scale and Quality in Web Archiving
In order to archive the essence of a web publication, a range of web resources needs to be captured. Capturing those resources, however, is often not trivial.
Memento Tracer is inspired by all these approaches but aims to strike a new balance between the scale of capturing and the quality of resulting captures.
How - The Three Components of Memento Tracer
The Memento Tracer framework consists of:
- A component, e.g., a browser extension that records Traces; a Trace is a set of instructions for capturing the essence of web publications of a certain class.
- A component, e.g., a shared repository where the community uploads and downloads Traces.
- A component, e.g., a headless browser extension that uses Traces as guidance in the process that navigates and captures web publications.
Step 1 - A Curator Records a Trace Using a Browser Extension
A curator who anticipates the need to collect a significant number of web publications from a certain portal over time navigates to a publication that is representative of a class of publications in that portal. For example, if there is a need to archive SlideShare presentations, the curator might navigate to the landing page for the Creating Pockets of Persistence presentation. On that page, the curator activates the Memento Tracer browser extension and records a Trace for the page by interacting with it. The extension does not record the actual resources or URLs that are traversed by the curator. Rather, the extension's browser event listener captures mouse actions and records them abstractly, in terms that uniquely identify the page elements that are being interacted with, e.g., by means of their class ID. Since all pages of the same class are based on the same template, the resulting Traces apply across all pages of the class rather than to this specific page only. Currently, in addition to recording simple mouse clicks, the extension is able to record, with a single interaction by the curator, the notion of repeated clicks (e.g., navigate through all slides of the presentation) and clicks on all links in a certain user interface component. For example, below is a Trace that results from the curator indicating that the "next slide" button should be clicked repeatedly. Note that the Trace also indicates the URL pattern to which the Trace applies, and provenance information including the resource on which the Trace was created and the user agent used to create it. When the layout and/or affordances for a particular class of web publications change, a new Trace has to be recorded to ensure that captures maintain their high quality.
"value": "div.notranslate.transcript.add-padding-right.j-transcript a",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3417.0 Safari/537.36"
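The ingredients of a Trace described above (a URL pattern, abstract interactions expressed as element selectors, and provenance) can be illustrated with a minimal Python sketch. This is not the actual Trace language, which is still under investigation. Only the "value" and "user_agent" keys come from the fragment above; the class names, the other field names, the action kinds, and the example.org URLs are assumptions made for illustration.

```python
import json
import re
from dataclasses import dataclass, field, asdict


@dataclass
class Action:
    # "value" mirrors the key in the fragment above: a selector that
    # identifies the page element to interact with.
    value: str
    # Hypothetical action kinds: "click", "repeated_click", "click_all_links".
    kind: str = "click"


@dataclass
class Trace:
    # Hypothetical field: regular expression describing the class of pages.
    url_pattern: str
    actions: list = field(default_factory=list)
    # Hypothetical field: the resource on which the Trace was recorded.
    recorded_on: str = ""
    # Mirrors the "user_agent" key in the fragment above.
    user_agent: str = ""

    def applies_to(self, url: str) -> bool:
        """Return True if the URL matches the class this Trace covers."""
        return re.search(self.url_pattern, url) is not None

    def to_json(self) -> str:
        """Serialize the Trace, including nested actions, to JSON."""
        return json.dumps(asdict(self), indent=2)


# Illustrative values only; example.org stands in for a real portal.
trace = Trace(
    url_pattern=r"^https?://www\.example\.org/[^/]+/[^/]+$",
    actions=[Action(value="div.transcript a", kind="click_all_links")],
    recorded_on="https://www.example.org/alice/some-presentation",
    user_agent="ExampleBrowser/1.0",
)
```

Because the Trace stores a pattern and selectors rather than concrete URLs, `trace.applies_to(...)` holds for any page built on the same template, not only for the page on which the Trace was recorded.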
Step 2 - A Curator Uploads the Trace to a Shared Repository
Once a Trace is successfully recorded, the curator uploads it to a shared community repository. This can, for example, be done by means of a pull request to a GitHub repository, which is subsequently evaluated by the maintainers of the repository. The organization of the repository makes it possible to quickly locate Traces for specific classes of pages and by specific curators. Since the perspective on what the essence of a web publication is may differ from one curator to the next, the repository supports multiple Traces for a specific class of pages. Each can be unambiguously identified in the repository. Also, since the layout of pages evolves over time, Traces will need updating. This makes version support by the repository essential.
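The repository requirements listed above (lookup by class of pages and by curator, multiple Traces per class, and versioning) can be sketched as a small in-memory index. This is only an illustration of the requirements, not the design of the actual repository, whose organization is still an open question. The class `TraceRepository`, its method names, and the sample page class and curator names are all hypothetical.

```python
from collections import defaultdict


class TraceRepository:
    """In-memory stand-in for a shared Trace repository (hypothetical API)."""

    def __init__(self):
        # (page_class, curator) -> list of Trace versions, oldest first.
        self._versions = defaultdict(list)

    def upload(self, page_class, curator, trace):
        """Store a new version of a Trace and return its 1-based version."""
        versions = self._versions[(page_class, curator)]
        versions.append(trace)
        return len(versions)

    def get(self, page_class, curator, version=None):
        """Return the requested version, or the latest when version is None."""
        versions = self._versions.get((page_class, curator), [])
        if not versions:
            raise KeyError((page_class, curator))
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError((page_class, curator, version))
        return versions[version - 1]

    def curators_for(self, page_class):
        """List the curators who have contributed Traces for a page class."""
        return sorted(c for (p, c) in self._versions if p == page_class)


repo = TraceRepository()
repo.upload("presentation-landing-page", "alice", {"value": "a.next"})
repo.upload("presentation-landing-page", "bob", {"value": "div.notes a"})
# Alice re-records her Trace after the page layout changes.
repo.upload("presentation-landing-page", "alice", {"value": "button.next"})
```

Keying on the pair of page class and curator keeps each curator's view of a class separately identifiable, and the per-key list preserves older versions so that earlier captures remain explainable.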
Step 3 - An Operator of a Headless Browser Set-Up Uses Traces from the Shared Repository
In order to generate captures of web publications, the Memento Tracer framework assumes a set-up consisting of a web driver (e.g., Selenium) that automates the actions of a headless browser (e.g., PhantomJS), combined with a capturing tool (e.g., WarcProxy) that writes the resources navigated by the headless browser to a WARC file. An operator of this set-up selects Traces for those classes of pages that are frequently crawled and for which high quality snapshots are required. From then onwards, when the fully automated set-up happens upon a web publication that falls under a class of pages for which a Trace is available (e.g., its URL resembles the URL of the web publication for which the Trace was recorded), that Trace will be invoked to guide the capturing process.
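The replay step can be sketched in Python as follows. To keep the example self-contained, a small fake browser stands in for the web driver and headless browser; in a real set-up its `click` method would be backed by the web driver, and a capturing proxy would write the traffic to a WARC file. The `replay` function, the `FakeSlideshow` class, the action dictionary keys other than "value", and the selector `button.next` are assumptions for illustration, not part of Memento Tracer.

```python
class FakeSlideshow:
    """Stand-in for a headless browser showing a deck of `slides` slides."""

    def __init__(self, slides):
        self.slides = slides
        self.current = 1
        self.visited = [1]

    def click(self, selector):
        """Simulate a click; return False when nothing can be clicked."""
        if selector != "button.next" or self.current >= self.slides:
            return False
        self.current += 1
        self.visited.append(self.current)
        return True


def replay(browser, actions, max_repeats=1000):
    """Apply Trace actions to a browser and return the number of clicks made."""
    clicks = 0
    for action in actions:
        if action["kind"] == "click":
            clicks += int(browser.click(action["value"]))
        elif action["kind"] == "repeated_click":
            # Keep clicking until the element can no longer be clicked,
            # with a safety bound so a looping page cannot stall the crawl.
            for _ in range(max_repeats):
                if not browser.click(action["value"]):
                    break
                clicks += 1
        else:
            raise ValueError(f"unknown action kind: {action['kind']}")
    return clicks


browser = FakeSlideshow(slides=5)
total = replay(browser, [{"value": "button.next", "kind": "repeated_click"}])
```

With a five-slide deck, the single "repeated click" instruction makes the browser step through every slide, so all slide resources would pass through the capturing tool without the curator having recorded each click individually.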
When - Current Status
It is hard to say when Memento Tracer will be ready for a test ride, let alone for prime time. The components are currently experimental but we are making promising progress. The process of recording Traces and capturing web publications on the basis of these Traces has been demonstrated successfully for publications in a range of portals. But there also remain challenges that we are investigating, including:
- User interface to support recording Traces for complex sequences of interactions.
- Limitations of the browser event listener approach for recording Traces.
- Language used to express Traces.
- Organization of the shared repository for Traces.
- Selection of a Trace for capturing a web publication in cases where different page layouts and interactive affordances are available for web publications that share a URI pattern.
Last update: May 10 2018