internetarchive
/
cdx-summary
AGPL-3.0 license
ibnesayeed
on Jul 29
CDX Summary
Summarize web archive capture index (CDX) files.
Installation
$ pip install cdxsummary
Alternatively, install from source.
$ python3 setup.py install
To run the tool as a one-off Docker container, build the image as follows. The build sets the cdxsummary executable as the container's entrypoint script.
$ docker image build -t cdxsummary .
$ docker container run -it --rm cdxsummary
Features
Usage
$ cdxsummary --help
usage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]

Summarize web archive capture index (CDX) files.

positional arguments:
  input                 CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')

optional arguments:
  -h, --help            show this help message and exit
  -a [QUERY], --api [QUERY]
                        CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL
  -i, --item            Treat the input argument as a Petabox item identifier instead of a file path
  -j, --json            Generate summary in JSON format
  -l, --load            Load JSON report instead of CDX
  -o [FILE], --out [FILE]
                        Write output to the given file (default: STDOUT)
  -r, --report          Generate non-summarized JSON report
  -s [N], --samples [N]
                        Number of sample memento URLs in summary (default: 10)
  -t [N], --tophosts [N]
                        Number of hosts with maximum captures in summary (default: 10)
  -v, --version         Show version number
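The options listed in the help text can also be assembled from a script. The following is a minimal sketch, not part of cdxsummary itself: `build_command` and `run_summary` are hypothetical helper names, only flags shown in the help text are used, and it assumes that the `cdxsummary` executable is on the `PATH` and accepts the common `--flag=value` spelling for optional values.

```python
import subprocess


def build_command(source, as_json=False, samples=None, tophosts=None, out=None):
    """Assemble a cdxsummary argument list (hypothetical helper)."""
    cmd = ["cdxsummary"]
    if as_json:
        cmd.append("--json")
    # Assumption: the "=" form is accepted, which keeps each optional
    # value attached to its own flag rather than to the positional input.
    if samples is not None:
        cmd.append(f"--samples={samples}")
    if tophosts is not None:
        cmd.append(f"--tophosts={tophosts}")
    if out is not None:
        cmd.append(f"--out={out}")
    cmd.append(source)  # the positional input goes last
    return cmd


def run_summary(source, **options):
    """Run cdxsummary and return whatever it writes to STDOUT."""
    result = subprocess.run(
        build_command(source, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

For example, `run_summary("sample.cdx.gz", samples=5, tophosts=3)` would request a plain text summary with fewer sample URLs and hosts than the defaults of 10.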
Sample Output
Plain Text Summary
$ cdxsummary sample.cdx.gz
JSON Summary
$ cdxsummary --json sample.cdx.gz
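The JSON form is convenient when another program consumes the summary. The sketch below is an illustration only: `parse_summary` and `json_summary` are hypothetical names, the structure of the JSON is not described here, and the code merely assumes that the top level is a JSON object and that `cdxsummary` is on the `PATH`.

```python
import json
import subprocess


def parse_summary(text):
    """Decode the text printed by `cdxsummary --json`."""
    summary = json.loads(text)
    # Assumption: the summary is a JSON object at the top level.
    if not isinstance(summary, dict):
        raise ValueError("expected a JSON object at the top level")
    return summary


def json_summary(path):
    """Run `cdxsummary --json` on a CDX file and return the decoded summary."""
    result = subprocess.run(
        ["cdxsummary", "--json", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_summary(result.stdout)
```

Calling `sorted(json_summary("sample.cdx.gz"))` lists the top-level keys, which is a quick way to discover what the summary contains without relying on any particular field name.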
Testing
An interactive test interface is available for the Web Component that renders the JSON summary.