internetarchive/crawling-for-nomore404 (branch: master)
crawling-for-nomore404/wikipedia/producer/
Name                  Commit time
.gitignore            3 years ago
Dockerfile            2 years ago
README.md             8 years ago
docker-compose.yml    3 years ago
monitor.js            2 years ago
package-lock.json     3 years ago
package.json          3 years ago
process-file.js       2 years ago
processor.js          2 years ago
producer.js           2 years ago
producer.py           5 years ago
run-monitor.sh        5 years ago
setup.py              5 years ago
wikipedias.js         3 years ago
README.md
Vinay Goel (vinay@archive.org)
This project contains scripts to extract links from live Wikipedia edits. It leverages the Wikipedia Live Monitor project (https://github.com/tomayac/wikipedia-irc).
The "monitor" application watches article edits as they occur in real time across all language versions of Wikipedia. If an edit includes URL references, those links are extracted. For "new" articles, the Wikipedia API is used to extract all the external links on the article page.
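The two extraction paths described above can be sketched in Python. This is a minimal illustration, not the code in monitor.js: the function names `extract_urls` and `build_extlinks_url` and the URL regular expression are assumptions made for this example. The second function only builds a request URL for the MediaWiki `action=query` / `prop=extlinks` endpoint; it does not perform the request.

```python
import re
from urllib.parse import urlencode

# Assumption: a simple pattern for http(s) URLs in edit text; the real
# monitor may use a different rule.
URL_PATTERN = re.compile(r"https?://[^\s<>\[\]{}|\"']+")


def extract_urls(edit_text):
    """Return the distinct URLs found in the text of an edit, in order."""
    seen = []
    for match in URL_PATTERN.findall(edit_text):
        url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url and url not in seen:
            seen.append(url)
    return seen


def build_extlinks_url(language, title):
    """Build a MediaWiki API URL listing external links of one page.

    Hypothetical helper: intended for the "new article" case, where all
    external links on the page are wanted rather than those in one edit.
    """
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": "max",
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (language, urlencode(params))
```

For example, `extract_urls` could be applied to the added text of an edit, while `build_extlinks_url("en", "Internet Archive")` gives a URL that could be fetched when the monitor reports a newly created article.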
The project also includes code that pushes the data produced by the monitor into an Apache Kafka distributed queue (using the Kafka Python client: https://github.com/mumrah/kafka-python) and a simple Apache Pig script that generates basic stats from the links data.
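As a rough sketch of the push step, the snippet below serializes one monitor result and hands it to a producer object. It is not the contents of producer.py: the record fields (`wiki`, `title`, `links`), the function names, and the JSON encoding are all assumptions for illustration. The producer is passed in as a parameter and only needs a `send(topic, value=...)` method, so an in-memory stand-in is used here instead of a real broker connection.

```python
import json


def serialize_edit(wiki, title, links):
    """Encode one monitor result as UTF-8 JSON bytes (assumed record format)."""
    record = {"wiki": wiki, "title": title, "links": list(links)}
    return json.dumps(record, sort_keys=True).encode("utf-8")


def publish_links(producer, topic, wiki, title, links):
    """Send the record to the given topic; skip edits that carry no links."""
    if not links:
        return False
    producer.send(topic, value=serialize_edit(wiki, title, links))
    return True


class InMemoryProducer:
    """Stand-in for a Kafka producer, used so the sketch runs without a broker."""

    def __init__(self):
        self.sent = []

    def send(self, topic, value=None):
        self.sent.append((topic, value))
```

With a current release of kafka-python installed, an object such as `KafkaProducer(bootstrap_servers="localhost:9092")` should be usable in place of `InMemoryProducer`; the broker address is a placeholder, and older releases of the client exposed a different producer interface.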