Vinay Goel (vinay@archive.org)
This project contains scripts to extract links from live Wikipedia edits. It leverages the Wikipedia Live Monitor project (https://github.com/tomayac/wikipedia-irc).
The "monitor" application monitors article edits as they occur in real time across all language versions of Wikipedia. If an edit includes URL references, these links are extracted. For "new" articles, the Wikipedia API is used to extract all external links on the article page.
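As a rough illustration of the two extraction paths described above, here is a minimal Python sketch. It is not the monitor's actual code: the function names, the URL regular expression, and the trailing-punctuation trimming are assumptions made for this example. The only external interface it relies on is the MediaWiki `action=query&prop=extlinks` API, whose JSON response it parses without making any network call.

```python
import re
from urllib.parse import urlencode

# Assumed pattern: an http(s) URL runs until whitespace or common wiki markup.
URL_PATTERN = re.compile(r"https?://[^\s\]\[|<>{}\"]+")


def extract_urls(wikitext):
    """Return the distinct http(s) URLs found in edit text, in first-seen order."""
    seen = []
    for match in URL_PATTERN.findall(wikitext):
        # Trim punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:)")
        if url not in seen:
            seen.append(url)
    return seen


def extlinks_request_url(language, title):
    """Build a MediaWiki API URL that lists the external links of one article."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": "max",
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (language, urlencode(params))


def parse_extlinks(response):
    """Collect the link URLs from a decoded extlinks API response."""
    pages = response.get("query", {}).get("pages", {})
    # The default JSON format keys pages by id; formatversion=2 returns a list.
    page_iter = pages.values() if isinstance(pages, dict) else pages
    links = []
    for page in page_iter:
        for link in page.get("extlinks", []):
            # The URL is stored under "*" by default and "url" in formatversion=2.
            url = link.get("url") or link.get("*")
            if url:
                links.append(url)
    return links
```

In this sketch, `extract_urls` would be applied to the text of an ordinary edit, while `extlinks_request_url` and `parse_extlinks` cover the "new" article case. Note that trimming a trailing `)` can truncate URLs that legitimately end with a parenthesis.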
The project also includes code that pushes the data produced by the monitor into an Apache Kafka distributed queue (using the Kafka Python client: https://github.com/mumrah/kafka-python) and a simple Apache Pig script to generate basic stats from the links data.
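The following sketch shows one way extracted links could be pushed to Kafka from Python. It is an illustration under stated assumptions, not the producer shipped in this repository: the record fields (`language`, `article`, `url`), the JSON encoding, and the helper names are hypothetical. It uses the `KafkaProducer` class available in current kafka-python releases; the code in this project may rely on an older client interface.

```python
import json


def encode_link_record(language, article, url):
    """Serialize one extracted link as UTF-8 JSON (assumed record layout)."""
    record = {"language": language, "article": article, "url": url}
    return json.dumps(record, sort_keys=True).encode("utf-8")


def publish_links(producer, topic, language, article, urls):
    """Send one message per URL and return the number of messages sent.

    `producer` is any object with a send(topic, value) method, such as a
    kafka-python KafkaProducer.
    """
    count = 0
    for url in urls:
        producer.send(topic, encode_link_record(language, article, url))
        count += 1
    return count


def make_kafka_producer(bootstrap_servers):
    """Create a real producer; requires the kafka-python package."""
    from kafka import KafkaProducer  # imported lazily so the helpers above work without it

    return KafkaProducer(bootstrap_servers=bootstrap_servers)
```

A caller might create a producer with `make_kafka_producer("localhost:9092")`, pass it to `publish_links` with a topic name of its choosing, and call `producer.flush()` before exiting. The broker address here is a placeholder.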