internetarchive/crawling-for-nomore404
on Mar 13, 2020
crawling-for-nomore404
Crawling-related code for no-more-404s projects.
This repository contains multiple projects that are mostly independent of each other. Here is a summary of each project. See the README in each project's subdirectory for more details.
wikipedia
This project scrapes the Wikipedia IRC channel for updated articles, extracts newly added citations, and feeds those URLs for crawling. The scraper and the crawl scheduler communicate through Kafka messaging, so other apps can also read a feed of new citations as well as the original IRC notifications.
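The "extract newly added citations" step can be illustrated with a minimal sketch. It is not the project's actual code: the function names `extract_urls` and `new_citation_urls` and the URL pattern are hypothetical, and it assumes that the old and new wikitext of an article revision are already available as strings. Publishing the result to Kafka is left out.

```python
import re

# Hypothetical pattern (an assumption, not taken from the project):
# match http(s) URLs up to whitespace or common wikitext delimiters.
URL_RE = re.compile(r'https?://[^\s\[\]|}<>"]+')


def extract_urls(wikitext):
    """Return the URLs found in wikitext, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(wikitext):
        # Drop trailing punctuation that usually belongs to the sentence.
        url = match.group(0).rstrip(".,;")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def new_citation_urls(old_text, new_text):
    """Return URLs present in the new revision but absent from the old one."""
    old = set(extract_urls(old_text))
    return [url for url in extract_urls(new_text) if url not in old]


if __name__ == "__main__":
    old = "See [http://example.com/a A]."
    new = ("See [http://example.com/a A] and "
           "<ref>{{cite web|url=https://example.org/b|title=B}}</ref>.")
    for url in new_citation_urls(old, new):
        print(url)
```

Comparing URL sets rather than diffing the raw text keeps the sketch short, at the cost of ignoring a URL that was removed and re-added elsewhere in the same edit.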
wordpress
This project reads WordPress's official blog update stream and schedules the permalink URL of each new post for crawling. It is implemented as a single application at the moment.
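As a rough illustration of pulling permalinks out of an update stream, the sketch below parses a chunk of Atom-formatted XML. The stream format is not described here, so Atom is an assumption, and the function name `permalinks` is hypothetical. Actual scheduling of the crawl is not shown.

```python
import xml.etree.ElementTree as ET

# Assumption: entries arrive as Atom XML; this is the standard Atom namespace.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def permalinks(feed_xml):
    """Return one permalink URL per Atom entry found in feed_xml."""
    root = ET.fromstring(feed_xml)
    links = []
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            # In Atom, a link without a rel attribute counts as "alternate".
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
                break  # keep only the first permalink of each entry
    return links


if __name__ == "__main__":
    sample = (
        '<feed xmlns="http://www.w3.org/2005/Atom">'
        '<entry><link rel="alternate" '
        'href="https://example.wordpress.com/2020/03/13/hello/"/></entry>'
        '</feed>'
    )
    for url in permalinks(sample):
        print(url)
```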