Internet Archive Blogs
A blog from the team at archive.org
Defining Web pages, Web sites and Web captures
Posted on October 23, 2016 by Vinay Goel

The Internet Archive has been archiving the web for 20 years and has preserved billions of webpages from millions of websites. These webpages are often made up of, and link to, many images, videos, style sheets, scripts and other web objects. Over the years, the Archive has saved over 510 billion such time-stamped web objects, which we term web captures.
We define a webpage as a valid web capture that is an HTML document, a plain text document, or a PDF.
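As a rough illustration of this definition, the sketch below classifies a capture by its MIME type. The `Capture` record, its field names, the MIME type set, and the choice to treat an HTTP 200 response as "valid" are all assumptions made for the example; they do not describe the Archive's actual pipeline.

```python
from dataclasses import dataclass

# Assumed MIME types for the three document kinds named in the definition:
# HTML documents, plain text documents, and PDFs.
WEBPAGE_MIME_TYPES = {"text/html", "text/plain", "application/pdf"}


@dataclass(frozen=True)
class Capture:
    """Hypothetical record for one time-stamped web capture."""
    url: str
    timestamp: str   # e.g. "20161023000000"
    status: int      # HTTP status code recorded at capture time
    mime_type: str   # Content-Type value, possibly with parameters


def is_webpage(capture: Capture) -> bool:
    """Return True if the capture counts as a webpage under the assumptions above."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_type = capture.mime_type.split(";", 1)[0].strip().lower()
    # Assumption: "valid" is approximated here as an HTTP 200 response.
    return capture.status == 200 and base_type in WEBPAGE_MIME_TYPES
```

Under these assumptions an HTML capture with status 200 counts as a webpage, while an image or a 404 response does not.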
A domain on the web is an owned section of the internet namespace, such as google.com, archive.org or bbc.co.uk. A host on the web is identified by a fully qualified domain name (FQDN), which specifies its exact location in the tree hierarchy of the Domain Name System. An FQDN consists of two parts: a hostname and a domain name. For example, in the case of the host blog.archive.org, the hostname is blog and the host is located within the domain archive.org.
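The split between hostname and domain can be sketched in a few lines. The function name `split_fqdn` and the three-entry suffix set are assumptions for illustration only; a real system would consult the full Public Suffix List, since a domain such as bbc.co.uk cannot be found by simply taking the last two labels.

```python
from __future__ import annotations

# Tiny illustrative suffix set (an assumption for this sketch);
# a real system would load the complete Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def split_fqdn(fqdn: str) -> tuple[str, str]:
    """Split an FQDN into (hostname, domain).

    The domain is the public suffix plus one label to its left;
    the hostname is whatever remains (empty if the host is the domain itself).
    """
    labels = fqdn.rstrip(".").lower().split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before a shorter suffix could be.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                raise ValueError(f"{fqdn!r} is a bare public suffix")
            domain = ".".join(labels[i - 1:])
            hostname = ".".join(labels[:i - 1])
            return hostname, domain
    raise ValueError(f"no known public suffix in {fqdn!r}")
```

For the example in the text, `split_fqdn("blog.archive.org")` yields the hostname `blog` and the domain `archive.org`.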
We define a website to be a host that has served webpages and has at least one incoming link from a webpage belonging to a different domain.
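The website definition amounts to a filter over a link graph, which the sketch below tries to make concrete. The function `find_websites`, its inputs (a set of hosts known to have served webpages, a host-to-domain lookup, and a list of host-level links), and the sample data are hypothetical; they illustrate the rule and say nothing about how the Archive computes its counts.

```python
from __future__ import annotations

from collections.abc import Iterable


def find_websites(
    hosts_with_webpages: set[str],
    host_domain: dict[str, str],
    links: Iterable[tuple[str, str]],
) -> set[str]:
    """Return hosts that served webpages and have at least one
    incoming link from a webpage on a different domain.

    Each link is (source_host, target_host), where source_host is the
    host of the linking webpage.
    """
    websites: set[str] = set()
    for source_host, target_host in links:
        # The target must itself have served at least one webpage.
        if target_host not in hosts_with_webpages:
            continue
        # Links from within the same domain do not count.
        if host_domain[source_host] != host_domain[target_host]:
            websites.add(target_host)
    return websites


# Hypothetical sample data.
hosts = {"archive.org", "blog.archive.org", "example.com"}
domains = {
    "archive.org": "archive.org",
    "blog.archive.org": "archive.org",
    "example.com": "example.com",
}
sample_links = [
    ("archive.org", "blog.archive.org"),  # same domain: ignored
    ("example.com", "archive.org"),       # cross-domain: counts
]
sample_websites = find_websites(hosts, domains, sample_links)
```

In this sample only archive.org qualifies: blog.archive.org is linked solely from within its own domain, and example.com has no incoming links at all.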
As of today, the Internet Archive officially holds 273 billion webpages from over 361 million websites, taking up 15 petabytes of storage.
Posted in Announcements, News, Wayback Machine - Web Archive

About Vinay Goel
Web Search & Data Mining Lead, Senior Data Engineer