The evolution of web archiving

International Journal on Digital Libraries

Abstract

Web archives preserve information published on the web or digitized from printed publications. Much of this information is unique and historically valuable. However, the lack of knowledge about the global status of web archiving initiatives hampers their improvement and collaboration. To overcome this problem, we conducted two surveys, in 2010 and 2014, which provide a comprehensive characterization of web archiving initiatives and their evolution. We identified several patterns and trends that highlight challenges and opportunities. We discuss these patterns and trends, which make it possible to define strategies, estimate resources and provide guidelines for the research and development of better technology. Our results show that in recent years there was significant growth in the number of initiatives and of countries hosting them, in the volume of data and in the number of contents preserved. While this indicates that the web archiving community is dedicating a growing effort to preserving digital information, other results presented throughout the paper raise concerns, such as the small amount of archived data in comparison with the amount of data that is being published online.

Acknowledgments

This work could not have been done without the support of the Portuguese Web Archive team. We also thank FCT for the financial support of the Research Units of LaSIGE (PEst-OE/EEI/UI0408/2014) and INESC-ID (UID/CEC/50021/2013), and the DataStorm Research Line of Excellency (EXCL/EEI-ESS/0257/2012).

Author information

Corresponding author

Correspondence to Miguel Costa.

About this article

Cite this article

Costa, M., Gomes, D. & Silva, M.J. The evolution of web archiving. Int J Digit Libr 18, 191–205 (2017). https://doi.org/10.1007/s00799-016-0171-9

  • DOI: https://doi.org/10.1007/s00799-016-0171-9
