AnswerCult

Question

256 views · December 31, 2023

Question 100.55K July 28, 2023 0 Comments

How exactly does the Wayback Machine know (decide?) which webpages to scan and at what times? Why are some pages archived while others not (excluding those with things like personal info, etc…)?


2 Answers

Answer 1 · 2023-07-28T22:39:45+00:00

The Wayback Machine does not scan pages itself. People go along and enter a URL for the Wayback Machine to archive. Wikipedia editors, for instance, will sometimes use the Wayback Machine when they want to cite a webpage as a source but suspect that the page will not stay up for very long.
You can enter a URL into the site to see whether it has been archived. You can also create an archived version of that page yourself.
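For anyone who would rather script those two steps than use the site's form, here is a rough sketch. It assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and the "save" URL prefix (`https://web.archive.org/save/`) behave as I understand them; the helper names and the sample response are made up for illustration, and nothing here touches the network.

```python
import json
from urllib.parse import urlencode

# Assumed endpoints; check the Internet Archive's own documentation before relying on them.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
SAVE_PREFIX = "https://web.archive.org/save/"


def availability_query(page_url: str) -> str:
    """Build the URL that asks whether page_url has an archived snapshot."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def save_request_url(page_url: str) -> str:
    """Build the URL that, when visited, requests a fresh capture of page_url."""
    return SAVE_PREFIX + page_url


def closest_snapshot(response_text: str):
    """Return the snapshot URL from an availability response, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hypothetical response body, shaped the way the availability endpoint is assumed to reply.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

print(availability_query("example.com"))
print(save_request_url("https://example.com/"))
print(closest_snapshot(SAMPLE_RESPONSE))
```

To actually run the check you would fetch the first URL (for example with `urllib.request.urlopen`) and pass the response body to `closest_snapshot`.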

Answer 2 · 2023-07-29T00:07:09+00:00

They use crawlers that index pages and follow the links within those pages to find other pages, then repeat the process with those pages. This is the same thing that search engines do. There’s no centralized list of every site on the internet: if I create a webpage today, nobody else knows it exists. So if a page isn’t linked to from anywhere, the crawler won’t find it.
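To make the "follow links, then repeat" idea concrete, here is a toy sketch of a breadth-first crawl. It is not the Internet Archive's actual crawler; the `crawl` function, the `TOY_WEB` link graph, and the `.example` hostnames are all invented for illustration, and link extraction is faked with a dictionary lookup instead of real HTTP requests.

```python
from collections import deque


def crawl(seeds, get_links):
    """Visit pages breadth-first, starting from seeds and following links.

    get_links(url) must return the list of URLs that url links to.
    Returns the pages in the order they were visited.
    """
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # only queue pages not discovered before
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical miniature web: each page maps to the pages it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "orphan.example": ["a.example"],  # links out, but nothing links to it
}

found = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
print(found)
```

Starting from `a.example`, the crawl reaches `b.example` and `c.example` but never `orphan.example`, because no page links to it. That is the "nobody else knows it exists" case.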

You can submit URLs to be archived manually if there’s a site you know is missing.

Also, website operators can ask for their site not to be crawled by publishing a file called robots.txt, and I think the Internet Archive does respect that: if the website owner doesn’t want the site crawled, it won’t be in the archive.
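As a rough illustration of how a well-behaved crawler reads robots.txt, here is a sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `allowed` helper, and the `ia_archiver` user-agent name are assumptions for the example; whether a given crawler honors the file is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/page"))
print(allowed(ROBOTS_TXT, "othercrawler", "https://example.com/page"))
print(allowed(ROBOTS_TXT, "othercrawler", "https://example.com/private/x"))
```

A crawler that respects the file would run a check like this before each fetch and skip any URL for which it returns False.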
