How exactly does the Wayback Machine know (decide?) which webpages to scan and at what times? Why are some pages archived while others are not (excluding those with things like personal info, etc.)?




2 Answers

Anonymous

The Wayback Machine is not limited to pages it finds by itself. Anyone can enter a URL for it to archive. Wikipedia editors, for instance, will sometimes use the Wayback Machine if they want to cite a webpage as a source but suspect that the page will not stay up for very long.
You can enter a URL into the site to see whether it has been archived, and you can also create an archived copy of that page yourself.
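If you'd rather check from a script than from the site, here is a minimal sketch. It assumes the Internet Archive's public availability endpoint at `https://archive.org/wayback/available` and the JSON shape it is documented to return (`archived_snapshots` → `closest`). The function names are my own, not part of any official client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the query URL that asks whether page_url has a snapshot."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(payload):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timeout=10):
    """Query the API over the network (not called automatically here)."""
    with urlopen(availability_url(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling `lookup("example.com")` should give you either a snapshot URL or `None` if nothing has been archived yet.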

Anonymous

The Internet Archive uses crawlers that index pages and follow the links within those pages to find other pages, then repeat the process with those pages. This is the same thing that search engines do. There’s no centralized list of every site on the internet: if I create a webpage today, nobody else knows it exists. So if a page isn’t linked to from anywhere, it won’t be found.
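To make the "follow links, then repeat" idea concrete, here is a toy breadth-first crawler. It is only a sketch of the general technique, not the Internet Archive's actual crawler. The `PAGES` dictionary is a made-up stand-in for the web, and `fetch` is a hypothetical function that returns a page's HTML or `None`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, limit=100):
    """Visit pages breadth-first from seed, following links as they appear."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page does not exist or could not be fetched
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Made-up miniature web: nothing links to orphan.example.
PAGES = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": "no links here",
    "http://orphan.example/": '<a href="http://a.example/">A</a>',
}

archived = crawl("http://a.example/", PAGES.get)
```

Here `archived` ends up with the three pages reachable from the seed. `http://orphan.example/` is never visited because no crawled page links to it, even though it links out to others.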

You can submit URLs to be archived manually if there’s a site you know is missing.

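Also, website operators can ask for their site not to be crawled with a file called robots.txt. I think the Internet Archive has historically respected that, so a site whose owner opted out this way would usually not be in the archive. As far as I know, though, it has become less strict about this over time, so robots.txt is not a guarantee.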
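For a rough idea of how a well-behaved crawler reads robots.txt, here is a small sketch using Python's standard `urllib.robotparser`. The rules and the `example-archiver` user agent are invented for illustration. This shows the generic mechanism and says nothing about how the Internet Archive's crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: block one crawler entirely, and keep
# everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed(ROBOTS_TXT, "example-archiver", "http://site.example/page")` is `False`, while a different crawler is allowed on the same page and blocked only under `/private/`. The file is just a request: following it is up to the crawler.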