How exactly does the Wayback Machine decide which webpages to scan, and when? Why are some pages archived while others are not (excluding those with things like personal info, etc.)?


2 Answers

Anonymous

They use crawlers that index pages and follow the links on those pages to find other pages, then repeat the process with those pages. This is the same thing that search engines do. There’s no centralized list of every site on the internet: if I create a webpage today, nobody else knows it exists. So if a page isn’t linked to from anywhere, it won’t be found.
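To make the "follow links, then repeat" idea concrete, here is a minimal sketch in Python. It is not the Internet Archive's actual crawler. The tiny in-memory `FAKE_WEB` dictionary, the example URLs, and the `crawl` and `get_links` names are all made up for illustration, and a real crawler would download and parse pages instead of looking them up in a dictionary.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://a.example/": ["https://a.example/about", "https://b.example/"],
    "https://a.example/about": ["https://a.example/"],
    "https://b.example/": ["https://c.example/"],
    "https://c.example/": [],
    "https://orphan.example/": ["https://a.example/"],  # no page links TO this one
}


def get_links(url):
    """Stand-in for 'download the page and extract its links'."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, get_links):
    """Breadth-first crawl: visit each page once, queueing newly seen links."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # skip pages that are already queued or visited
                seen.add(link)
                queue.append(link)
    return visited


found = crawl(["https://a.example/"], get_links)
for url in found:
    print(url)
```

Starting from a.example, the crawl reaches every page except orphan.example, because nothing links to it. That is the "won't be found" case.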

You can submit URLs to be archived manually if there’s a site you know is missing.
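As a rough sketch of what manual submission looks like: as far as I know, the Wayback Machine's "Save Page Now" feature can be triggered by opening a web.archive.org/save/ address followed by the page's URL, and there is an availability lookup at archive.org/wayback/available. The helper names below are my own, and the code only builds the addresses (it makes no network request), so treat the exact URL formats as assumptions to check against the Internet Archive's documentation.

```python
from urllib.parse import urlencode

# Assumed endpoints; verify against the Internet Archive's own documentation.
SAVE_PREFIX = "https://web.archive.org/save/"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def save_page_url(page_url):
    """Address to open (in a browser or HTTP client) to request a capture."""
    return SAVE_PREFIX + page_url


def availability_url(page_url):
    """Address that reports whether a snapshot of page_url already exists."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


print(save_page_url("https://example.com/page"))
print(availability_url("example.com/page"))
```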

Also, website operators can ask for their site not to be crawled by publishing a file called robots.txt. I think the Internet Archive has generally respected that, so if the website owner doesn’t want the site crawled, it usually won’t be in the archive. Its policy on this has loosened in recent years, though, so it isn’t a guarantee.
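For a sense of how a well-behaved crawler reads robots.txt, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the example.com URLs, and the `allowed` helper are invented for illustration. The `ia_archiver` user-agent name is my assumption for the archive's crawler, so check the current name before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the (assumed) archive crawler everywhere,
# and block every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/index.html"))
print(allowed(ROBOTS_TXT, "otherbot", "https://example.com/index.html"))
print(allowed(ROBOTS_TXT, "otherbot", "https://example.com/private/page"))
```

Note that robots.txt is only a request. Nothing technically stops a crawler from ignoring it, which is why it comes down to each crawler's policy.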
