AnswerCult

Question

262 views · December 31, 2023

Question · July 28, 2023

How exactly does the Wayback Machine know (or decide) which webpages to scan, and when? Why are some pages archived while others are not (excluding those with things like personal info, etc.)?


2 Answers


Answer 1 · 2023-07-29T00:07:09+00:00

They use crawlers that index pages and follow the links within those pages to find other pages, then repeat the process with those pages. This is the same thing search engines do. There’s no centralized list of every site on the internet: if I create a webpage today, nobody else knows it exists. So if a page isn’t linked to from anywhere, it won’t be found.
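
To make the "follow links, then repeat" idea concrete, here is a minimal sketch of a breadth-first crawl. It is not the Internet Archive's actual crawler. The `crawl` function, the `LinkExtractor` class, and the `FAKE_WEB` dictionary are all made up for illustration, and the "web" here is just an in-memory dict so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from `seeds`.

    `fetch(url)` is a caller-supplied function (hypothetical here) that
    returns the page's HTML as a string, or None if the page is missing.
    """
    seen = set(seeds)      # every URL discovered so far
    queue = deque(seeds)   # URLs waiting to be fetched
    visited = []           # URLs actually fetched, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny pretend web: /orphan exists but nothing links to it.
FAKE_WEB = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post</a>',
    "https://example.com/blog/post-1": "<p>No links here.</p>",
    "https://example.com/orphan": "<p>Nothing links to me.</p>",
}

found = crawl(["https://example.com/"], FAKE_WEB.get)
print(found)
```

Starting from the home page, the crawl reaches every linked page but never sees `/orphan`, which is the situation described above: a page nobody links to is invisible to a crawler.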

You can submit URLs to be archived manually if there’s a site you know is missing.
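
For manual submission, the Wayback Machine's "Save Page Now" feature can be reached by putting the page address after `https://web.archive.org/save/`. The helper below is only a sketch that builds that address. `save_page_now_url` is a name invented for this example, and whether a capture actually happens is up to the service (it may rate-limit or refuse some pages).

```python
from urllib.parse import quote

# Public "Save Page Now" prefix on web.archive.org.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(page_url):
    """Return the address you would open to request a capture of page_url.

    Characters that are not valid in a URL (such as spaces) are
    percent-encoded; the usual URL punctuation is left alone.
    """
    return SAVE_PREFIX + quote(page_url, safe=":/?&=%")


print(save_page_now_url("https://example.com/my page.html"))
```

Opening the printed address in a browser is the same as pasting the URL into the Save Page Now form on the Wayback Machine's site.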

Also, website operators can ask for their site not to be crawled by publishing a file called robots.txt. I think the Internet Archive has generally respected that, so if the website owner doesn’t want the site crawled, it usually won’t be in the archive, although as far as I know its policy on this has loosened in recent years.
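
Here is a rough sketch of how a well-behaved crawler might check robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` are made up, `is_allowed` is a helper invented for this example, and `ia_archiver` is used as the crawler name on the assumption that it is the user agent historically associated with the Internet Archive.

```python
from urllib.robotparser import RobotFileParser

# Example rules (invented): block the archive crawler entirely,
# and keep every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("ia_archiver", "SomeOtherBot"):
    for url in ("https://example.com/page", "https://example.com/private/x"):
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Note that robots.txt is only a request. Nothing technically stops a crawler from ignoring it, so whether a site ends up archived depends on whether the crawler chooses to honor the file.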
