How do web crawlers and search engines avoid constantly getting infected with viruses?


By constantly downloading random information from the internet, wouldn’t you be exposing yourself to tons of malicious content? Aren’t there pages that can run malware without you even clicking on anything?

A better example than search engines might be something like the Wayback Machine, a site that actually saves the pages and not just links to them.


6 Answers

Anonymous 0 Comments

When you click on a link, the browser loads a lot of stuff, including page layout information (HTML & CSS), references to images, and JavaScript code, which is like a program that does things. Normally that code talks to the website's server to fetch information from databases and pass along things like passwords and emails, but it can be written to do bad things instead.

Search engines don’t do that. The contents of the page are downloaded and scanned for text, links, and images, but no code is run. It’s like the difference between looking at directions on a map, and actually following those directions.
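
As a rough illustration of "downloaded and scanned, but no code is run", here is a minimal sketch using only Python's standard library. The class name `LinkAndTextExtractor`, the `extract` helper, and the sample page are made up for this example, and real crawlers are far more elaborate. The point is only that the script is treated as inert text and skipped.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collects links and visible text. Script contents are skipped, never run."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies arrive here as plain strings; we just drop them.
        if self._skip_depth == 0 and data.strip():
            self.text.append(data.strip())


def extract(html):
    """Hypothetical helper: parse a page and return (links, text)."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.text


# Made-up sample page containing a script that a browser would execute.
page = """<html><body>
<h1>Hello</h1>
<script>alert('this would run in a browser');</script>
<a href="https://example.com/next">Next page</a>
</body></html>"""

links, text = extract(page)
print(links)
print(text)
```

The parser only ever sees the script as a string of characters, which is the "looking at the map" half of the analogy.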

Anonymous 0 Comments

Think of it like the difference between photocopying a book and reading one. Your web browser reads the page code and interprets it. Crawlers and things like the Wayback Machine just copy the page code, or specific bits of it.

Anonymous 0 Comments

Because they just read and write, and don’t execute.
Kinda like copypasta

You can read a manual on how to hurt yourself physically without being harmed. Only acting on it would hurt you.

Anonymous 0 Comments

No, not really. Modern browsers are pretty resilient: they generally don't trust the code on the page, and they limit what it can do. Loopholes still happen, but they get patched quickly. This is the first line of defense.

Then, they run the crawler code on a restricted user account, so the operating system will refuse any access to system files. That’s the second line.

Finally, if the malicious code somehow finds a loophole in the browser, AND THEN a loophole in the OS, it gets to live, but only until the next system wipe.

Anonymous 0 Comments

ELI5: You can pretty easily tell if something is a book, right? Say you are looking for something to read. You pick something up. Is it a book? No? Toss it. Yes? Read it.

Search engines do the same with everything they process. Malware usually isn't the webpage itself; it's a separate executable that the page tries to get downloaded. So any time the crawler fetches something, it asks "is it a webpage?" No? Toss it. Yes? Process it, then find everything it links to, and repeat.
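
The "is it a webpage?" check could look something like the sketch below. It assumes the crawler decides based on the `Content-Type` header the server sent, which is only one possible approach (servers can mislabel content, so a real crawler would likely check more than this). The function names `is_webpage` and `triage` and the sample URLs are hypothetical.

```python
def is_webpage(content_type):
    """Return True only for HTML-like Content-Type header values."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def triage(responses):
    """Split (url, content_type) pairs into pages to process and items to toss."""
    keep, toss = [], []
    for url, content_type in responses:
        (keep if is_webpage(content_type) else toss).append(url)
    return keep, toss


# Made-up responses: one page, one executable, one with no header at all.
sample = [
    ("https://example.com/", "text/html; charset=utf-8"),
    ("https://example.com/setup.exe", "application/octet-stream"),
    ("https://example.com/unknown", None),
]

keep, toss = triage(sample)
print(keep)
print(toss)
```

Anything that lands in `toss` is never opened or run, which is the "No? Toss it" step.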

Anonymous 0 Comments

a) zero-day exploits really aren't that common anymore. Most viruses require a human to manually start them; just visiting a website and clicking links won't do it

b) most crawlers aren’t actually “looking” at most of the content, so they’d just move around the virus without actually being affected by it

c) any exploit would likely be targeted against common browsers – the environment of the crawler would be different and the exploit/virus likely wouldn’t work there, unless specifically targeting the crawler (and targeting the crawler is hard, because unlike the browser, it’s not public so you can’t easily test your attack)

d) if the operators have any common sense, the crawlers are running inside a sandbox, so exploiting the crawler achieves nothing, and the sandbox is automatically destroyed and recreated from a clean version on a regular basis

e) targeting crawlers specifically would be a dangerous game: due to the sandboxing it’s not too valuable, but you’re exposing your (valuable) zero day to an environment that could be tightly monitored. If you get caught, your zero day will be fixed and become worthless.
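
To make the "destroyed and recreated" part of point d) a bit more concrete, here is a toy sketch of the throwaway lifecycle. To be clear, a temporary directory is NOT a sandbox: a real setup would use something like a container or virtual machine, and nothing here isolates anything. The function name `crawl_batch_in_scratch_dir` and the sample data are invented for illustration.

```python
import os
import tempfile


def crawl_batch_in_scratch_dir(pages):
    """Store one batch of downloaded pages in a throwaway directory.

    `pages` maps a hypothetical file name to the raw bytes that were fetched.
    Returns the scratch path and how many files were written to it.
    """
    with tempfile.TemporaryDirectory(prefix="crawl-") as scratch:
        for name, body in pages.items():
            # Bytes are written as-is; nothing is interpreted or executed.
            with open(os.path.join(scratch, name), "wb") as fh:
                fh.write(body)
        saved = len(os.listdir(scratch))
    # Leaving the `with` block deletes the directory and everything in it,
    # so the next batch starts from a clean slate.
    return scratch, saved


path, count = crawl_batch_in_scratch_dir(
    {"a.html": b"<p>hi</p>", "b.html": b"<p>yo</p>"}
)
print(count, os.path.exists(path))
```

Whatever ended up in the workspace during one batch is gone before the next one starts, which is the same idea as wiping the sandbox and restoring a clean copy.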