[ELI5] What is data scraping and why is it bad?

6 Answers

Anonymous 0 Comments

I’m pretty sure you’re asking with respect to Twitter/Reddit. They don’t like it for a slightly different reason than the current good answer by DeHackEd.

Reddit and Twitter show ads interspersed with the conversations people are having. Those ads are how they make money (technically, in both cases, it’s how they lose less money). The amount advertisers are willing to pay depends on how many real customers they believe will see the ads.

So that’s part of why they want to charge for access to their sites’ “API”. That’s how people write programs to get at the site’s data without going through a web page. This is a little faster but, more importantly, doesn’t come with ads since it’s not a human staring at a web page. Advertisers don’t want to pay to send ad banners to bots.

But a “scraper” doesn’t use the API. It’s a program that tries its best to look like a human using a browser to ask for a web page, then it “scrapes” the HTML of that page to find the data. Since it isn’t using the API, it won’t be paying money. But since it’s also not a human, it’s not really looking at the ads.
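To make the API-versus-scraper difference concrete, here's a rough sketch using only Python's standard library. The HTML, the `comment` class name, and the JSON shape are all made up for illustration; real sites use different markup and real scrapers fetch pages over the network, which this skips.

```python
import json
from html.parser import HTMLParser

# Hypothetical page markup; real sites use different, frequently changing HTML.
PAGE_HTML = """
<html><body>
  <div class="ad">Buy our shoes!</div>
  <div class="comment">First comment</div>
  <div class="comment">Second comment</div>
</body></html>
"""

# Hypothetical API response carrying the same data, with no ads attached.
API_JSON = '{"comments": ["First comment", "Second comment"]}'


class CommentScraper(HTMLParser):
    """Collects the text inside <div class="comment"> elements."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "div" and ("class", "comment") in attrs:
            self.in_comment = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_comment = False

    def handle_data(self, data):
        if self.in_comment and data.strip():
            self.comments.append(data.strip())


def scrape_comments(html):
    """The scraper route: dig the data out of a page meant for humans."""
    parser = CommentScraper()
    parser.feed(html)
    return parser.comments


def api_comments(body):
    """The API route: the data arrives already structured."""
    return json.loads(body)["comments"]


print(scrape_comments(PAGE_HTML))
print(api_comments(API_JSON))
```

Both routes end up with the same comments. The scraper just had to download the ad along with everything else and throw it away, which is exactly the "impression nobody saw" problem.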

If the advertisers think there are a lot of scrapers, they’re going to be upset. Every time an ad is served counts as “an impression” and they pay based on how many of those happen. But if they think a big percentage of those “impressions” are just scrapers, they’re going to demand to pay less because they’re not reaching as many people as the numbers imply. Depending on how ads are served, the advertisers might have a way to use their own anti-scraping code to try and figure out how many views aren’t “real”.
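Here's the arithmetic behind that complaint as a tiny sketch. The $5.00 price per thousand impressions and the 20% scraper share are invented numbers, and the function names are just mine; real ad pricing is more complicated than this.

```python
def effective_cpm(cpm_paid, bot_fraction):
    """Cost per 1,000 impressions that actually reached a human."""
    if not 0 <= bot_fraction < 1:
        raise ValueError("bot_fraction must be in [0, 1)")
    return cpm_paid / (1 - bot_fraction)


def fair_cpm(cpm_paid, bot_fraction):
    """Price an advertiser might argue for if only human views count."""
    if not 0 <= bot_fraction < 1:
        raise ValueError("bot_fraction must be in [0, 1)")
    return cpm_paid * (1 - bot_fraction)


# Made-up numbers: $5.00 per 1,000 impressions, 20% of them from scrapers.
print(round(effective_cpm(5.00, 0.20), 2))
print(round(fair_cpm(5.00, 0.20), 2))
```

With those made-up numbers the advertiser is really paying about $6.25 per thousand human views, and would push to pay about $4.00 instead. The more scrapers they suspect, the bigger that gap gets.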

So scrapers access the data of Reddit/Twitter but do not represent ad revenue. If there are a lot of them, they can also drive down how much people are willing to pay for ads on the platform.

One thing Elon Musk got right is that for public-facing sites that do not require login, it’s VERY difficult to detect and stop scrapers. A login matters because it gives the site a more reliable way to look at activity than just the IP address, and they might decide an account that visits 100 pages in 5 minutes is most definitely suspicious and suspend it. This is still a cat and mouse game, as it’s pretty easy for scraper-people to get a lot of accounts created quickly. I was reading about shops where people made cents per Facebook account they created, and they went as far as installing new SIMs into burner phones so they could set up 2FA and seem more legit. There is a LOT of money in doing nasty things with large numbers of accounts on social media, from misinformation to ad fraud.
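The "100 pages in 5 minutes" check could look something like this per-account sliding window. It's only a sketch of the idea: the `RateMonitor` class is hypothetical, the threshold comes straight from the example above, and real sites combine many more signals than page counts.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # the 5-minute window from the example above
MAX_PAGES = 100          # page views in that window that look suspicious


class RateMonitor:
    """Tracks recent page views per account (hypothetical helper)."""

    def __init__(self, max_pages=MAX_PAGES, window=WINDOW_SECONDS):
        self.max_pages = max_pages
        self.window = window
        self.visits = defaultdict(deque)  # account id -> recent timestamps

    def record(self, account_id, timestamp):
        """Record a page view; return True if the account looks suspicious."""
        recent = self.visits[account_id]
        recent.append(timestamp)
        # Forget views that have fallen out of the window.
        while recent and recent[0] <= timestamp - self.window:
            recent.popleft()
        return len(recent) >= self.max_pages


monitor = RateMonitor()

# A scraper account loading one page per second for 100 seconds.
bot_flags = [monitor.record("bot", t) for t in range(100)]

# A person loading one page per minute for 10 minutes.
human_flags = [monitor.record("human", t * 60) for t in range(10)]

print(bot_flags[-1], any(human_flags))
```

This only works because every view is tied to an account. Without a login the only key left is something like the IP address, and spreading requests over many addresses or many cheap accounts keeps each one under the limit, which is the cat and mouse part.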

It’s just real suspicious that even though he claimed bots were a problem a year ago, *suddenly* they were so bad they broke his site the same day the contract expired for a bill he didn’t pay.
