[ELI5] What is data scraping and why is it bad?

6 Answers

Anonymous 0 Comments

Data scraping is simply finding data on the internet, whatever it is, that was intended for human consumption, and having a bot (i.e., software) collect it instead. “Data” could be actual tables of information like weather reports, or it could be news articles on the front page of the New York Times.

Why is it bad? Well, this isn’t the intended use of the information on the website. There are “better” ways for people to get the data, and maybe those methods are intended to be behind some kind of paywall. The human-intended version of a web page probably serves ads, which the bot won’t load or display. Bot traffic also throws off the site’s visitor statistics, because the bot’s page loads get counted alongside those of real humans.

It’s one of those weird edge cases whose morality is questionable. The information is generally available, but it wasn’t intended to be slurped up by a bot at high speed, and people are disturbed by that for several reasons: bots load far more data than humans do, the scraper may be trying to avoid the paywall, and the scraper may be collecting the data to build its own database for its own purposes.

Anonymous 0 Comments

Data scraping is basically using a piece of software to read the contents of a website and store them in a database. A scraper could load the Reddit front page, then go into each post on the front page, read all the comments, and store the contents of those comments, including the user ID of the person who wrote each one, in a database.
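To make the "read the page, store it in a database" idea concrete, here is a minimal sketch using only Python's standard library. The HTML, the `comment` class, and the `data-user` attribute are made up for illustration; a real site's markup will differ, and this version assumes comments are not nested inside each other.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page markup; a real site's HTML will look different.
SAMPLE_PAGE = """
<html><body>
  <div class="comment" data-user="alice">Scraping is just automated reading.</div>
  <div class="comment" data-user="bob">Search engines do this all day.</div>
  <div class="sidebar">Not a comment</div>
</body></html>
"""


class CommentParser(HTMLParser):
    """Collects (user, text) pairs from flat <div class="comment"> elements."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._user = None   # set only while we are inside a comment div
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "comment":
            self._user = attrs.get("data-user")
            self._chunks = []

    def handle_data(self, data):
        if self._user is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._user is not None:
            self.comments.append((self._user, "".join(self._chunks).strip()))
            self._user = None


def scrape_to_db(html, conn):
    """Parse comments out of the HTML and store them; return how many were stored."""
    parser = CommentParser()
    parser.feed(html)
    conn.execute("CREATE TABLE IF NOT EXISTS comments (user_id TEXT, body TEXT)")
    conn.executemany("INSERT INTO comments VALUES (?, ?)", parser.comments)
    conn.commit()
    return len(parser.comments)


conn = sqlite3.connect(":memory:")
stored = scrape_to_db(SAMPLE_PAGE, conn)
rows = conn.execute("SELECT user_id, body FROM comments ORDER BY user_id").fetchall()
print(stored, rows)
```

A real scraper would download the page first and repeat this for every post, but the parse-then-store loop is the core of it.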

There isn’t anything inherently “bad” about data scraping as a technique. It’s at the root of how pretty much all Internet search engines operate, for example. The concern comes from the potential privacy implications. For example, someone could analyze writing styles to match pseudonyms on Reddit with real people, deanonymize them, and build in-depth profiles to sell to scammers, marketers, etc.

Anonymous 0 Comments

Big tech companies don’t like data scraping because it’s a direct competitor to what they do behind the scenes. Facebook, Google, Twitter, and just about any free technology is scraping your data to make money.

Anonymous 0 Comments

I’m pretty sure you’re asking with respect to Twitter/Reddit. They don’t like it for a slightly different reason than the current good answer by DeHackEd.

Reddit and Twitter show ads interspersed with the conversations people are having. Those ads are how they make money (technically, in both cases it’s how they lose less money). The amount advertisers are willing to pay depends on how many real customers they believe will see the ads.

So that’s part of why they want to charge for access to their sites’ “API”. That’s how people write programs to get at the site’s data without going through a web page. This is a little faster but, more importantly, doesn’t come with ads since it’s not a human staring at a web page. Advertisers don’t want to pay to send ad banners to bots.

But a “scraper” doesn’t use the API. It’s a program that tries its best to look like a human using a browser to ask for a web page, then it “scrapes” the HTML of the web page to try to find the data. Since this isn’t using the API, it won’t be paying money. But since it’s also not a human, it’s not really looking at the ads.
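One common way a program "looks like a browser" is by sending the same request headers a browser would. The sketch below only builds the requests and never sends them; the URL and header values are placeholders chosen for illustration, not anything a particular scraper or site actually uses.

```python
import urllib.request

# Hypothetical URL and header values, for illustration only.
URL = "https://example.com/some-forum"
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, browser_like):
    """Build (but do not send) a GET request, optionally with browser-style headers."""
    headers = BROWSER_HEADERS if browser_like else {}
    return urllib.request.Request(url, headers=headers)


plain = build_request(URL, browser_like=False)
disguised = build_request(URL, browser_like=True)

# urllib stores header names in capitalized form, hence "User-agent".
print(plain.get_header("User-agent"))      # nothing set on the request itself
print(disguised.get_header("User-agent"))  # the browser-style string
```

From the server's side, the second request is much harder to tell apart from a person's browser, which is why sites fall back on other signals such as logins and request rates.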

If the advertisers think there are a lot of scrapers, they’re going to be upset. Every time an ad is served counts as “an impression” and they pay based on how many of those happen. But if they think a big percentage of those “impressions” are just scrapers, they’re going to demand that they shouldn’t pay so much because they’re not reaching as many people as the numbers imply. Depending on how ads are served, the advertisers might have a way to use their own anti-scraping code to try and figure out how many views aren’t “real”.

So scrapers access the data of Reddit/Twitter but do not represent ad revenue. If there are a lot of them, they can also drive down how much people are willing to pay for ads on the platform.
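The advertiser's complaint comes down to simple arithmetic. Here is a small sketch with made-up numbers (the price and bot share are assumptions, not real figures from any platform): if some share of impressions are bots, the price per thousand human impressions is higher than the sticker price.

```python
def effective_cpm(cpm, bot_share):
    """Cost per 1,000 *human* impressions when bot_share of all impressions are bots."""
    if not 0 <= bot_share < 1:
        raise ValueError("bot_share must be in [0, 1)")
    return cpm / (1 - bot_share)


# Hypothetical: the advertiser pays $6 per 1,000 impressions, and 25% are scrapers.
price_per_thousand_humans = effective_cpm(6.0, 0.25)
print(price_per_thousand_humans)  # 8.0
```

In this example the advertiser is really paying $8 to reach a thousand people, so they would push for a lower rate.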

One thing Elon Musk got right is that for public-facing sites that do not require login, it’s VERY difficult to detect and stop scrapers. A login is important because it gives the site a more reliable way to look at activity than just the IP address, and the site might decide an account that visits 100 pages in 5 minutes is most definitely suspicious and suspend it. This is still a cat and mouse game, as it’s pretty easy for scraper-people to get a lot of accounts created quickly. I was reading about shops where people made cents per Facebook account they created, and they went as far as installing new SIMs into burner phones so they could set up 2FA and seem more legit. There is a LOT of money in doing nasty things with large numbers of accounts on social media, from misinformation to ad fraud.
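The "100 pages in 5 minutes" rule can be sketched as a sliding-window counter per account. This is only an illustration of the idea: the class name, thresholds, and timestamps are assumptions, and real platforms combine many more signals than a single counter.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Flags accounts that view too many pages within a sliding time window."""

    def __init__(self, max_requests=100, window_seconds=300):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # account id -> timestamps of recent views

    def record(self, account_id, timestamp):
        """Record one page view; return True if the account now looks suspicious."""
        hits = self._hits[account_id]
        hits.append(timestamp)
        # Drop views that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) >= self.max_requests


watcher = RateWatcher()

# A bot loading a page every 2 seconds vs. a human loading one every 30 seconds.
bot_flags = [watcher.record("bot", t * 2) for t in range(120)]
human_flags = [watcher.record("human", t * 30) for t in range(120)]

print(bot_flags.index(True))  # 99: flagged on its 100th page view
print(any(human_flags))       # False: never flagged
```

Keying the counter on an account rather than an IP address is what the login buys the site, since many scrapers can share or rotate IP addresses.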

It’s just real suspicious that even though he claimed bots were a problem a year ago, *suddenly* they were so bad they broke his site the same day the contract expired for a bill he didn’t pay.

Anonymous 0 Comments

Most of the time when you want to gather information from a website to use in a coding project, say, if you wanted to display the top 10 posts from your business’s Facebook page on your website, you would get that information directly from Facebook using their API.

However, if a website doesn’t make the information you want available through an API, or if you’re trying to gather mass amounts of data from all over the web, you can go the higher-effort route and straight up write a program to load the webpage just like your browser would, then look at the page and extract all the text/images/other stuff you want. That is data scraping, and big companies like Reddit don’t like it because it means people are able to get around paying to use their API.
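The difference between the two routes is easiest to see side by side. In this sketch both the JSON "API response" and the HTML page are invented for illustration and do not reflect Facebook's or Reddit's real formats.

```python
import json
import re

# Hypothetical API response: structured and meant for programs.
API_RESPONSE = (
    '{"posts": [{"title": "Grand opening", "likes": 12},'
    ' {"title": "Summer sale", "likes": 30}]}'
)

# Hypothetical HTML for the same posts: built for eyes, not programs.
HTML_PAGE = """
<div class="post"><h2>Grand opening</h2><span class="likes">12 likes</span></div>
<div class="post"><h2>Summer sale</h2><span class="likes">30 likes</span></div>
"""


def titles_from_api(raw_json):
    """The API route: parse the JSON and read the field you want."""
    return [post["title"] for post in json.loads(raw_json)["posts"]]


def titles_from_html(html):
    """The scraping route: dig the same data out of the page markup.

    Fragile by nature: it breaks as soon as the site changes its HTML.
    """
    return re.findall(r"<h2>(.*?)</h2>", html)


print(titles_from_api(API_RESPONSE))
print(titles_from_html(HTML_PAGE))
```

Both functions return the same titles, but only the first relies on something the site has promised to keep stable, which is part of why scraping is the higher-effort route.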

Data scraping on its own isn’t bad, but it’s the main technique that’s been used to gather the huge quantities of data that are needed to train AI models like ChatGPT and Midjourney. Data scraping bots don’t exactly stop to check the copyright license of whatever text/images/other stuff they are gathering, and it’s very difficult to go back and check once a piece of data has been separated from the context it was scraped from. As a result, AI companies that use them have been getting away with borderline copyright infringement under the excuse that “they cannot possibly know whether a piece of content is copyrighted or not.”

Anonymous 0 Comments

*I posted the below as a response to another comment that has since been deleted, so if the context seems weird that’s why. But essentially data scraping is the act of programmatically obtaining publicly-accessible content from a website that was intended for human consumption, and the below describes why it’s bad within the context of recent changes on several social media platforms (not every use of data scraping falls under this description).*

It’s essentially intellectual property theft. Hosting a website costs money. Aside from the technical infrastructure costs of literally hosting the servers, a massive company like Twitter or Reddit needs a whole suite of employees to manage everything from legal compliance and moderation to accounting, project management, user psychology, etc.

For an automated program to take that content, consuming resources without contributing anything of value or even being a valid advertising target the way a regular human user would be, costs the platform money. Allowing it at all is merely a balancing act of ensuring your platform is useful and open enough to sustain a large userbase, while also ensuring the cost doesn’t outweigh the benefit.

The problem is that AI companies are taking this content without permission (although that’s a legal grey area that may or may not breach the terms of service) and applying it for their own benefit in ways that compete directly with the platform and will eventually kill it off. The platforms are handling this poorly because the usage and awareness of Large Language Models has boomed suddenly in the past 6-12 months. There’s a lot of uncertainty and we don’t know many of the potential dangers, but platforms like Twitter and Reddit know it’s not going to end well for them.

You can argue all day about whether Reddit is charging a fair price for their API compared to what it physically costs to make the request, or whether Twitter’s post reading limits make sense, and that’s a fair argument for use cases like regular user behaviour or third party apps.

But the threat these platforms are trying to prevent, AI/LLMs, is a bigger existential problem than just the cost of sending some data. Why they have gone about restricting it so poorly, in a way that doesn’t target this threat specifically, I don’t know. I mean, for the short term you could easily have a ToS that just forbids content being used to train any AI, and it would work fine until they come up with a better solution. But instead they’ve made sweeping changes that affect a far larger audience and that only seem to damage their reputation and hasten their demise.
