[ELI5] What is data scraping and why is it bad?

6 Answers

Anonymous 0 Comments

*I originally posted this as a reply to a comment that has since been deleted, so that’s why the context may seem odd. Essentially, data scraping is the act of programmatically obtaining publicly accessible content from a website that was intended for human consumption. The below describes why it’s bad in the context of recent changes on several social media platforms (not every use of data scraping falls under this description).*
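To make "programmatically obtaining content intended for humans" a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the page markup, the `title` class name, and the `TitleScraper` / `scrape_titles` names are all made up for this example, and it parses a hard-coded string rather than fetching anything from a real site.

```python
from html.parser import HTMLParser

# Hypothetical markup, standing in for a page that a browser would
# normally render for a human reader. Not taken from any real site.
SAMPLE_PAGE = """
<html><body>
  <h1>Front page</h1>
  <div class="post"><h2 class="title">First post</h2><p>Hello world</p></div>
  <div class="post"><h2 class="title">Second post</h2><p>More text</p></div>
</body></html>
"""


class TitleScraper(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching <h2>.
        if self.in_title:
            self.titles[-1] += data


def scrape_titles(html):
    """Return the post titles found in the given HTML string."""
    parser = TitleScraper()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


print(scrape_titles(SAMPLE_PAGE))
```

A real scraper would first download the page over HTTP and then run something like this across thousands of pages, which is where the resource cost to the platform comes from.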

It’s essentially intellectual property theft. Hosting a website costs money. Aside from the technical infrastructure costs of literally hosting the servers, a massive company like Twitter or Reddit needs a whole suite of employees to manage everything from legal compliance and moderation to accounting, project management, user psychology, etc.

When an automated program takes that content, it consumes resources without contributing anything of value, and it isn’t even a valid target for advertising the way a regular human user would be. That costs the platform money. Allowing it at all is a balancing act: the platform has to be useful and open enough to sustain a large userbase, while making sure the cost doesn’t outweigh the benefit.

The problem is that AI companies are taking this content without permission (although it’s a legal grey area whether that breaches the terms of service) and applying it for their own benefit in ways that directly compete with the platform and will eventually kill it off. The platforms are handling this poorly because the usage and knowledge of Large Language Models has boomed suddenly in the past 6-12 months. There’s a lot of uncertainty and we don’t know a lot of the potential dangers, but platforms like Twitter and Reddit know it’s not going to end well for them.

You can argue all day about whether Reddit is charging a fair price for its API compared to what a request physically costs to serve, or whether Twitter’s post reading limits make sense, and that’s a fair argument for use cases like regular user behaviour or third-party apps.

But the threat these platforms are trying to prevent, AI/LLMs, is a bigger existential problem than just the cost of sending some data. Why they have gone about restricting it so poorly, in a way that doesn’t target this threat specifically, I don’t know. For the short term you could easily have a ToS that just forbids content being used to train any AI, and it would work fine until they come up with a better solution. Instead they’ve made sweeping changes that affect a far larger audience, which only seems to damage their reputation and hasten their demise.
