[ELI5] What is data scraping and why is it bad?

6 Answers

Anonymous 0 Comments

Most of the time, when you want to gather information from a website for a coding project (say, to display the top 10 posts from your business’s Facebook page on your own website), you would get that information directly from Facebook through their API.

However, if a website doesn’t make the information you want available through an API, or if you’re trying to gather massive amounts of data from all over the web, you can go the higher-effort route and write a program that loads the webpage just like your browser would, then looks at the page and extracts all the text, images, and other stuff you want. That is data scraping, and big companies like Reddit don’t like it because it lets people get around paying to use their API.
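To make the "look at the page and extract stuff" step concrete, here is a minimal sketch in Python using only the standard library. The names `PageScraper`, `scrape`, and `SAMPLE_PAGE` are made up for illustration, and the page is a hard-coded string instead of a real download, so treat it as a toy rather than how any particular scraper actually works.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collects visible text and image URLs from an HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.images = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is not script or stylesheet code.
        if text and self._skip_depth == 0:
            self.texts.append(text)


def scrape(html):
    """Return (list of text snippets, list of image URLs) found in the HTML."""
    parser = PageScraper()
    parser.feed(html)
    parser.close()
    return parser.texts, parser.images


# Hypothetical page content standing in for a downloaded webpage.
SAMPLE_PAGE = """
<html><body>
  <h1>Top posts</h1>
  <p>First post</p>
  <img src="/img/cat.png" alt="a cat">
  <script>var tracking = 1;</script>
</body></html>
"""

texts, images = scrape(SAMPLE_PAGE)
print(texts)   # the heading and paragraph text
print(images)  # the image URL
```

A real scraper would first download the HTML (for example with `urllib.request.urlopen`) and then run the same kind of extraction over thousands or millions of pages.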

Data scraping on its own isn’t bad, but it’s the main technique that’s been used to gather the huge quantities of data needed to train AI models like ChatGPT and Midjourney. Scraping bots don’t stop to check the copyright license of the text, images, or other stuff they gather, and it’s very difficult to go back and check once a piece of data has been separated from the context it was scraped from. As a result, AI companies that use them have been getting away with borderline copyright infringement under the excuse that “they cannot possibly know whether a piece of content is copyrighted or not.”
