[ELI5] What is data scraping and why is it bad?
Data scraping is simply finding data – no matter what it is – out on the internet that was intended for human consumption, and having a bot (i.e., software) collect it instead. “Data” could be actual tables of information like weather reports, or it could be news articles on the front page of the New York Times.
Why is it bad? Well, this isn’t the intended use of the information on the website. There are “better” ways for people to get the data, and maybe those methods are meant to sit behind some kind of paywall. The human-intended version of a web page probably serves ads, which the bot won’t load or display. And bot traffic gets counted in the site’s visitor statistics, throwing them off compared to real human traffic.
It’s one of those weird edge cases whose morality is questionable. The information is generally available, but it wasn’t intended to be slurped up by a bot at high speed, and that disturbs people for several reasons: bots load more data than humans do, you may be trying to avoid a paywall, and you’re collecting data to build your own database for your own purposes.
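For a concrete picture, here’s roughly what such a bot looks like in Python, using the common requests and BeautifulSoup libraries. This is a minimal sketch: the URL and the CSS selector are made-up placeholders, since every real site’s HTML is laid out differently:

```
# A minimal scraper sketch, assuming the requests and beautifulsoup4
# packages are installed. The URL and CSS selector are placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/news")  # fetch the raw HTML
soup = BeautifulSoup(resp.text, "html.parser")   # parse it into a tree

# Pull out the text of every headline-style link and print it.
for headline in soup.select("h2 a"):
    print(headline.get_text(strip=True))
```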
Data scraping is basically using a piece of software to read the contents of a website and store them in a database. A scraper could load the Reddit front page, then go into each post on the front page, read all the comments, and store the contents of those comments, including the userid of the person who wrote each one, in a database.
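The “store it in a database” step is just as simple. Here’s a minimal sketch using Python’s built-in sqlite3 module; the table layout and the records are invented for illustration:

```
# Sketch: persisting scraped records into SQLite. The rows here are
# hard-coded stand-ins for whatever a crawler would actually collect.
import sqlite3

conn = sqlite3.connect("scraped.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS comments (post_id TEXT, userid TEXT, body TEXT)"
)

# In a real scraper these rows would come from the parsed HTML.
records = [
    ("post123", "some_user", "First comment text"),
    ("post123", "other_user", "A reply to it"),
]
conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", records)
conn.commit()
conn.close()
```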
There isn’t anything inherently “bad” about data scraping as a technique. It’s at the root of how pretty much all Internet search engines operate, for example. Where the concern comes in is the potential privacy implications. For example, someone could analyze writing styles, match up pseudonyms on Reddit with real people, and use that to deanonymize them and build up in-depth profiles they can sell to scammers, marketers, etc.
Big tech companies don’t like data scraping because it’s a direct competitor to what they do behind the scenes. Facebook, Google, Twitter, and just about every other free technology are scraping your data to make money.
I’m pretty sure you’re asking with respect to Twitter/Reddit. They don’t like it for a slightly different reason than the current good answer by DeHackEd.
Reddit and Twitter show ads interspersed with the conversations people are having. Those ads are how they make money (technically, in both cases, it’s how they lose less money). The amount advertisers are willing to pay depends on how many real customers they believe will see the ads.
So that’s part of why they want to charge for access to their sites’ “API”. That’s how people write programs to get at the site’s data without going through a web page. This is a little faster but, more importantly, it doesn’t come with ads, since it’s not a human staring at a web page. Advertisers don’t want to pay to send ad banners to bots.
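As a sketch of what API access looks like, here’s a request to a purely hypothetical JSON endpoint (the URL, token, and field names are all invented for illustration):

```
# Sketch: fetching data through a (hypothetical) JSON API instead of a
# web page. What comes back is structured data with no HTML and no ads.
import requests

resp = requests.get(
    "https://api.example.com/v1/posts",            # hypothetical endpoint
    headers={"Authorization": "Bearer MY_TOKEN"},  # paid API access is keyed
)
for post in resp.json():
    print(post["title"], post["author"])
```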
But a “scraper” doesn’t use the API. It’s a program that tries its best to look like a human using a browser to ask for a web page, then it “scrapes” the HTML of the web page to find the data. Since it isn’t using the API, it isn’t paying money. But since it’s not a human either, it’s not really looking at the ads.
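“Trying to look like a human using a browser” often comes down to little more than sending the request headers a real browser would send. A sketch:

```
# Sketch: a scraper disguising itself by sending browser-like headers.
# HTTP libraries announce themselves by default (e.g. "python-requests"),
# which is trivial for a site to block.
import requests

browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}
resp = requests.get("https://example.com", headers=browser_headers)
print(resp.status_code)
```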
If the advertisers think there are a lot of scrapers, they’re going to be upset. Every time an ad is served counts as an “impression”, and they pay based on how many of those happen. But if they think a big percentage of those “impressions” are just scrapers, they’re going to demand to pay less, because they’re not reaching as many people as the numbers imply. Depending on how ads are served, the advertisers might have a way to use their own anti-scraping code to try to figure out how many views aren’t “real”.
So scrapers access the data of Reddit/Twitter but do not represent ad revenue. If there are a lot of them, they can also drive down how much people are willing to pay for ads on the platform.
One thing Elon Musk got right is that for public-facing sites that don’t require a login, it’s VERY difficult to detect and stop scrapers. The reason a login is important is that it gives the site a more reliable way to look at activity than just the IP address, and it might decide an account that visits 100 pages in 5 minutes is most definitely suspicious and suspend it. This is still a cat-and-mouse game, as it’s pretty easy for scraper operators to get a lot of accounts created quickly. I was reading about shops where people made cents per Facebook account they created, and they went as far as installing new SIMs in burner phones so they could set up 2FA and seem more legit. There is a LOT of money in doing nasty things with large numbers of accounts on social media, from misinformation to ad fraud.
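That “100 pages in 5 minutes” check is easy to express in code. Here’s a minimal sliding-window sketch of how a site might flag accounts, with the thresholds taken straight from the example above:

```
# Sketch: flagging accounts that request too many pages in a short
# window, using "100 pages in 5 minutes" as the rule of thumb.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60
MAX_REQUESTS = 100
history = defaultdict(deque)  # account id -> timestamps of recent requests

def is_suspicious(account_id: str) -> bool:
    now = time.monotonic()
    seen = history[account_id]
    seen.append(now)
    # Drop timestamps that have fallen out of the 5-minute window.
    while seen and now - seen[0] > WINDOW_SECONDS:
        seen.popleft()
    return len(seen) > MAX_REQUESTS
```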
It’s just real suspicious that even though he claimed bots were a problem a year ago, *suddenly* they were so bad they broke his site the same day the contract expired for a bill he didn’t pay.
Most of the time, when you want to gather information from a website to use in a coding project (say, displaying the top 10 posts from your business’s Facebook page on your own website), you would get that information directly from Facebook using their API.
However, if a website doesn’t make the information you want available through an API, or if you’re trying to gather mass amounts of data from all over the web, you can go the higher-effort route and straight up write a program that loads the web page just like your browser would, then looks at the page and extracts all the text/images/other stuff you want. That is data scraping, and big companies like Reddit don’t like it because it means people can get around paying to use their API.
Data scraping on its own isn’t bad, but it’s the main technique that’s been used to gather the huge quantities of data that are needed to train AI models like ChatGPT and Midjourney. Since data scraping bots don’t exactly stop to check the copyright license of whatever text/images/other stuff they’re gathering, and it’s very difficult to check later once a piece of data has been separated from the original context it was scraped from, AI companies that use them have been getting away with borderline copyright infringement under the excuse that “they cannot possibly know whether a piece of content is copyrighted or not.”
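There’s no license field in HTML for a bot to read. The closest machine-readable signal a well-behaved scraper can check is robots.txt, and even that only governs crawl permission, not copyright, which underscores the point. A sketch using Python’s standard library:

```
# Sketch: consulting robots.txt before fetching. Note this governs
# crawling etiquette only; it says nothing about copyright.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("MyScraperBot", "https://example.com/some/page"):
    print("robots.txt allows crawling this page")
else:
    print("robots.txt disallows crawling this page")
```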