Five Things to Know About Web Data Extraction Now
Data is key to competitive advantage. If you’re a regular reader of this publication, you won’t dispute that. The question then becomes: Where do you get your data? While the option to simply buy external data from data marketplaces is growing in popularity, you can also harvest your own data from the Web using a variety of methods.
The Web data extraction business has come a long way from its early days. In fact, you might be surprised at how common and sophisticated the practices have become. Whether it’s monitoring the prices of goods on Walmart.com or tracking consumer sentiment on product reviews, the use cases for scraped data are almost as varied as the data.
Here are five things you may not know about the current state of Web data extraction:
1. Web Scraping Is Growing in Popularity
Everybody gets a little bit of data from the Web. But when we’re talking about Web data extraction (commonly known as Web scraping), we’re not talking about cutting and pasting some product information into a Google Sheet. We’re referring to large-scale, programmatic efforts to harvest data from the Web. In effect, Web scraping tools become the API for accessing data from the Internet.
Web scraping has grown both in the number of organizations doing it and the scale of the scraping projects those organizations undertake, says Shane Evans, the CEO of Zyte, a provider of open source and proprietary Web scraping offerings.
“I think in recent years we’ve seen this done on a much, much larger scale,” Evans says. “And I’d say that probably every large, data-driven company is using Web data in some way.”
Evans cited a recent report by Opimas that forecast the US market for Web scraping tools would grow by 23% in 2022, reaching $5 billion in revenue. The most common use case for Web scraping is capturing pricing, as dynamic pricing has come to dominate ecommerce, according to the Opimas survey. That’s followed by economic and investment research, sentiment analysis, brand protection, sales and marketing, and account aggregation.
As organizations get more mature with their Web scraping initiatives, the appetite for fresh Web data seems to grow, says Evans, who delivered a keynote address at the Web Data Extraction Summit 2022 last month.
“What we’re seeing more recently is a lot of these large companies are really doubling down on their capability to get real-time insights, or get up-to-date insights on large volumes of data,” he tells Datanami. “They’re able to distill all this Web data into insights in ways they weren’t able to before, and their appetite for Web data has gone way up.”
2. Tools Growing in Sophistication
There’s a wide range of tools available to the prospective Web scraper. Finding the right one for you will require you to assess your needs, as well as your budget.
On the one hand, there are free and open source tools such as Scrapy, the Python-based Web crawling framework that Evans created in 2007 and released as open source a couple of years later (Zyte itself was formerly known as Scrapinghub). Scrapy serves more than 8 billion data requests per month, Evans says.
Other open source Web scraping tools include Java-based tools like Heritrix and Web-Harvest; Python-based MechanicalSoup; Apify SDK, built in JavaScript; and Apache Nutch, the Java-based scraper that was the predecessor to Apache Hadoop.
Evans likens Scrapy to a Web application framework that already has most of the software infrastructure required to embark upon a Web scraping project. Users can schedule the scraping processes to run using an open source scheduler like cron; the data is typically delivered in JSON.
“This takes care of all the plumbing, all the infrastructure you need, and you basically just write a little bit of specific code that you need to understand the content of a specific website,” the Irishman says. “It will manage all of these. It will manage the out-of-the-box integration with third-party systems. And a lot of the common tasks that you need are taken care of. It’s really very popular.”
As users’ needs get more sophisticated, they can move up to a hosted offering, such as Zyte’s Scrapy Cloud, which is neither free nor open source. Paid offerings like this provide simpler configuration and monitoring capabilities. Websites often block or ban Web crawlers and spiders, so enterprise solutions typically include techniques for getting around those impediments.
Zyte’s Scrapy Cloud also uses machine learning techniques to bring more automation to industrial-scale Web scraping operations, with the extracted data stored in its Hadoop cluster. However, there’s a tradeoff in quality that customers typically make when going from deterministic, hard-coded Web scraping approaches to probabilistic, ML-based Web scraping. But at the upper echelons of data extraction volume, it’s the only way to obtain the data.
3. Scraped Data May Be Protected
It’s tempting to view the Internet as one giant pool of actionable data that’s just waiting for some enterprising organization to harvest and monetize to its heart’s delight. But the truth is that, while the data may exist naturally in the clear, you don’t have the legal right to do anything you want with it.
“A lot of people believe that because data is public, it’s fine to scrape,” Evans says. “But it’s still subject to GDPR. That means it may not necessarily be okay. You have to have a lawful basis for scraping it.”
Most of the time, when Zyte conducts an assessment to determine the legality of a prospective Web scraping contract, it concludes that it would not be able to do the work in accordance with GDPR. Only occasionally does the company actually help customers scrape data when the data is subject to GDPR, Evans says.
“None of our own extraction APIs will support any of this [PII] information unless we go through a legal check on the use case,” he says. “If we’re going to do something like sentiment analysis, usually it would be on anonymized data.”
Similarly, social media websites are also mostly off limits to Web data scraping outfits. However, the greater clarity on the legality of what you can and cannot do with data may actually be driving growth in Web scraping, Evans says.
“Many years ago, people would have worried a bit more about compliance, because it was just less clear how to interpret legislation that was never written with this in mind,” he says. “Some of it wasn’t even written with the Internet in mind, for Web data extraction, and where some of the boundaries around what’s okay and what’s not OK are. But that’s become much clearer in recent years.”
4. Cleanliness and Reliability Are Big Issues
The quality of data is a major impediment for AI and machine learning projects. It turns out data cleanliness and reliability are big stumbling blocks for participants in the Web scraping business, too.
“Quality and reliability of data would be two of the most important things for our customers,” Evans says. “Clearly, if you’re using it for machine learning or using it to base decisions on, you want it to be correct. But equally, if you’re making decisions on an ongoing basis, actually reliably doing Web data [extraction] at scale can be challenging.”
When an organization scrapes a Web page, it typically takes the whole page, as opposed to extracting individual fields from it. Depending on the type of Web page, the customer will create or select a schema for that page type (articles, products, events, or jobs, for example) with 20 or 30 pre-determined fields. JSON, with its nested structure, works well for storing the extracted page and its individual data elements.
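To make the schema idea concrete, a single extracted record might look like the following. This is a hypothetical illustration; the field names are invented for the example and do not reflect any particular vendor’s schema:

```python
import json

# Hypothetical "product" record: page-level metadata at the top,
# with the schema-defined fields nested underneath.
record = {
    "url": "https://example.com/item/123",
    "pageType": "product",
    "scrapedAt": "2022-10-01T06:00:00Z",
    "product": {
        "name": "Example Widget",
        "price": {"amount": 19.99, "currency": "USD"},
        "availability": "InStock",
        "reviews": [
            {"rating": 5, "text": "Works great"},
        ],
    },
}

# JSON's nesting lets one record carry both the page context and
# every individual field pulled out of it.
print(json.dumps(record, indent=2))
```

Nested structures like the `price` object and the `reviews` list are why flat formats such as CSV tend to be a poor fit for this kind of extraction output.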
But as anybody who has perused the Web for more than a day knows, websites are constantly changing. Keeping up with these changes is one of the challenges that Web scrapers face.
Dynamic websites pose a similar dilemma. The Web isn’t just a bunch of static content anymore, so in some cases Zyte’s crawlers will fill in forms, for example to determine the delivery time for a food delivery service.
“It can be very error-prone, if you’re not careful,” Evans says of Web scraping. “It needs constant monitoring and quality [checks] in place to ensure that it stays reliable and stays accurate.”
5. Web Data Can Provide a Strategic Advantage
As third-party cookies go away and competition heats up in the data marketplace, the option of harvesting one’s own data will undoubtedly appeal to a growing audience.
“It’s become more of a strategic priority for many companies. It’s getting a bit more visibility, especially in the larger companies among more senior people, around the C-suite or even the board,” Evans says. “The ability to gather this data and actually derive some of these insights is becoming a competitive advantage.”
The bulk of the spending on Web extraction projects currently goes to internal efforts, according to the Opimas report. However, spending on external solutions is growing at more than twice the rate of internal spending, with 40% year-over-year growth in external Web scraping services compared with 18% for internal services. That’s another indicator that the Web data scraping industry is set to take off, according to Evans.
“Right now, about three-quarters of the spend of companies is internal. So they’re still doing a lot of this in-house in their own teams,” he says. “But over time, that’s changing and they’re increasingly relying on external experts instead of doing this in-house…Because it’s getting harder, because the tools are getting better, they’re increasingly spending more money externally.”
Related Items:
The Death of Third-Party Cookies and the Rise of Zero-Party Data
Data Is Everywhere, But Harvest Your Own for Peak AI Performance