Crawling the Web for Data and Profit
The Web serves as a vast, renewable resource for the most valuable thing in existence: data. However, getting useful data from the Web isn’t always an easy task. Luckily, there are a handful of open source and commercial solutions that can help you crawl the Web to feed your hungry algorithms with the freshest data.
Not all data is created equal on the Web. Depending on what you’re looking for (news reports, stock prices, product availability, or social media posts), different techniques are required for clean and accurate extraction. For example, a Yelp results page may list a certain type of restaurant in a given city, but more information, including the all-important user reviews and their timestamps, requires drilling down into the individual restaurant pages.
Some Web crawlers are better at extracting certain types of data than others. Doug Cutting’s Apache Nutch, which served as the basis for Hadoop, was developed to crawl the Internet, so scalability is a big deal. Other crawlers, however, are more adept at retrieving data spread across different repositories, as in the Yelp example above.
One Web crawler developer, Import.io, touts its capability to “power business with alternative data.” The company’s crawler sports machine learning technology that helps return relevant results to users. If the results are not to the users’ liking, the crawler can automatically tweak its approach to get more or better data.
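The drill-down pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not how Yelp or any particular crawler works: the URLs, the `biz` and `review` class names, the `data-date` attribute, and the in-memory `PAGES` dictionary (which stands in for real HTTP requests) are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "https://example.com/search?q=thai": (
        '<ul><li><a class="biz" href="/biz/thai-garden">Thai Garden</a></li>'
        '<li><a class="biz" href="/biz/bangkok-cafe">Bangkok Cafe</a></li></ul>'
    ),
    "https://example.com/biz/thai-garden": (
        '<div class="review" data-date="2017-05-01">Great curry</div>'
        '<div class="review" data-date="2017-06-12">Slow service</div>'
    ),
    "https://example.com/biz/bangkok-cafe": (
        '<div class="review" data-date="2017-04-20">Best pad thai</div>'
    ),
}


class LinkCollector(HTMLParser):
    """Collects absolute URLs of detail pages linked from a listing page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "biz" and "href" in attrs:
            self.links.append(urljoin(self.base_url, attrs["href"]))


class ReviewCollector(HTMLParser):
    """Collects review text and its timestamp from a detail page."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._date = None  # set while the parser is inside a review <div>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "review":
            self._date = attrs.get("data-date")

    def handle_data(self, data):
        if self._date is not None and data.strip():
            self.reviews.append({"date": self._date, "text": data.strip()})

    def handle_endtag(self, tag):
        if tag == "div":
            self._date = None


def fetch(url):
    """Stand-in for an HTTP GET; a real crawler would call the network here."""
    return PAGES[url]


def crawl_listing(listing_url):
    """Level 1: read the listing page. Level 2: visit each detail page."""
    listing = LinkCollector(listing_url)
    listing.feed(fetch(listing_url))
    results = {}
    for url in listing.links:
        detail = ReviewCollector()
        detail.feed(fetch(url))
        results[url] = detail.reviews
    return results


reviews = crawl_listing("https://example.com/search?q=thai")
```

The listing page only yields names and links; the reviews and their dates appear only after the second round of fetches, which is why this kind of extraction takes more requests than a flat crawl.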
Import.io also offers more advanced features, such as the capability to merge data from multiple sources and create a common schema that lets the user do more with it. The company also offers some reporting and visualization functions, such as comparison reports that show how things have changed.
Another popular Web crawler is DeepCrawl. The product emerged from a desire to archive a large website with millions of flaws. When the first attempt to crawl the website resulted in a courier hand-delivering a hard drive, the company realized it needed to build its own crawler. Today DeepCrawl is used most often for search engine optimization (SEO) endeavors, which are a common use for Web crawlers.
However, not all Web crawlers are focused on marketing. One of the newer firms generating a buzz in the Web crawling field is Webhose.io. Since 2014, the Israeli company has been crawling and indexing content on specific types of websites, including news, blog, and e-commerce sites, and making the data available to customers via an API.
Ran Geva, Webhose’s co-founder and CEO, said the data his company captures off the Web helps customers make better decisions with their own products and services. “Basically we’re a Web data provider,” he said in a recent interview with Datanami. “We crawl the Web and try to draw as much of the Web as we can 24/7 and turn it into a machine readable format.”
Webhose does its best to automatically clean up the data that it archives from news and blog sources. Dealing with formatting issues can be difficult, and even something as seemingly simple as getting a publishing date can be a challenge. “Going back and querying for news blogs from 3 years ago and getting the clean version by date – I think we’re the only solution to be able to do that,” Geva said.
The Webhose platform, which is built on various Python-based crawlers and components of the ELK (Elasticsearch, Logstash, Kibana) stack, includes a built-in reporting capability that allows customers to set filters for their queries. So if a customer wants to search only news or blog sites that have a minimum of 50 participants or 100 likes on social media, Geva said, they can configure that from the Web-based interface.
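To see why a publishing date is harder than it looks, consider that the same day can appear in many textual forms across sites. The following Python sketch normalizes a few of them to a single UTC date. It is only an illustration under stated assumptions: the list of formats is invented for the example, naive timestamps are assumed to be UTC, and nothing here reflects Webhose’s actual cleaning pipeline.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed set of formats for this sketch; real sites use many more.
FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d", "%B %d, %Y"]


def normalize_date(raw):
    """Return the publish date as an ISO 'YYYY-MM-DD' string in UTC, or None."""
    raw = raw.strip()
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
            break
        except ValueError:
            continue
    else:
        # Fall back to the RFC 2822 style used by e-mail and RSS feeds.
        try:
            parsed = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without a zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).date().isoformat()
```

Note that a late-evening post stamped `Mon, 12 Jun 2017 23:30:00 -0500` lands on June 13 once converted to UTC, the kind of detail that matters when archived articles are later queried by date.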
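A filter like the one Geva describes could be expressed roughly as follows. This is a hedged sketch: the `Post` record, its field names, and the `matches` function are hypothetical and are not part of the Webhose API or interface.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record layout, invented for this example.
    url: str
    site_type: str      # e.g. "news", "blogs", "discussions"
    participants: int   # number of people taking part in the thread
    likes: int          # social media likes


def matches(post, site_types=("news", "blogs"), min_participants=50, min_likes=100):
    """Keep news or blog posts with at least 50 participants or 100 likes."""
    return post.site_type in site_types and (
        post.participants >= min_participants or post.likes >= min_likes
    )


posts = [
    Post("https://example.com/a", "news", participants=60, likes=10),
    Post("https://example.com/b", "blogs", participants=5, likes=250),
    Post("https://example.com/c", "news", participants=5, likes=10),
    Post("https://example.com/d", "discussions", participants=500, likes=900),
]
selected = [p.url for p in posts if matches(p)]
```

Here only the first two posts pass: the third falls below both thresholds, and the fourth is on a site type the customer did not ask for.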
The data gathered by Webhose is often used for analyses. “There’s a lot of data that you can later feed into your own system and run very granular type of analysis on top of the data,” Geva said.
“For example, we have financial companies that are using our news and blogs archive to train their machine learning algorithms to correlate information mentioned about companies and their stock performance to try to predict trends for the future,” he continues. “So if you have a three to four year overview and a theory about what causes what, then you can check it on past events.”
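As a rough illustration of checking such a theory on past events, the Python sketch below measures how strongly daily mention counts line up with the next day’s stock return. Everything in it is an assumption made for the example: the function names, the one-day lag, and the toy numbers. It shows a plain lagged Pearson correlation, not the machine learning models those financial companies actually use, and a high correlation on historical data does not by itself establish cause or predictive power.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(mentions, returns, lag=1):
    """Correlate mentions on day t with returns on day t + lag."""
    paired_mentions = mentions[: len(mentions) - lag]
    paired_returns = returns[lag:]
    return pearson(paired_mentions, paired_returns)


# Toy data, invented for illustration: each day's return is exactly
# twice the previous day's mention count, so the lag-1 correlation is 1.
daily_mentions = [1, 3, 2, 5, 4]
daily_returns = [0, 2, 6, 4, 10]
r = lagged_correlation(daily_mentions, daily_returns, lag=1)
```

With a multi-year archive, the same calculation can be repeated across many companies and lags to see whether a hypothesis holds up outside the period that inspired it.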
The company also monitors e-commerce sites to track product pricing and stock availability. It currently tracks the price and availability of 20 million products across thousands of websites. However, the task is made more difficult because some e-commerce sites, like Amazon.com, thwart crawlers by throttling them, Geva said.
While the company has a nearly four-year archive covering millions of websites, it keeps only 30 days’ worth of data live on its system to guarantee fast response times. Queries going back years and touching millions of records often take hours, if not days, to complete, Geva said.
Webhose offers its service for free, but only to a point. Students and small businesses can get free access to 1,000 API calls per month, each of which returns 100 results, giving them the capability to gather data from 100,000 hosts. For activity above that level, subscriptions start at $50 per month.
It’s a small price to pay for live Web data that would otherwise be very difficult for companies to obtain, Geva said. “Many customers want access to the archive, but they don’t want to pay,” he said. Maintaining and updating the archive is “very expensive,” he said, which is why some companies try to do it themselves. Big mistake, he said.
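One common way for a crawler to cope with throttling is to slow down and retry with increasing delays. The sketch below shows that general technique, exponential backoff, in Python. It is an assumption for illustration only: the article does not say how Webhose handles throttled requests, and the `Throttled` exception and `fetch_with_backoff` function are hypothetical names.

```python
import time


class Throttled(Exception):
    """Raised by a fetcher when the server signals throttling (e.g. HTTP 429)."""


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Usage with a fake fetcher that is throttled twice before succeeding.
calls = {"count": 0}
waits = []


def fake_fetch(url):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise Throttled(url)
    return "<html>price: 19.99</html>"


page = fetch_with_backoff(fake_fetch, "https://example.com/item/1", sleep=waits.append)
```

Passing the `sleep` function as a parameter keeps the example testable without real waiting; in production the default `time.sleep` would apply. Backing off reduces load on the target site but also slows data collection, which is part of why tracking millions of products is difficult.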
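The free-tier figure follows from simple multiplication: 1,000 calls times 100 results per call is 100,000 results per month. A short Python sketch makes it easy to check whether a planned workload fits. The constant and function names are made up for the example, and the quota values are the ones quoted above.

```python
from math import ceil

FREE_CALLS_PER_MONTH = 1_000  # free-tier quota quoted in the article
RESULTS_PER_CALL = 100        # results returned by each API call


def calls_needed(results_wanted):
    """Number of API calls required to retrieve the given number of results."""
    return ceil(results_wanted / RESULTS_PER_CALL)


def fits_free_tier(results_wanted):
    """True if the workload stays within the monthly free quota."""
    return calls_needed(results_wanted) <= FREE_CALLS_PER_MONTH


free_tier_results = FREE_CALLS_PER_MONTH * RESULTS_PER_CALL
```

Because each call returns results in batches of 100, a request for 250 results consumes three calls, not two and a half.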
“If you want to collect data from one to 10 sites, you can do it yourself,” he said. “But if you want to do it at scale, it’s almost impossible. You have to invest into a lot of time and money … they’re basically re-inventing the wheel. They’re building their own solution, re-crawling the same sites, downloading the same data and they’re wasting a lot of time and effort.”
Recently Webhose unveiled an API for crawling the Dark Web, the section of the Internet that’s accessed through the Tor browser to protect people’s anonymity. It took some special development work to get Webhose’s proprietary crawler to work with Dark Web sites, which are constantly going up and down.
To get access to protected areas of the Dark Web, which some cybersecurity firms desire, Webhose had to figure out a way to get its crawlers past the CAPTCHA systems that Dark Web hosters use to keep out such crawlers.
“Whatever you need to do to get data from the Dark Web,” Geva shrugged. “There are deeper places where they need to vet you — we’re not there just yet. But for all the rest of the marketplace and the message boards that are password protected, we can provide this data to either brands or cybersecurity companies that can shed some light on the Dark Web.”