How Mashable Seeds Journalism with Data Science
In journalism, researching what to write about often takes more time and effort than actually writing. But the online news site Mashable has found a way to accelerate the editorial process using a judicious application of big data tech and data science.
Mashable has come a long way since a 19-year-old Pete Cashmore started it in his bedroom in Scotland in 2006. At that time, Cashmore would devour all manner of social media and blogs with the goal of telling people “what they wanted to know next.” It was an entirely manual process, with Cashmore handling all aspects of the operation.
Today, Mashable has emerged as one of the most popular news sites for social trends, politics, and technology. In seeking to become "the media company for the Connected Generation and the voice of digital culture," the company, ranked the 266th most popular website in the United States by Alexa, has also leaned heavily on emerging technologies to help it punch above its weight.
Haile Owusu, chief data scientist at Mashable, recently gave Datanami the inside scoop on Mashable's secret journalistic weapon, called Velocity, which helps a team of 60 editors and reporters not only find topics to write about, but also write the stories themselves.
Big Data Journos
Several years ago, Mashable started building Velocity to scour the Web and social media sites like Twitter and Facebook for topics that are trending. The idea was that if the system could catch a story early, as it starts to pop on Reddit but before it trends widely, Mashable's editors could be more proactive and have a story ready when it hits the mainstream.
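The article doesn't describe how Velocity scores trends, but the core idea, catching a topic while its mention rate is accelerating rather than after its raw volume is large, can be sketched in a few lines. Everything below (the function name, the windowing scheme, the thresholds) is a hypothetical illustration, not Velocity's actual logic:

```python
# Hypothetical sketch: flag topics whose mention counts are accelerating
# across fixed time windows, before they reach mainstream volume.
from collections import defaultdict

def trending_topics(mentions, growth_threshold=2.0, min_mentions=5):
    """mentions: list of (window_index, topic) pairs, e.g. hourly buckets.
    Returns topics whose latest-window count grew by at least
    `growth_threshold`x over the previous window."""
    counts = defaultdict(lambda: defaultdict(int))
    for window, topic in mentions:
        counts[topic][window] += 1
    latest = max(w for w, _ in mentions)
    flagged = []
    for topic, per_window in counts.items():
        now, before = per_window[latest], per_window[latest - 1]
        if now >= min_mentions and before > 0 and now / before >= growth_threshold:
            flagged.append(topic)
    return flagged

# "spark" jumps from 2 to 6 mentions between windows, so it gets flagged;
# "eclipse" doubles too but stays below the minimum-volume floor.
mentions = ([(0, "quantum"), (0, "quantum"), (1, "quantum")]
            + [(0, "eclipse")] + [(1, "eclipse")] * 2
            + [(0, "spark")] * 2 + [(1, "spark")] * 6)
print(trending_topics(mentions))
```

The minimum-mentions floor matters: a topic going from one mention to two has "doubled" but is still noise, which is why rate-of-change alone isn't enough.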
Owusu was brought in to help extend Velocity to the next level by devising algorithms that could boost the predictive power of the program. Owusu and his team use a combination of natural language processing (NLP) and other machine learning methods to model huge amounts of content ingested into the system.
Based on this extensive corpus of archived stories, Velocity is essentially able to "learn" what a good story looks like, and perhaps more importantly, which ones will do well on social media.
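"Learning what a good story looks like" from an archive is, at its simplest, a supervised text-classification problem. Mashable's actual models aren't described, so the following is a loose illustration using a hand-rolled naive Bayes over word counts, with an invented toy archive; a production system would use far richer NLP features:

```python
# Hypothetical sketch: score a draft's social-media prospects by learning
# word statistics from an archive of past articles (naive Bayes, add-one
# smoothing). Not Mashable's actual model.
import math
from collections import Counter

def train(archive):
    """archive: list of (text, label), label 1 = did well on social media."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in archive:
        word_counts[label].update(text.lower().split())
        class_counts[label] += 1
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def score(model, text):
    """Log-odds that `text` performs well; positive means 'likely good'."""
    word_counts, class_counts, vocab = model
    logodds = math.log(class_counts[1] / class_counts[0])
    for word in text.lower().split():
        p1 = (word_counts[1][word] + 1) / (sum(word_counts[1].values()) + len(vocab))
        p0 = (word_counts[0][word] + 1) / (sum(word_counts[0].values()) + len(vocab))
        logodds += math.log(p1 / p0)
    return logodds

archive = [
    ("viral cat video breaks the internet", 1),
    ("celebrity tweet sparks huge meme storm", 1),
    ("quarterly earnings filed with regulators", 0),
    ("committee schedules procedural budget vote", 0),
]
model = train(archive)
print(score(model, "new viral meme video"))  # positive: resembles past hits
```

The point of the sketch is the shape of the problem, labeled historical content in, a predicted social score out, rather than the particular classifier.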
“We have a large historical repository of content that we’ve crawled on the Web,” Owusu says. “A writer can see in real time as they’re writing, what are the nearest pieces in recent memory that correspond to that. The idea there is one can have a sense of where one’s competitive advantage might lie if one sees where one’s nearest competitors are for a given article that you’re writing.”
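The "nearest pieces in recent memory" lookup Owusu describes is a classic nearest-neighbor search over document vectors. As a minimal sketch, assuming nothing about Velocity's real representation, here is bag-of-words cosine similarity over a tiny invented archive:

```python
# Hypothetical sketch: find the archived articles most similar to a draft,
# using bag-of-words cosine similarity. A real system would use richer
# document embeddings, but the retrieval shape is the same.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest_articles(draft, archive, k=2):
    """Return titles of the k archived articles closest to the draft."""
    dvec = Counter(draft.lower().split())
    scored = [(cosine(dvec, Counter(text.lower().split())), title)
              for title, text in archive]
    return [title for _, title in sorted(scored, reverse=True)[:k]]

archive = [
    ("A", "apple iphone camera review first impressions"),
    ("B", "samsung galaxy camera review and photos"),
    ("C", "senate passes annual farm subsidy bill"),
]
print(nearest_articles("iphone camera hands on review", archive, k=2))
```

Showing a writer these neighbors in real time is what lets them judge, as Owusu puts it, where their competitive advantage lies for a given article.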
And Headlines, Too
As the dashboard evolved into a full-blown content management system (CMS), Mashable's editors and reporters learned to lean on it for direction.
The system even helps with the editor's ultimate challenge: writing clever headlines. "We're actually able to, given a bunch of variants, make on-the-fly choices about which is going to be the better headline to show for that particular article," Owusu says.
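Choosing among headline variants on the fly is naturally framed as a multi-armed bandit: mostly serve the variant with the best observed click-through rate, but keep exploring the others. The article doesn't say what method Mashable uses, so this epsilon-greedy selector is purely an assumed illustration:

```python
# Hypothetical sketch: epsilon-greedy selection among headline variants.
# Not Mashable's actual method; any bandit strategy would fit this slot.
import random

class HeadlineSelector:
    def __init__(self, variants, epsilon=0.1, seed=None):
        self.variants = list(variants)
        self.epsilon = epsilon          # fraction of traffic spent exploring
        self.clicks = {v: 0 for v in self.variants}
        self.shows = {v: 0 for v in self.variants}
        self.rng = random.Random(seed)

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.variants)       # explore
        # exploit: highest smoothed click-through rate (Laplace smoothing
        # avoids dividing by zero before a variant has been shown)
        return max(self.variants,
                   key=lambda v: (self.clicks[v] + 1) / (self.shows[v] + 2))

    def record(self, variant, clicked):
        self.shows[variant] += 1
        self.clicks[variant] += int(clicked)

sel = HeadlineSelector(["Bold claim", "Question hook"], epsilon=0.1, seed=42)
headline = sel.choose()
sel.record(headline, clicked=True)
```

The appeal for an editorial team is that weak headline variants are retired automatically by falling traffic share rather than by an up-front guess.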
Velocity has emerged as a “multi-purpose predictive tool,” Jim Roberts, Mashable’s executive editor and chief content officer, said at the INMA’s Big Data for Media conference in 2015.
"Many people in the news business fear data because it will tell them they are doing something wrong," Roberts said at that event, according to an International News Media Association story. "But I can assure you that data is our friend — in fact it is our lifeblood at Mashable."
Tech Behind the Scenes
Mashable uses an array of big data and data science tools and technology to make the magic happen, including AWS clusters, Redshift data warehouses, Google BigQuery data stores, Wolfram's Mathematica, Jupyter data science notebooks, Beaker notebooks, and Domino Data Lab's data science platform.
Owusu, whose background is in theoretical condensed matter physics, appreciates how Domino Data Lab creates a buffer between the data science work his team does and the actual engineering work of running the algorithms his team creates at scale.
“I’m not an engineer. I’m very comfortable coding, but I don’t spend a lot of time thinking about architecture, per se,” he says. “I’ve never been very comfortable living in AWS instances. I would say Domino Data Science abstracts what feels to me, as a non-engineer, as a lot of unnecessary nonsense.”
Owusu, who is more comfortable in R than the rest of his team (who are Pythonistas), also appreciates how Domino helps his team of four data scientists collaborate on research and algorithms.
"Domino has been, to be perfectly frank, the organizing principle behind our data science operations," he says. "From basic exploratory analyses to more involved R&D, and all the way up to deployment, Domino Data Lab is where we kind of live."
With Velocity scouring the Web and identifying topics of interest, Mashable’s journalists have an inside view into popular culture, a leg up on competing rags that are still assimilating data manually.
Keeping up with the news is one thing, but keeping up with big data technology's rapid evolution is something else. At Mashable, Owusu and his team are experimenting with newer technology like Apache Spark and Google's TensorFlow framework to keep Velocity on the cutting edge.
“We’re really at the very beginnings of expanding our operations to include deep learning,” he says. “That’s a place we want to go.”
While some news outfits are actually letting the algorithms write the stories, don’t expect Mashable to go that far. The company does use algorithms to guide its editors and reporters, but it’s all done from within the context of maintaining journalistic integrity.
“We do not supplant the creative insight of writers,” he says. “What we try to offer is sufficient background so that in a highly competitive environment, a writer can in short order make decisions as to what constitutes the best story.”