It Takes a (Data Science) Village: How GRMDS Fosters an Open Ecosystem
Openness is one of the hallmarks of modern data science – openness of data, of code, and of methods. By sharing our successes and failures with others, we raise our collective intelligence level beyond what we could each achieve individually. This open dynamic is driving the growth of the Global Association for Research Methods and Data Science (GRMDS), a budding AI ecosystem composed of 35,000 data science professionals.
GRMDS is emerging as a good source of information for solving real-world data science problems. The platform, which was created by data scientist Alex Liu in 2009, started small, with members assembling at meetups and getting together at user conferences. Over the years, it continued to grow, as Liu took data science jobs at companies like ZestFinance and Shopzilla.
Liu, who earned an MS in statistical computing and a PhD in sociology from Stanford University, kept GRMDS going even after joining IBM in 2013. He eventually earned the titles of chief data scientist and distinguished data scientist at IBM while working on Watson. He left Big Blue last year to devote himself full time to GRMDS, where he is president, and to found his data science startup, RMDS Labs, which provides data science consulting and education.
2019 was a breakout year for Liu and GRMDS, thanks to the launch of a new Web-based AI platform at www.grmds.org and its inaugural user conference, which attracted about 1,700 attendees last month in Pasadena, California, where both GRMDS and RMDS Labs are based.
What makes GRMDS so successful is openness, particularly the willingness of GRMDS members to show one another new approaches for solving data science problems, Liu says.
“A lot of people are quite limited in what they know about models. A mathematical failing,” he tells Datanami. “If you just use one model to solve all the problems, then in average you’ll get nothing. If you have a hammer, you see everything as a nail. That’s not going to work in the data science world. So we feel that it takes a big community for people to work together to solve all the problems and to improve the success ratio, so that’s why we build this ecosystem.”
GRMDS encourages members to share code, data, and methods, which are the key ingredients of any data science project. When members sign up for GRMDS (which is free), they are asked to submit a description of the data science project or projects they are working on. Submitting a description is not required for membership, but it's encouraged.
Arguably the most popular feature on the GRMDS platform is an automatic rating system that uses algorithms to rate each project. The rating is based on the description provided by the user, as well as some external data, such as GitHub activity. It’s similar in some ways to the RG score on ResearchGate, which is used to predict the impact that a given author is expected to have.
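GRMDS has not published how its rating algorithm works, so the following is only a toy illustration of the general idea of blending a user-supplied description with external activity data. The function name, the input signals (commit and star counts), the caps, and the 40/60 weighting are all assumptions made for this sketch, not details of the GRMDS platform.

```python
import math


def impact_score_sketch(description: str, github_commits: int, github_stars: int) -> float:
    """Toy 0-100 score; every weight and cap here is a hypothetical choice."""
    # Description signal: more detail scores higher, capped at 200 words.
    word_count = len(description.split())
    description_signal = min(word_count, 200) / 200

    # Activity signal: log-damped so a few very large repositories
    # do not dominate, then capped at 1.0.
    activity = math.log1p(github_commits) + math.log1p(github_stars)
    activity_signal = min(activity / 10.0, 1.0)

    # Hypothetical blend: 40% description, 60% external activity.
    return round(100 * (0.4 * description_signal + 0.6 * activity_signal), 1)
```

A real scoring system would presumably draw on far richer signals than word counts and repository statistics. The sketch only shows how two unrelated inputs can be normalized to a common scale before being combined.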
“On our platform, we also have a score called an Impact Score, which we try to measure the actual practical impact of a data science project, and people love it,” Liu says. “They think we are the only one who are doing this, who measure how impactful a data science project can be.”
The Impact Score also factors into recommendations that the GRMDS platform makes to members. The idea here is to recommend other data science projects for members to check out, with the goal of helping members get new ideas – including actual code and data – for solving data science challenges. As more members join GRMDS, the recommendations get better.
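The article does not describe how the recommendations are computed. As a rough sketch of one way a score could factor into a recommendation, the example below ranks projects by topic overlap with a member's interests, weighted by a 0-100 score. The `Project` class, the tag-based matching, and the multiplication rule are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    tags: frozenset
    impact_score: float  # assumed to be on a 0-100 scale


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of tags the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(member_tags: frozenset, catalog: list, top_n: int = 3) -> list:
    """Return up to top_n project names, most relevant first."""
    def relevance(project: Project) -> float:
        # Hypothetical rule: topic overlap scaled by the project's score.
        return jaccard(member_tags, project.tags) * (project.impact_score / 100)

    ranked = sorted(catalog, key=relevance, reverse=True)
    # Drop projects with no topic overlap at all.
    return [p.name for p in ranked if relevance(p) > 0][:top_n]
```

Under these assumptions, a project with a perfect topic match but a modest score can still outrank a high-scoring project that only partly matches the member's interests.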
RMDS Labs, which manages the GRMDS organization, has developed the RM4Es framework for validating data science processes. The framework, which defines a workflow for managing the flow of source data through training and testing processes, is a key element in the group’s ability to assess data science work.
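The article does not detail the framework itself. For readers unfamiliar with the general idea of routing source data into separate training and testing sets, here is a generic holdout split in plain Python. It is a common textbook technique, not the RM4Es framework; the function name and default values are assumptions chosen for this sketch.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle records reproducibly and hold out a fraction for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")

    shuffled = list(records)
    # A fixed seed means anyone rerunning the split gets the same partition.
    random.Random(seed).shuffle(shuffled)

    n_test = max(1, int(len(shuffled) * test_fraction))
    train, test = shuffled[n_test:], shuffled[:n_test]
    return train, test
```

Fixing the seed is a small design choice, but it lets someone else rerun the split and obtain exactly the same training and testing records.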
GRMDS is also working with Harvard University’s Dataverse project, which is a repository for research data across a number of fields, from physics and math to business management and information sciences. The Dataverse project seeded the GRMDS repository with data from about 9,000 research projects, Liu says.
The collaboration with Dataverse – where methods, models, and data are shared in the open – will help move the ball forward on the scientific community’s reproducibility crisis, Liu says.
“You’ve heard about the reproducibility crisis in the scientific community,” he says. “Now everybody is trying to come up with a solution to solve that problem, which is one of the reasons that Harvard University started the Dataverse project. So now there’s a need in data science for them to prove their project and workflow can be reproduced. In order to do that, they need to make it open.”
With thousands of highly educated data scientists hooked into the network, GRMDS has the intellectual heft to parse the scientific claims and render a verdict. That’s another value-add that GRMDS can bring to the data science community, Liu says.
“We are working together to try to come up with a system that everybody can trust and use to verify the replication and reproducibility, because some people want to get a third-party to certify it so their science work can be replicated so it has more value,” he says.
The sharing of data is a key aspect of GRMDS and its mission to move the needle on data science success. “One reason we have higher failure of research for data science projects is people do not have enough data,” he says. “Mathematically, if you don’t have enough data, your model will be biased.”
Openness is certainly a major theme with GRMDS, and it actively encourages all users to share their data and models. But that’s not to say that full and complete openness of code and data is absolutely mandatory in the group.
According to Liu, the organization will try to work with members whose hands are tied because of the company they work for, especially among financial services companies and banks.
“We probably will have to establish some separate system, where we will still encourage them to collaborate, but in a more small scale and in a protected environment,” he says. “Even for corporations, they do not need to really protect everything. Maybe just the data or just some algorithm, so they can still be perhaps okay to share some of their other assets. That way they can still contribute back to the community.”
Looking forward, GRMDS is set to grow in 2020. The organization keeps the lights on by monetizing traffic to its website and has plans to sell sponsorships. Book deals are also a possibility. The group has also established connections with universities and other learning institutions in the Southern California region, including Caltech, JPL, USC, Loyola Marymount University, and the University of California campuses in Riverside and Los Angeles. The group provides capstone content for several of these university data science programs.
With data science and AI set to have a bigger impact on business computing in the years to come, GRMDS is certainly well positioned to play a role, which is reason enough to keep your eyes on it.