Quansight Tackles Support Gap in Python Data Community
Here’s a worrisome statistic to ponder. Despite the millions of people who rely on data science packages like SciPy, NumPy, and Scikit-learn, fewer than five people in the world are paid to maintain them (the number is actually four-and-a-half). The situation worried Python data science luminary Travis Oliphant so much that he co-founded a company called Quansight to do something about it.
Oliphant and Matt Harward co-founded Quansight in early 2018 as an incubator for open source software projects in the Python data science ecosystem. The company, which Oliphant describes as a “public benefit corporation,” pledges to help connect organizations that want to use Python to build data science and machine learning software with the developers who do the actual work.
There are plenty of technical problems to tackle in the world of machine learning and data science. But the real problem is the lack of funding for the open source community that drives much of the technological progress across hundreds of individual projects.
“But it’s not just a technical problem. The real problem is funding. The real problem is how do we fund this? Because we have this stack that was built, and now there’s not that many people,” Oliphant said during a presentation at last week’s OmniSci event. “There’s literally not that many people paid to work on this. I’m trying to change this with PyData, with Quansight Labs, but we’ve just gotten started.”
It’s too important to be left to volunteer time, Oliphant said. “We need to come together … and cooperatively pay for the open source to be done, and have an organization like Quansight serve as the project manager.”
OmniSci, developer of a GPU-based SQL database used for big data analytics, is one of Quansight’s early patrons. Together, OmniSci and Quansight are funding about 10 open source developers who work on connecting OmniSci’s open source database with the broader PyData community.
“The money that comes in goes to pay the salaries of the people that are there,” Oliphant says. “It’s not a profit making activity. My business partner is in the audience. He can definitely account for this.”
Oliphant, of course, is the perfect person for this job. As the primary creator of NumPy, a founding contributor to SciPy, and a co-founder of Anaconda (distributor of the Conda package manager), Oliphant has a solid track record of confronting and solving challenges around the incompatibility of disparate open source projects.
Oliphant’s “aha” moment occurred about 15 years ago, while working on his PhD in biomedical imaging at the Mayo Clinic in Rochester, Minnesota. The younger Oliphant was excited to learn that somebody had developed a Python-based tool that allowed him to work with arrays. But it worked only with a Python package called Numarray, not Numeric, which is the one that he was familiar with.
“That kind of for me was the final straw,” he said in Mountain View last week. “I have this great library I want to use, but now it’s on this other array object that I like. [So] we can’t use them together. There’s this big silo-ization.”
Oliphant took time off as a professor and wrote NumPy to unify the two siloed technologies. Many developers have benefited tremendously from that work over the years. He took this same basic approach in 2012 when he co-founded Anaconda (originally Continuum Analytics) with Peter Wang and created a unified distribution of common Python tools, called Conda.
While NumPy addressed tool disparities around specific uses of Python, Conda has had a much wider impact by helping to standardize and ensure compatibility across hundreds of packages for Python, R, and other languages. It’s been downloaded tens of millions of times and arguably has improved the productivity of every data scientist who touches Python.
In fact, Oliphant’s work had a hand in enabling the first image of a black hole to be created. NumPy was one of the libraries that MIT grad student Katie Bouman used to create the algorithm that eventually yielded the image that she’s credited with capturing from radio telescope data earlier this year.
“It’s very satisfying to realize that scientists are discovering amazing things about the universe using the stuff that you did, not realizing it would have that kind of impact,” he said. “Very, very encouraging.”
Unifying Deep Learning Frameworks
No good deed goes unpunished, as the saying goes. Thanks to the continual onslaught of technological change and human creativity, the fight to create unified packages likely will never end, for Oliphant or anybody else who wades into the middle.
When deep learning emerged on the scene a few years back, the wheels of diversity and technological complexity started spinning a bit faster, and suddenly we had an array of deep learning frameworks to choose from.
“TensorFlow and PyTorch are the big ones, but Amazon is still sitting with MXNet,” Oliphant said. “Chainer is actually more community connected, whereas the big cloud providers and the big organizations essentially went out and reinvented their own wheel.”
There is a good side to technological diversity, according to Oliphant, as it gives customers choice. Products with better features, better architectures, and better support theoretically should outcompete inferior products. But the real world is a lot more fragmented, with many developers chasing many markets, which ramps up the technological diversity.
That creates challenges, including a lack of compatibility across frameworks. Some of those challenges can be overcome with engineering, but that takes time and money. For Oliphant, the emergence of tensor and array projects for deep learning, such as TensorFlow, PyTorch, MXNet, and the rest, poses a problem.
“Remember back when I gave up tenure at an academic post to unify this fledgling array market? Yes, it’s a lot worse!” Oliphant said. “It kind of makes what I did quaint and cute. ‘That was nice. Good job. Thanks for trying to unify things.’ Now we’ve all grown up, so we have lots of diversity.”
Quansight’s goal is not to unify the entire market for deep learning frameworks, which is probably impossible, but instead to develop solutions that provide a path forward for customers. For Quansight, the built-in interoperability benefits of Python provide a good starting point.
“I love other languages, Java one of the least,” Oliphant said. “There’s specific reasons for that. I’m not just a language bigot. I acutely like language interoperability. That’s one of the key things. I love interoperability. I like Python because it’s about interoperability. It’s about helping lots of people using things together. And Python, because of the impact of machine learning, is one of the important languages for data science.”
One branch of Quansight is Quansight Labs, which was created to be a home to developers, community managers, designers, and documentation writers in the PyData community. The group is funded by grants, donations, and industry sponsorships. Another branch is called Open Teams, which works in a more direct manner with sponsoring organizations, such as the work Quansight did with OmniSci to extend the GPU database maker’s GUI, called Immerse, into JupyterLab.
Oliphant told Datanami that Quansight is what Anaconda was initially intended to become before events converged to take Anaconda in a different direction, toward an enterprise data science platform. Now with Quansight, Oliphant is helping to ensure that developers in the position he was once in don’t have to make a hard choice between continuing to work on open source and eating.
“What happens to me is my little side projects became my life [and were] constrained by the fact that I have a family,” Oliphant said. “I couldn’t afford to…hack around all day and not get paid.”
A lot of developers are in that position today. “And I’m trying to make sure that, as that happens that there’s actually a place for them to land, a place for them to come, a place where they can continue adding value,” he continued. “That’s my mission and that’s what I do.”