Scalability Answers Found in the NoSQL Family Tree
When it comes to big genealogical databases, they don’t come any bigger than the one at FamilySearch. So when this arm of The Church of Jesus Christ of Latter-day Saints started to reach the limits of its Oracle data store, it branched out into the distributed NoSQL world.
If you know anything about the Mormons, you know that they’re meticulous about storing genealogical records. From its headquarters near Salt Lake City, Utah, the LDS church has amassed a huge stockpile of information about the familial connections of not only its own members, but everybody else’s, too.
The job of managing all this genealogical data falls to FamilySearch. Previously known as the Genealogical Society of Utah, FamilySearch is the largest genealogy organization in the world. The group has collected billions of primary records over the years that document the births, marriages, and deaths of billions of people around the world through the ages. These records are archived in millions of rolls of microfilm that the LDS church has stored in a vault carved deep inside a Utah mountain. Other similar vaults have been turned into data centers that store digital copies of the records.
In addition to storing the actual records – physical and digital – FamilySearch maintains databases that organize the data and make it searchable to the world. One of those is a historical records repository that stores about 7 billion records and is searchable through Solr. The conclusions database, meanwhile, links about 1.4 billion records together to create what the church calls the Family Tree of Man. These databases are accessible to the public over the Internet, and anybody who wants to conduct their own genealogical research can use them for free.
For years, FamilySearch ran its conclusions database on Oracle’s eponymous relational database management system (RDBMS) using a pair of replicated 64-core Fujitsu servers equipped with about 48GB of RAM apiece. But as the number of Internet searches began to increase at a substantial clip, the organization realized in the 2015 timeframe that it needed to scale the database up if it was going to maintain a decent response time.
FamilySearch CTO Tom Creighton says the group’s first inclination was to scope out what it would take to keep the system running on Oracle. “We knew we were approaching a time when we would hit at least a 10-fold increase in transaction rates based on the increase in user traffic,” he tells Datanami. “So we started looking at, how are we going to handle that?”
Sharding the Oracle database was one option that was quickly ruled out. “There are techniques to do it, but they’re not perfect,” Creighton says. “We’ve tried all the techniques for this application, and we’re not happy with the results.”
Keeping the system running on a scale-up, non-sharded Oracle database would have required a big investment in hardware, taking it into the Exadata realm. However, the projections based on real transactional data showed that, even with several Exadata machines, the performance was going to be average at best, according to Creighton.
“We didn’t have a lot of good choices,” he says. “We could have gotten it to work with Oracle, but it would have been very much more expensive because of the kind of equipment we had to use. We had to scale up, not scale out…So we started looking at other alternatives.”
The group gravitated toward NoSQL databases, which are built on scale-out architectures. Early in 2016, Creighton and his IT staffers undertook a bake-off to see which NoSQL database was better suited for its application. The candidates were a pair of open source products: a popular document database and Apache Cassandra, which is often described as a wide-column store.
“We did a bake-off…and Cassandra won,” Creighton says. The document store “did really well at one level of scale…But we had trouble running it in Amazon in a multi-node, at-scale environment. That’s where Cassandra just flew. It just worked.”
Today, FamilySearch is running its conclusions database on DataStax Enterprise, the commercially supported version of Cassandra sold by DataStax. The cluster is organized into three rings that run across 39 Amazon Elastic Compute Cloud (EC2) instances of the R3 2XL type, with about 1800 GB of RAM in total, according to Creighton.
Since the organization switched over to DataStax Enterprise about a year ago, the demand on the application has “gone through the roof,” Creighton says. Queries have increased by about 7x, to about 500,000 unique queries per day, including a peak load of about 30,000 queries at any one time. “Those nodes are at most 20% to 24% CPU utilization, so they’re not sweating it too much at all to keep up with the load,” he says.
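To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (500,000 queries per day, a 7x increase, 39 nodes, about 1800 GB of RAM). The even spread across nodes is an assumption made for illustration; in practice each query touches several replicas, and traffic is not uniform through the day.

```python
# Figures quoted in the article.
QUERIES_PER_DAY = 500_000   # unique queries per day after the migration
GROWTH_FACTOR = 7           # "about 7x" increase in queries
NODES = 39                  # EC2 nodes in the cluster
TOTAL_RAM_GB = 1800         # approximate RAM across the whole cluster

SECONDS_PER_DAY = 24 * 60 * 60


def load_summary(queries_per_day, growth, nodes, total_ram_gb):
    """Derive rough averages; assumes load is spread evenly (illustrative only)."""
    return {
        "queries_per_day_before": queries_per_day / growth,
        "avg_queries_per_second": queries_per_day / SECONDS_PER_DAY,
        "avg_queries_per_node_per_day": queries_per_day / nodes,
        "ram_gb_per_node": total_ram_gb / nodes,
    }


summary = load_summary(QUERIES_PER_DAY, GROWTH_FACTOR, NODES, TOTAL_RAM_GB)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

These are daily averages, so they say nothing about the peak of roughly 30,000 simultaneous queries, which is the number that actually drives capacity planning.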
The system is also highly reliable, thanks to how it uses Cassandra rings, Creighton says. FamilySearch is running with a replication factor of three, which increases the number of copies of data that it’s storing in each ring, but also makes the overall cluster more resilient to losing nodes.
“If we lost an entire availability zone, the ring would become unstable,” Creighton says. “But the cluster would remain stable, because there’s another entire ring in a different region that wouldn’t be affected.”
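The scenario Creighton describes can be sketched as a toy model. The layout below is hypothetical: the article gives only the totals (three rings, 39 nodes), so the even split of 13 nodes per ring, the ring names, and the zone names are all assumptions made for illustration, not FamilySearch’s actual deployment.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Node:
    ring: str   # hypothetical ring name
    zone: str   # hypothetical availability zone name


def build_cluster(zones_by_ring: Dict[str, List[str]], nodes_per_ring: int) -> List[Node]:
    """Place nodes round-robin across each ring's zones."""
    nodes = []
    for ring, zones in zones_by_ring.items():
        for i in range(nodes_per_ring):
            nodes.append(Node(ring=ring, zone=zones[i % len(zones)]))
    return nodes


def healthy_rings(nodes: List[Node], failed_zone: str) -> Set[str]:
    """Rings that lose no nodes when failed_zone goes down."""
    all_rings = {n.ring for n in nodes}
    affected = {n.ring for n in nodes if n.zone == failed_zone}
    return all_rings - affected


# Assumed layout: three rings, each in its own region with three zones.
layout = {
    "ring-1": ["zone-a1", "zone-a2", "zone-a3"],
    "ring-2": ["zone-b1", "zone-b2", "zone-b3"],
    "ring-3": ["zone-c1", "zone-c2", "zone-c3"],
}
cluster = build_cluster(layout, nodes_per_ring=13)

survivors = healthy_rings(cluster, failed_zone="zone-a1")
print(sorted(survivors))
```

In this model, losing one zone degrades only the ring that contains it, and the other rings are untouched. That matches the quote in spirit; whether a degraded ring can still answer queries depends on settings the article does not describe.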
FamilySearch has been quite satisfied with the shift to NoSQL technology for both of its major databases, the historical records database and the conclusions database. But those aren’t the only places where NoSQL is making a mark at FamilySearch.
“We’re very happy with Cassandra and DataStax, and we’re implementing it in other applications as we speak,” Creighton says. “We don’t exclusively use Cassandra, but many of our applications are migrating that way, and certainly for the new ones where we recognize the need for scale and high reliability, it’s very nice for that.”
While NoSQL databases can be more scalable and fault-tolerant than relational databases, they don’t come without their own tradeoffs. For starters, NoSQL databases often sacrifice data consistency (the “C” in ACID) in exchange for a greater capability to scale. That’s not necessarily a tradeoff that a bank, for instance, would be willing to make, because it would hurt its ability to process transactions in a reliable manner. For FamilySearch, the tradeoff made sense because of the nature of the application, which had a read-write ratio on the order of 100 to one.
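In Cassandra, how strictly reads and writes must agree is tunable per request through consistency levels such as ONE, QUORUM, and ALL. A common rule of thumb is that a read is guaranteed to see the latest write when the replicas contacted for the read and for the write overlap, that is, when R + W > RF. The following is a minimal sketch of that rule, assuming a replication factor of three. The article does not say which consistency levels FamilySearch actually uses, and the helper names are made up for this example.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge a request at a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # smallest majority
        "ALL": replication_factor,
    }
    return required[level]


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


RF = 3  # replication factor, as quoted in the article
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    overlap = read_sees_latest_write(write_level, read_level, RF)
    print(f"write={write_level:<6} read={read_level:<6} overlap guaranteed: {overlap}")
```

With three replicas, ONE/ONE gives the fastest requests but no overlap guarantee, while QUORUM/QUORUM guarantees overlap and still tolerates one replica being down. For a workload that reads about 100 times more often than it writes, cheaper reads are an attractive place to spend that flexibility.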
There’s one more factor that brought FamilySearch to NoSQL: it had already done the hard work to migrate its database model to a document-based view of the world. Most NoSQL databases store data in JSON or similar data structures, as opposed to the highly structured rows, columns, tables, and indexes of a relational database. Because FamilySearch was storing documents, it did the work to modify its Oracle database schema to use JSON years ago.
“We already had that in place by the time we made the transition to Cassandra,” Creighton says. “While we still transformed our data model, it was not nearly as large an effort as it would have been had we been on a fully normalized database.” The group also had built an API layer that further shielded applications from changes to the underlying database schema, which helped minimize disruption, too.
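To illustrate the difference between a document view and a fully normalized one, here is a small Python sketch. The record, its field names, and the table names are entirely hypothetical; the article does not describe FamilySearch’s actual schema.

```python
import json

# Hypothetical document: everything about one person travels together.
person_doc = {
    "person_id": "P-0001",
    "name": "Example Person",
    "events": [
        {"type": "birth", "year": 1850, "place": "Example Town"},
        {"type": "marriage", "year": 1875, "place": "Example City"},
    ],
    "parents": ["P-0002", "P-0003"],
}


def to_normalized_rows(doc):
    """Split one document into the rows a normalized schema would need."""
    person_row = {"person_id": doc["person_id"], "name": doc["name"]}
    event_rows = [{"person_id": doc["person_id"], **event} for event in doc["events"]]
    parent_rows = [
        {"child_id": doc["person_id"], "parent_id": parent} for parent in doc["parents"]
    ]
    return {"person": [person_row], "event": event_rows, "parent_link": parent_rows}


# Document form: one value to store and fetch by key.
serialized = json.dumps(person_doc, sort_keys=True)

# Normalized form: rows in three tables that must be joined to rebuild the person.
tables = to_normalized_rows(person_doc)
for table, rows in tables.items():
    print(table, len(rows))
```

A store that already holds the serialized document can hand it to a key-value or wide-column database with little reshaping, whereas the normalized form first has to be reassembled from several tables. That is one plausible reading of why the earlier move to JSON made the later migration a smaller job.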
While genealogy is a hobby for many, it’s a serious endeavor for FamilySearch, which is why it’s invested so much time and money into building data systems capable of handling great loads. (The organization is also a pioneer in pushing the state of the art in software and hardware for automated scanning of microfilm, but that’s another story.)
As demand for its free service increases around the world, FamilySearch is taking steps to ensure that everybody has a good experience searching its Family Tree of Man. That includes establishing databases on other continents, including Europe, Africa, and Asia, to keep latency from getting too high. DataStax Enterprise and AWS cloud computing are the key ingredients that make that possible.
“We’ve been very pleased with Cassandra, and especially the DataStax toolset,” Creighton says. “Although there’s a steeper learning curve to get started, once you get past that, it’s much better in terms of managing a large cluster and the scale options are much greater with DataStax than what we had seen with” other NoSQL databases.
Cassandra’s scalability is legendary, and that status is being borne out at FamilySearch. “Since the first application,” Creighton says, “we’ve pretty well proven that the claims by DataStax are legit: that it scales more or less linearly with the equipment.”