How LendingTree Sumo-Sized Its IT Monitoring
As an online lending marketplace, LendingTree is totally dependent on the Internet. So when one of the servers that funnels sales leads into the company went down a couple of years ago, the company had a serious problem on its hands. To make matters worse, a lack of monitoring meant LendingTree’s IT department wasn’t aware of the issue initially. When it was finally discovered, the company took steps to make sure it would never happen again.
It was a cascade of errors, as computing problems like this are prone to be. A server with a full hard drive was used for one half of a two-way Microsoft BizTalk Server cluster. The redundancy that was supposed to be there failed, and as a result, one-quarter of the publicly traded company’s leads were essentially blowing right out the digital window.
A general lack of situational awareness on the part of LendingTree’s IT department made the problem worse. The company lacked any type of centralized logging that could have detected the failed BizTalk server and alerted the IT department to the problem in a timely manner.
“We had some monitoring on the database that said we’re not getting leads,” said Jeremy Proffitt, who worked in LendingTree’s IT department. “We didn’t know why. We didn’t know what was wrong.”
$6,000 Per Minute
Proffitt’s first task was to provide some visibility into the systems responsible for ingesting the loan inquiry forms coming in over the Internet, processing the information, and then sending the leads out to lenders that would fulfill the loans — a process that Proffitt likens to a “steam plant.” For each minute of downtime, the company estimates, it loses up to $6,000 in revenue.
The company had a small subscription to a log monitoring application from a company called Sumo Logic. Proffitt didn’t know much about it, but decided to see if he could make it work. If he could just manage to centralize the storage of logs from BizTalk and IIS Web servers, then at least he could see if any of those servers were down. Dead servers, after all, generate no logs, and if they’re generating error messages, that’s even better (or worse, depending on your point of view).
“We said, let’s make a full court press into Sumo. All I did for months was write Sumo queries and alerts, and managing and monitoring,” said Proffitt, who was promoted to staff site reliability engineer (SRE) because of his work with Sumo. “My job was just to go through each server, make sure the logs were coming in, make sure they’re working like they should.”
As he worked through the system, he found more issues, including with timestamps (some were on UTC while others were EST). He kept adding alerts using Sumo Logic’s software, sometimes by using a pre-built Sumo connector, sometimes by adding a string of Sumo-supplied code to Linux servers, and sometimes just by sending syslog statements to a URL (Sumo Logic is a hosted service).
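The timestamp mismatch is the kind of problem that is usually handled by normalizing every log time to UTC before comparing events across servers. The sketch below is illustrative only and is not LendingTree's or Sumo Logic's code. The function name `to_utc`, the timestamp format, and the fixed UTC-5 offset for EST (which ignores daylight saving time) are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "EST" means a fixed UTC-5 offset; daylight saving time is ignored.
EST = timezone(timedelta(hours=-5), "EST")


def to_utc(stamp: str, source_tz: timezone) -> datetime:
    """Parse a naive 'YYYY-MM-DD HH:MM:SS' stamp recorded in source_tz and return it in UTC."""
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    # Attach the zone the server was actually using, then convert to UTC.
    return naive.replace(tzinfo=source_tz).astimezone(timezone.utc)


# The same instant, logged by a server on EST and by a server on UTC.
from_est_server = to_utc("2019-09-17 09:30:00", EST)
from_utc_server = to_utc("2019-09-17 14:30:00", timezone.utc)
print(from_est_server == from_utc_server)
```

Once both stamps are in UTC, events from different servers can be ordered on a single timeline.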
“It was just writing out basic alerts,” Proffitt said. “For example, if there are more than this many errors, email somebody. Or there should be four servers running this software. If only three are sending logs, it’s a problem, obviously. Stupid, simple checks. It really is not that difficult.”
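The two checks Proffitt describes (too many errors, or too few servers reporting) can be sketched in a few lines of Python. This is a hypothetical illustration, not Sumo Logic query syntax and not LendingTree's actual alerts. The record layout (`host` and `level` keys), the function name `check_logs`, and the thresholds are assumptions made for the example.

```python
def check_logs(records, expected_hosts, max_errors):
    """Return alert messages for one time window of log records.

    records: list of dicts with hypothetical 'host' and 'level' keys.
    expected_hosts: hosts that should all be sending logs.
    max_errors: alert when the error count exceeds this number.
    """
    alerts = []

    # Check 1: every expected server should have sent at least one log line.
    expected = set(expected_hosts)
    reporting = {r["host"] for r in records} & expected
    missing = sorted(expected - reporting)
    if missing:
        alerts.append(
            f"{len(reporting)} of {len(expected)} servers are sending logs; "
            f"silent: {', '.join(missing)}"
        )

    # Check 2: the number of error lines should stay under the limit.
    errors = sum(1 for r in records if r["level"] == "ERROR")
    if errors > max_errors:
        alerts.append(f"{errors} errors in window (limit {max_errors})")

    return alerts


window = [
    {"host": "web1", "level": "ERROR"},
    {"host": "web1", "level": "ERROR"},
    {"host": "web1", "level": "ERROR"},
    {"host": "web2", "level": "INFO"},
    {"host": "web3", "level": "INFO"},
]
for message in check_logs(window, ["web1", "web2", "web3", "web4"], max_errors=2):
    print(message)
```

In a real deployment the returned messages would be routed to email or a pager. Here they are simply printed.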
Manually monitoring computer, network, and application logs in the modern age is painful and often fruitless, which is why companies like Sumo Logic, Elastic, Splunk, and others that bring intelligence to large amounts of log data are getting good traction. Sumo Logic competes fiercely with those firms, and touts the scalability and ease-of-use of its cloud-hosted service as its key competitive differentiators.
What Proffitt wants most from Sumo Logic is to be the first person to know if one of LendingTree’s servers or services starts sputtering. He is generally that person today, but it wasn’t always the case.
“There were days when a [chief product officer] would say ‘My service is down. I can’t log in. What’s going on?’ Or the call center would call and say ‘We’re getting calls from people. The website’s not working,’” he told Datanami during last week’s Sumo Logic Illuminate 2019 conference held in Burlingame, California. “It was rarely us that knew anything was down.”
Today about 20 engineers and operators at the company use Sumo Logic dashboards on a daily basis. A series of 50-inch monitors in the company’s Charlotte, North Carolina headquarters keeps Proffitt and his team fully aware of the status of all critical systems. The company has also started using Sumo Logic to monitor other aspects of the business, besides the CPU, storage, and memory utilization of servers.
“It’s a lot different now,” he says. “We’re looking at key points in our revenue stream. Because uptime is uptime, but at the end of the day, you have to have revenue to have a paycheck.”
‘It Can’t Be This Easy’
When “Hell Week” happened two-and-a-half years ago, LendingTree was mostly on-prem, and operated about 200 servers. Since then, the growing company has migrated nearly all of its servers to the public cloud, specifically AWS. About 10% of its 200-odd services today run as Kubernetes microservices, and others run as serverless AWS Lambda functions, Proffitt said. Getting visibility into these containerized and virtual services can be difficult, but Sumo Logic handles it all.
Not only does Sumo Logic keep Proffitt on top of LendingTree’s IT and business functions, but it keeps the company aware of the state of its partners’ systems, too.
For example, when LendingTree detected an increase in response time for a service from one of the three major credit bureaus, it alerted the bureau, which started looking into the issue. The bureau elected to fail over its service to its backup data center, and a few minutes later, its original service failed completely. LendingTree had detected the deteriorating condition before the credit bureau had, and helped to avert any disruption to the service.
Sumo Logic also provides LendingTree with visibility into important operational metrics, such as database operations per minute, that AWS does not provide. And while LendingTree has a separate security group with its own security information and event management (SIEM) software, Sumo once enabled Proffitt’s team to detect a brute-force attack before the security team picked it up.
LendingTree ingests about 120GB of log data daily into its Sumo Logic environment, most of which the company keeps for 30 days. Managing this volume of data isn’t simple, which Sumo Logic says is one reason that its cloud service has grown.
Having flexibility is important, and when one of LendingTree’s departments requested keeping log data for six months for legal reasons, Proffitt already had a solution in mind.
“All I did was go into Sumo, configure an index for 90 days retention, hand them a URL, and said ‘Post your data here.’ All he has to do is insert a couple lines of code,” he says. “We don’t want a legal challenge, but the last thing you want to do is spend 5,000 hours of manpower to get the data.”
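The “couple lines of code” Proffitt mentions amount to an HTTP POST of log lines to a collector URL. The sketch below shows the general idea using only Python's standard library. The URL is a made-up placeholder rather than a real Sumo Logic endpoint, and the function names, plain-text body format, and header are assumptions for illustration, not the department's actual code.

```python
import urllib.request

# Placeholder only: a real collector URL would be issued by the logging service.
COLLECTOR_URL = "https://collector.example.com/receiver/v1/http/PLACEHOLDER_TOKEN"


def build_log_request(url, lines):
    """Build (but do not send) a POST request carrying one log line per row."""
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain"},
    )


def send_logs(url, lines, timeout=10):
    """Send the log lines and return the HTTP status code."""
    with urllib.request.urlopen(build_log_request(url, lines), timeout=timeout) as resp:
        return resp.status


# Building the request needs no network access; sending requires a real URL.
request = build_log_request(COLLECTOR_URL, ["order 42 archived", "order 43 archived"])
print(request.get_method(), len(request.data), "bytes")
```

Retention would be configured on the service side, as Proffitt describes, so the sending application only needs the URL.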
Similarly, when the company recently acquired Quote Wizard, there was some concern about visibility into its systems because the company ran its services in Microsoft Azure, while LendingTree is an AWS shop. It turned out those concerns were ill-founded, and Sumo was the reason.
“We hadn’t seen their stack. We didn’t know what it was. But we already had a solution for it because Sumo has already thought of it,” Proffitt says. “I just turned to the guys from Quote Wizard and sent them a link to the GitHub site and said ‘Look, this is all we have to do.’ They looked at me and said, ‘No it can’t be this easy.’”