Why are data scientists tripping over themselves to get their hands on LinkedIn’s data? What’s it like to run one of the world’s biggest social media sites, and how can machine learning algorithms contribute to the creation of economic opportunity for a global workforce? We recently posed those questions (and more!) to Igor Perisic, Vice President of Engineering at LinkedIn.
Alex Woodie: Igor, thank you for agreeing to this interview. First, please tell us about yourself and your role at LinkedIn.
Igor Perisic: I’ve been at LinkedIn for 7.5 years. My educational background is in statistics [Harvard PhD]. At LinkedIn, the easiest way to describe what I’m doing is relevance infrastructure, machine learning, and personalization for the site.
AW: When I make a new connection on LinkedIn, that’s the machine learning working behind the scenes?
IP: By just making a connection, you’re hitting two or three slightly different services behind the scenes. One is the social graph itself. That is one beast to scale from a pure infrastructure perspective. If you recall a company called Friendster–had they been able to scale the social graph itself, the service itself, I believe you could make the argument today that Facebook would not exist. Then “People you may know” is an algorithm that we built… to connect people. That is a service, an algorithm, and a methodology that my team is responsible for.
AW: So you have the graph that powers connections, and the machine learning algorithms run separately.
IP: Correct. On the infrastructure side… you have things like Hadoop, Spark–anything that’s offline distributed computing. You would have things like Kafka to make sure that the tracking events or the impressions are computed correctly, so the algorithms can do the proper discounting and get the right implicit feedback from members. You have a search engine. You have a recommendation engine with different levels of scoring within it. And you have a statistics server. Then you have all the algorithms that can leverage these components.
AW: So your job is to make this stuff work and improve it?
IP: Yes, my team is responsible for getting all of this to work, from the algorithms to the platform. The last part of the question can be interpreted in multiple ways.
AW: What challenges do you face scaling the infrastructure to handle millions of users?
IP: Whenever you have any machine learning or data mining or relevance [systems], the people working on them will at some point tell you that they’re spending a significant amount of time making sure the data is clean: that there’s no error in tracking, that the data was recorded properly, that the schema is well defined, that the data they receive adheres to that schema, and that they understand it.
The reason I joined LinkedIn a long time ago is that normally people in machine learning spend 80 to 90 percent of their time cleaning the data, making sure that it’s in a format on which they can build the algorithms. LinkedIn’s [data] is very structured data and it’s very clean. It’s not 100 percent clean, but it’s very clean in the sense that you’re going to describe yourself very cleanly because it’s your professional profile. Facebook’s data is, quote-unquote, a lot “dirtier”–you’re getting personal with a lot of things, and Twitter is even more so.
Once you have that data clean, which means that the flow of your data through the pipeline is verified and the schema is very well defined, then scaling it nowadays is essentially based on Hadoop. The thing that’s hard to scale is graph computing, because you need to do lots of joins, and lots of joins are hard to do when all the data doesn’t necessarily reside on one box.
After that, the thing I worry about is serving it, because offline you have time, online you don’t. Let’s say you need to do search relevance or ad targeting. The timeframe you have to compute your thing is very small. It can’t take hours, whereas for “People you may know,” we refresh that on a cycle which allows us to take time… One more thing you need to scale is your A/B testing pipeline. When you want to look at different relevance algorithms or different machine learning algorithms, you need to evaluate whether one is doing better or worse than the previous one. To do that, you need to measure it not only on the metric that you track, but with respect to [all of] the metrics your entire company relies on. You don’t want to blindly cannibalize the performance of one product for another.
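[Editor’s sketch: the point about not cannibalizing one product for another can be illustrated with a simple guardrail check. The Python below is a minimal sketch, not LinkedIn’s pipeline. The metric names, the numbers, and the 1 percent tolerance are all made up for illustration, and a real system would also test for statistical significance before acting on a difference.]

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    control: float    # mean value of the metric in the control group
    treatment: float  # mean value of the metric in the treatment group

    @property
    def relative_lift(self) -> float:
        # Relative change of treatment over control, e.g. 0.05 means +5%.
        return (self.treatment - self.control) / self.control


def evaluate_experiment(results, target, max_guardrail_drop=0.01):
    """Return (ship, regressions).

    ship is True only if the target metric improved and no other
    metric fell by more than max_guardrail_drop (hypothetical tolerance).
    """
    by_name = {r.name: r for r in results}
    target_improved = by_name[target].relative_lift > 0
    regressions = [
        r.name
        for r in results
        if r.name != target and r.relative_lift < -max_guardrail_drop
    ]
    return target_improved and not regressions, regressions


# Made-up numbers: the tracked metric improves, but another product's
# metric drops by about 6 percent, so the change should not ship blindly.
results = [
    MetricResult("pymk_invites_sent", control=0.200, treatment=0.212),
    MetricResult("job_views", control=1.50, treatment=1.41),
    MetricResult("feed_sessions", control=3.00, treatment=3.01),
]
ship, regressions = evaluate_experiment(results, target="pymk_invites_sent")
print(ship, regressions)
```

[The design choice here is that the experiment is judged on every metric passed in, not only on the one the experimenting team tracks, which is the behavior the answer above describes.]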
AW: It sounds like it gets complicated in a hurry.
IP: Especially when you have 400 experiments that are running at the same time!
AW: LinkedIn is a professional network. How does that professional nature change what you do, compared to how Facebook or Twitter might handle things?
IP: We always ask ourselves, what value in the professional workplace does a feature bring to the member? For example, sometimes on social sites like Facebook or Twitter, we see games like “test your IQ,” or “what word do you see” or “tell me what number comes next in a series.” These games can be engaging and people like to play with them. But they’re not really that professional.
We want LinkedIn to be the platform where you connect to opportunity, to people, and to knowledge that’s relevant within your professional environment. All of our products and everything we do is oriented towards “LinkedIn’s vision,” which is to create economic opportunity for every member of the global workforce. It’s all in the context of that professional setting that you analyze what you do.
AW: LinkedIn recently launched its Economic Graph Challenge. Can you tell me more about it?
IP: When I started at LinkedIn 7.5 years ago, there was always demand to take a look at our data. It’s natural in the sense that, if you’re in an academic environment, you like to play with algorithms, but you don’t have the data at a significant scale. You could go to Stanford or the University of Michigan or any university and say, “I’m going to build a social network and see how groups are getting formed.” If you have 20,000 students, say maybe 10 percent of them get on the platform, and only 10 percent are active, and only 10 percent contribute [to your study], so that’s 20 students. Suddenly you’re propagating information through a social graph of 20 people, and so it’s not going to be that interesting.
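[Editor’s note: worked out, the funnel in that example is three successive 10 percent cuts applied to 20,000 students:]

```latex
20{,}000 \times 0.10 \times 0.10 \times 0.10 = 20
```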
But at scale it becomes interesting, because at scale it mimics society to some extent. So there was always an interest in getting our data and “playing with it.” At one point in time, we became confident that we had the tools to open up our platform a little bit, and the willingness to put some significant energy into it. We’re also not sharing the data–we’re giving controlled and monitored access to [some of] the data.
It’s an opportunity [to bring in experts] from outside LinkedIn because we believe we don’t have all the questions. Reaching out to the broader community was suddenly an opportunity to just ask the question: If you had our data, what questions would you like to answer in an economic context? What question would you like to tackle to create economic opportunities for the global workforce?
We received about 220 high-quality proposals, which was far more than we were expecting. Recently we selected 11.
AW: What areas will they explore?
IP: They go to a lot of places. Some of the proposals are around just pure text mining on dynamic graphs. Those are from Duke University. Some are around the skills gap, and mining trends around the supply and demand for skills. Some are around career paths. We looked at evaluating projects on three dimensions. First is about the potential impact of your question. Second, do we believe we have the data that could be used to answer that research question? And lastly, do you have the ability to actually execute upon it?
AW: How will the Economic Graph Challenge benefit individuals or a group of people?
IP: It’s something that would certainly help policymakers or mayors or leaders of cities to get better insight into what kind of business they should attract, and then whether or not it’s a good match. There’s an example around the skills gap, identifying what could be good for you or for me–if I want to change my career and do something different or if I wanted to expand my career opportunity into one field, am I well suited for it?
AW: There’s been a run on data scientists recently. Does LinkedIn use its own tools to identify the best data science candidates it would like to hire?
IP: We’re using our tools to find data scientists, of course, but so do a lot of our competitors. LinkedIn generates a lot of its revenue from a recruiting tool we have, which is an enterprise tool. We’ve been fundamentally data driven for a significant amount of time and we’ve been very much in tune with data science from a very early time. But what helps us a little bit is we understand what drives a data scientist across different dimensions. We understand how to give them an environment where they can succeed.
AW: Igor, it was great talking with you. Best of luck to you, and thank you for your time.