Big Data Architecture at LinkedIn
Interview with Siddharth Anand
by Michael Floyd
In this interview at QCon London, LinkedIn's Sid Anand discusses the problems they face when serving high-traffic, high-volume data. Sid explains how they're moving some use cases from Oracle to gain headroom, and lifts the hood on their open source search and data replication projects, including Kafka, Voldemort, Espresso and Databus.
Siddharth "Sid" Anand has deep experience designing and scaling high-traffic web sites. He is a senior member of LinkedIn's Service Infrastructure team. Prior to joining LinkedIn, he served as Netflix's Cloud Database Architect, Etsy's VP of Engineering, a search engineer and researcher at eBay, and a performance engineer at Siebel Systems.
Michael Floyd: I am here with Sid Anand today. Sid has joined recently LinkedIn. How are you doing Sid? So you made the move over to LinkedIn. What is your job title and just tell us a little about the move over to LinkedIn?
Siddharth Anand: Yes, sure. I spent 4 and a half years at Netflix, it's an excellent company to work for, and most of my time there was spent solving big data problems in the online space. I did that for about 3 years as well as working on the Cloud and after 4 and a half years you kind of want the next big challenge and I heard LinkedIn was trying to solve some of the same problems. They were looking a solution, an alternative to Oracle and they had invested in growing a very large team of experts in the area of big data. So I was really keen on working with that team. Also LinkedIn had a history of developing and open-sourcing a lot of their technology, so they have about 10 different types of technologies in search and data replication and data bases and I was interested in working on those areas and definitely working on something that would be open source at some point.
Floyd: So you are not too focused on search, you are more involved in a big data side of the equation.
Anand: Yes, I am more involved in the online serving of high-traffic, high-volume data.
Floyd: And so you kind of characterize the core problem then is LinkedIn is trying to move away from relational databases and away from Oracles. Tell us about the migration path, is it a slow migration or... you're not trying to replace everything, right?
Anand: That is a great question. Typically with products like Oracle you are not trying to replace it, you are trying to buy yourself some head room. So there is a certain amount of traffic the server that you own can handle and when you run of head room you need to invest in a bigger server. So you can keep trying to do that but that is not a winning game. So one alternative is to move some of the very expensive use cases off of it onto something else. We'll never completely get rid of Oracle, but we'll buy ourselves head room so that we can scale off of it.
Floyd: Can you give us an overview of the stack? I know you guys are big Scala users and the data architecture — so the programming architecture and data architecture.
Anand: So the site runs on Java and the website is basically running off of Jetty, so we are not using Tomcat we are using Jetty. And we use Apache traffic server in front of Jetty, so every server effectively is running a Jetty container behind Apache and it runs of Java. We have isolated cases where we use Scala; one of those is the network graph and the other one is Kafka. So it is not widespread, it's for particular applications.
Floyd: One thing that I just wanted clarify for our readers is there are various data models and maybe you can just kind of explain to our audience the features and benefits of document stores and graph database and key value and why you would choose one model over another.
Anand: That is an excellent question too. So one way to understand the problem is you have a lot of data in all of these cases and the more complex the query you are trying to execute the less performant it will be. And also some complex queries are not amendable to sharding or data partitioning. So an example of that is a graph. A graph is difficult to divide or partition and the first thing in anyone's tool chest when they think about scaling is they think about partitioning the dataset. So if I can take a dataset partitioned over a hundred servers, I've just scaled a hundred-x, so I've got a hundred-x more capacity than I had before. Graphs are notoriously difficult to do that for.
So some vendors like, say, Neo4j they provide a solution that runs on one machine because they recognize this is a difficult problem, but for companies like LinkedIn or Facebook where the graphs don't fit on one machine that is not a good solution. For document stores you get a lot of rich semantics like you can do look-ups by arbitrary tree depth and using wildcards and very complex queries. I haven't got a lot of experience using them at high scale and so there's always this question of, since it's doing complex things can it really scale? Because that is a typical tradeoff. They are good for certain things: they are good for storing data in a secondary index. Now key value stores are good for primary storage of data where you access data by primary key. So if I had to put this all together in one system I'd have my core data in a key value store like Casandra, Riak, one of these primary key-value databases that have secondary indexes on say a document store like Couch or Mongo and the graph problem is sort of a separate problem.
The right solution is to have a fairly thin client that is integrated into all of these apps and have most of your logic running in the servers that you own as the infrastructure team.
Floyd: One of the things I heard in your talk today you mentioned that you work with several application teams at LinkedIn. What features are they asking you for as you evaluate this?
Anand: That's another great question. So when you provide one of these services to your client you want them to easily integrate it into your application, so an app team comes and says I want to use Voldemort. Today at LinkedIn Voldemort has a very fat client and there can be lots of problems in performance on one of the applications using Voldemort and then they will think the problem is in the Voldemort client itself because the client is so fat, and doing so much work. And it could be the case sometimes. The right solution is to have a fairly thin client that is integrated into all of these apps and have most of your logic running in the servers that you own as the infrastructure team. So one model that we are going to is infrastructure as a service. Effectively all these infrastructure features should be run as a service and app teams just take a very thin client and use that thin client and they don't really need to know details of how it works.
Floyd: So you mentioned Voldemort not all of our audience may know what that is. Can you just give us an umbrella overview?
Anand: Voldemort is a dynamo-inspired key-value store that was created by Jay Kreps in 2008. It's been used at LinkedIn ever since, it's used for 2 main use cases, one of which is kind of like a key value store replacement for some Oracle use cases. Another one is more interesting, it's as a read-only store, so effectively if you use LinkedIn on your home page you will notice some widgets to the right hand side, one is people you may know, the other one is jobs you may be interested in or events you may be interested in. Those complex recommendations are generated by analytic jobs running on Hadoop and Hadoop generates an output which is actually an index data file for Voldemort and Voldemort loads that Hadoop output and serves live on the site. So we are using the best of worlds. We are using Hadoop to generate complex recommendations but it's not we are serving and we are using Voldemort to serve kind of a static view of these recommendations, because it's great at that.
Floyd: What are some of the other interesting technologies in LinkedIn?
Anand: One of the other interesting technologies is data bus. Databus is a pub subsystem that solves the problem of change capture from relational databases. So imagine that you have all of your primary data with Oracle and you need to send that data to search indexes or graph indexes or to other Oracle replicas or just other intelligent systems in near realtime, and you want that data to be transmitted efficiently in order and also to group changes in transactions. So these are the problems that Databus solves. The unit of transfer or basically the message is effectively a transaction, so as a transaction executes in Oracle it is sent over to Databus relay and that relay can serve all the listeners all the subscribers very quickly with this change notifications. Another very unique feature of Databus is that unlike other pub/sub messaging systems it supports arbitrary long look-backs into the past. So when a new system comes online that didn't know about any of the changes that have happened over the last 6 months, what is it supposed to do? Databus also solves that problem. It will bootstrap that new server and then it will seamlessly hook it into the event stream so it's up to date and that server would never have a need to know about it.
Floyd: So this isn't something that other databases support.
Anand: Right. So this isn't something that has been solved yet and it's definitely not something that has been open source. I worked on a very similar system at Netflix, I didn't open source it and it wasn't this complicated actually. What it did do was it did transfer data in order reliably, but it didn't group them in transactions and it didn't support arbitrary long look-backs. Typically when you solve this problem of arbitrary long look-backs it actually taxes the source and that is kind of how I sort of solved it at Netflix but the way Databus does it is itself it will basically handle arbitrary long look-backs and it won't affect your source database at all.
Floyd: So just switching gears here you had talked about Kafka and I was wondering how that's being used and what is the kind of use case for Kafka?
Anand: Kafka is another pub/sub system, unlike Databus it delivers out of order. It is a really cool combination of design concepts similar to Dynamo, how Dynamo took a lot of the best-of-breed design concepts and put them together in one system, Kafka is similar in that respect. It's being used internally for high-traffic replication needs. There are some things that it doesn't do but it will do. So for example we use it to capture all the user events, this is called web tracking. So as a user if you start a search on LinkedIn or if you add a connection or if you accept a connection or if you view a page or if you add a job or update your profile in some way, all of these things are captured as events and shipped through Kafka to some BI system. Similarly we use Kafka for capturing all operational metrics on our fleet of a thousand servers and those metrics include CPU and TCP reconnects and resets and that sort of stuff. That is delivered in near real-time to monitoring systems that our Ops folks look at. So it needs to be pretty reliable. What does Kafka not do today? it's not fault tolerant, so if we lose a node, like a broker, that data is lost so number one on the roadmap for Kafka is supporting fault-tolerant replicas and that is something we should have fairly soon.
Floyd: You were just talking about working with a thousand nodes. So there is huge issue with scaling up the hardware there. What are you dealing with in terms of scaling up and how are you dealing with it?
Anand: Sure that is a great question. Oracle for example for writes, sometimes it's not possible to partition the data set and go on to multiple nodes. Sometimes all you can do is add data hardware and we are developing a new system called espresso which at its core supports things like data partitioning and boot strapping. Basically you've got some of the benefits of things like Dynamo. It can grow to handle higher traffic, it can serve some complex queries because it supports secondary indexing, currently which is Lucene-based, but it may be something different in the future. It also generates a change stream like Oracle does so we can basically generate a set of changes that goes downstream to the search index or graph index and BI. So it's going to be kind of a plugin replacement for some use cases that Oracle is used for. And again it's mostly about buying head room. We'll still keep Oracle around, but some of the big use cases will get moved off.
Floyd: So that is a good point. No one is going to be replacing their Oracle databases anytime soon, is that a fair statement?
Anand: It's kind of like in our ROI question, the amount of effort required versus... so there is this bang for your buck, at Netflix for example we move our biggest use cases, which we call the web scale use cases off of Oracle, but then there were other use cases that were not web scale and the effort involved in moving those was probably not worth the benefit, so we just left them on Oracle. In the future there might be some effort, but I don't really see that happening, then there is this other question about when you actually invest in Oracle database. If you save like 60% of a traffic, you've taken 60% of that traffic and moved it off of Oracle you'll end up still paying the same fee to Oracle. So you actually don't save much money, so that is kind of something you don't expect when moving things off of Oracle. So similarly at LinkedIn there are some use cases we've identified that if we move off of Oracle we will save a significant amount if CPU head room and resources and one of those is just like use for mail, like user to user mailing, that is currently a really big consumer on Oracle resources.
Floyd: So finally a lot of the technologies we've covered a lot of ground actually today are like Voldemort, Databus, espresso, some of these are open source, others aren't. Espresso you've mentioned is not open source at this point. What are the plans for that?
Anand: So currently Kafka and Voldemort are open source, Databus and espresso are not. Databus will likely be open source this year if all goes well. Espresso still needs to be rolled out internally, it's still under development, it might be a little longer before it's open source.
This interview originally appeared on InfoQ.com (Information Queue), an independent online community focused on change and innovation in enterprise software development and targeted primarily at the technical architect, technical team lead (senior developer), and project manager. InfoQ serves the Java, .NET, Ruby, SOA, and Agile communities with daily news written by domain experts, articles, video interviews, video conference presentations, and mini-books.