June 10, 2020

LinkedIn Open Sources Kube2Hadoop

Hadoop and Kubernetes have fundamentally different ways of authenticating users, exposing a security gap for organizations that want to access HDFS data from Kubernetes-based applications. Thanks to the new Kube2Hadoop tool that was released as open source by LinkedIn today, closing that security gap gets a little easier.

LinkedIn runs one of the biggest traditional YARN-based Hadoop clusters in the world, with more than 4,500 users and over 500 PB of data. Over the years, the Microsoft-owned business has adopted other frameworks for its AI workloads, including Kubernetes, which started out as a home for its Jupyter notebooks but has since spread.

The company doesn’t seem to be interested in moving its Hadoop assets away from YARN and adopting Kubernetes, which has become an increasingly popular scheduler for cloud-based Hadoop and Spark deployments. That means that its traditional (i.e. YARN- and HDFS-based) Hadoop environment would need to live in harmony with new applications that LinkedIn is running on Kubernetes.

However, there’s a problem.

“By default, there is a gap between the security model of Kubernetes and Hadoop,” write LinkedIn co-authors Cong Gu, Abin Shahab, Chen Qiang, and Keqiu Hu in a blog post entitled “Open sourcing Kube2Hadoop: Secure access to HDFS from Kubernetes.”

Hadoop uses Kerberos, a three-party protocol built on symmetric key cryptography, the LinkedIn authors explain. To avoid frequent authentication checks, the Hadoop community introduced a lightweight two-party authentication method called delegation tokens to complement Kerberos authentication. These delegation tokens have a lifespan of one day and can be renewed for up to seven days, they say.
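To make the one-day and seven-day figures concrete, here is a minimal Python sketch of a token's lifetime. It is an illustrative model only, not Hadoop's implementation: the `DelegationToken` class and its methods are hypothetical names, and the rule that a renewal extends the expiry by one more day, capped at seven days from issue, is an assumption based on the figures quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TOKEN_LIFESPAN = timedelta(days=1)  # per the article: one-day lifespan
MAX_RENEWABLE = timedelta(days=7)   # per the article: renewable for up to seven days


@dataclass
class DelegationToken:
    """Hypothetical model of a delegation token's validity window."""
    owner: str
    issued_at: datetime
    expires_at: datetime

    @classmethod
    def issue(cls, owner: str, now: datetime) -> "DelegationToken":
        return cls(owner, now, now + TOKEN_LIFESPAN)

    @property
    def max_date(self) -> datetime:
        # Hard ceiling: no renewal can push the expiry past this point.
        return self.issued_at + MAX_RENEWABLE

    def is_valid(self, now: datetime) -> bool:
        return now < self.expires_at

    def renew(self, now: datetime) -> None:
        # Assumption: an expired token cannot be renewed.
        if not self.is_valid(now):
            raise ValueError("token has expired and cannot be renewed")
        self.expires_at = min(now + TOKEN_LIFESPAN, self.max_date)
```

Under this model, a job that renews its token every 23 hours keeps access until the seventh day, after which a fresh token has to be issued.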

“Kubernetes, on the other hand, uses a certificate-based approach for authentication, and does not expose the owner of a job in any of its public-facing APIs,” the LinkedIn authors write. “Therefore, it is not possible to securely determine the authorized user from within the pod using the native Kubernetes API and then use that username to fetch the Hadoop delegation token for HDFS access.”

The workflow of LinkedIn’s Kube2Hadoop tool

To bridge this security gap, LinkedIn developed a piece of technology called Kube2Hadoop that integrates Kubernetes’ authentication method with Hadoop delegation tokens. Kube2Hadoop does this in a manner that respects the fine-grained role-based access control (RBAC) that LinkedIn has implemented to control user access to its extensive Hadoop resources.

Kube2Hadoop consists of three components:

  1. A Kubernetes-resident Hadoop token service for fetching delegation tokens;
  2. An init container in each worker pod that functions as a client for sending requests to fetch a delegation token from the Hadoop token service;
  3. An IDDecorator, deployed as a Kubernetes admission controller, that writes an authenticated user ID.

Kube2Hadoop allows people who are working in a Kubernetes environment, such as data scientists developing machine learning algorithms in Jupyter notebooks, to access data from HDFS without compromising security. Here’s how it works:

The process starts when a Hadoop user logs into a Hadoop gateway using their existing Hadoop password. The user receives a credential from the client authentication service and uses it to submit a job from the Hadoop gateway to the Kubernetes cluster. The Kubernetes cluster then authenticates the user with the certificate and launches the container as requested.

The Kube2Hadoop init containers, which are attached to each of the worker containers that require HDFS access, then send requests to the Hadoop token service for the delegation token. When returned, the token is then mounted in the container, providing the user with secure access to data residing in HDFS. A watch job cancels the token when the job is done, and renews tokens for long-running jobs, LinkedIn says.
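The hand-off between the init container and the token service can be pictured with a small in-memory Python simulation. This is a sketch under stated assumptions, not Kube2Hadoop's actual code: `TokenService`, `init_container`, the `POD_USER_IDS` mapping (standing in for the user ID written by the IDDecorator), the lookup by pod name, and the token file name are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the authenticated user IDs that the
# IDDecorator admission controller records for each pod.
POD_USER_IDS = {"trainer-0": "alice"}


class TokenService:
    """Toy token service: resolves the pod's authenticated user itself
    rather than trusting a username supplied by the caller."""

    def __init__(self, pod_user_ids):
        self._pod_user_ids = pod_user_ids

    def fetch_delegation_token(self, pod_name: str) -> bytes:
        user = self._pod_user_ids.get(pod_name)
        if user is None:
            raise PermissionError(f"no authenticated user recorded for {pod_name}")
        # A real service would obtain a delegation token from Hadoop here.
        return f"delegation-token-for-{user}".encode()


def init_container(pod_name: str, service: TokenService, mount_dir: str) -> Path:
    """Request a token and write it where the worker container can read it."""
    token = service.fetch_delegation_token(pod_name)
    token_path = Path(mount_dir) / "hdfs-delegation-token"
    token_path.write_bytes(token)
    return token_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as mount_dir:
        path = init_container("trainer-0", TokenService(POD_USER_IDS), mount_dir)
        print(path.read_bytes())
```

The point of the sketch is that the caller never names the user: the service derives the identity from metadata the pod cannot set for itself.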

LinkedIn says the Kube2Hadoop authentication mechanism is resistant to specific attacks, including the use of fake user names in job annotations and Kubernetes pods. However, to prevent a malicious Kubernetes administrator from gaining unrestricted access to HDFS, LinkedIn recommends separating the Hadoop token service, which contains the superuser keytab, out of the Kubernetes platform. “We also suggest blacklisting user/group accounts in Kube2Hadoop that have superuser access to HDFS,” the co-authors write.
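The blacklisting suggestion amounts to a check made before any token is issued. A minimal Python sketch follows, assuming a hypothetical `may_issue_token` helper; the blocked account names are example values chosen for illustration, not settings taken from Kube2Hadoop.

```python
# Example values only: accounts an operator might treat as HDFS superusers.
BLOCKED_USERS = frozenset({"hdfs"})
BLOCKED_GROUPS = frozenset({"supergroup"})


def may_issue_token(user: str, groups: set) -> bool:
    """Return False if the user, or any group the user belongs to, is blocked."""
    return user not in BLOCKED_USERS and BLOCKED_GROUPS.isdisjoint(groups)
```

With these example values, a request on behalf of `hdfs`, or of any member of `supergroup`, would be refused even if the rest of the authentication chain succeeded.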

For more information, check out the blog post that describes the Kube2Hadoop tool in greater detail. The Kube2Hadoop tool can be downloaded from LinkedIn’s GitHub repository.
