Hadoop Security

Early versions of Hadoop assumed that HDFS and MapReduce clusters would be used by a group of cooperating users within a secure environment. The measures for restricting access were designed to prevent accidental data loss, rather than to prevent unauthorized access to data. For example, the file permissions system in HDFS prevents one user from accidentally wiping out the whole filesystem from a bug in a program, or by mistakenly typing hadoop fs -rmr /, but it doesn’t prevent a malicious user from assuming root’s identity (see “Setting User Identity”) to access or delete any data in the cluster.

In security parlance, what was missing was a secure authentication mechanism to assure Hadoop that the user seeking to perform an operation on the cluster is who they claim to be and therefore trusted. HDFS file permissions provide only a mechanism for authorization, which controls what a particular user can do to a particular file. For example, a file may only be readable by a group of users, so anyone not in that group is not authorized to read it. However, authorization is not enough by itself, since the system is still open to abuse via spoofing by a malicious user who can gain network access to the cluster.

It’s common to restrict access to data that contains personally identifiable information (such as an end user’s full name or IP address) to a small set of users (of the cluster) within the organization, who are authorized to access such information. Less sensitive (or anonymized) data may be made available to a larger set of users. It is convenient to host a mix of datasets with different security levels on the same cluster (not least because it means the datasets with lower security levels can be shared). However, to meet regulatory requirements for data protection, secure authentication must be in place for shared clusters.

This is the situation that Yahoo! faced in 2009, which led a team of engineers there to implement secure authentication for Hadoop. In their design, Hadoop itself does not manage user credentials; instead, it relies on Kerberos, a mature open-source network authentication protocol, to authenticate the user. In turn, Kerberos doesn’t manage permissions: Kerberos says that a user is who they say they are; it’s Hadoop’s job to determine whether that user has permission to perform a given action. There’s a lot to Kerberos, so here we only cover enough to use it in the context of Hadoop, referring readers who want more background to Kerberos: The Definitive Guide by Jason Garman (O’Reilly, 2003).

Which Versions of Hadoop Support Kerberos Authentication?

Kerberos for authentication was added after the 0.20 series of releases of Apache Hadoop. However, it was not complete at the time of the first 0.21 release, so it will not be generally available and stable until the 0.22 release series. Alternatively, you can find it in the 0.20.S Yahoo! Distribution of Hadoop. The same security support will also be available in Cloudera’s first stable CDH3 release.

Kerberos and Hadoop

At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server:

  1. Authentication. The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT).
  2. Authorization. The client uses the TGT to request a service ticket from the Ticket Granting Server.
  3. Service Request. The client uses the service ticket to authenticate itself to the server that is providing the service the client is using. In the case of Hadoop, this might be the namenode or the jobtracker.

Together, the Authentication Server and the Ticket Granting Server form the Key Distribution Center (KDC). The process is shown graphically in the figure below.

[Figure: The three-step Kerberos ticket exchange protocol]

The authorization and service request steps are not user-level actions: the client performs them on the user’s behalf. The authentication step, however, is normally carried out explicitly by the user, using the kinit command, which will prompt for a password. This doesn’t mean you need to enter your password every time you run a job or access HDFS, though, since TGTs last for 10 hours by default (and can be renewed for up to a week). It’s common to automate authentication at operating system login time, thereby providing single sign-on to Hadoop.
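
For example, an interactive sign-on and a check of the resulting ticket might look like the following sketch (the principal, realm, and ticket cache path are placeholders, and klist output varies slightly between Kerberos implementations):

    % kinit
    Password for hdfs-user@EXAMPLE.COM:
    % klist
    Ticket cache: FILE:/tmp/krb5cc_1001
    Default principal: hdfs-user@EXAMPLE.COM

    Valid starting     Expires            Service principal
    07/03/10 15:44:00  07/04/10 01:44:00  krbtgt/EXAMPLE.COM@EXAMPLE.COM
            renew until 07/10/10 15:44:00

The 10-hour lifetime and one-week renewal limit shown here correspond to the defaults described above.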

In cases where you don’t want to be prompted for a password (for running an unattended MapReduce job, for example), you can create a Kerberos keytab file using the ktutil command. A keytab is a file that stores passwords and may be supplied to kinit with the -t option.
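
As a sketch, creating a keytab with MIT Kerberos ktutil and then using it for an unattended sign-on might look like this (the principal, encryption type, and filenames are illustrative, and option details differ between Kerberos implementations):

    % ktutil
    ktutil:  addent -password -p hdfs-user@EXAMPLE.COM -k 1 -e aes256-cts
    Password for hdfs-user@EXAMPLE.COM:
    ktutil:  wkt hdfs-user.keytab
    ktutil:  quit
    # MIT kinit needs -k to select keytab authentication; some implementations accept -t alone
    % kinit -k -t hdfs-user.keytab hdfs-user@EXAMPLE.COM

Because the keytab contains credential material, it should be readable only by the user it belongs to.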

An example

Let’s look at an example of the process in action. The first step is to enable Kerberos authentication by setting the hadoop.security.authentication property in core-site.xml to kerberos. The default setting is simple, which signifies that the old backwards-compatible (but insecure) behavior of using the operating system user name to determine identity should be employed.
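
The corresponding core-site.xml entry is a minimal one:

    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>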

We also need to enable service-level authorization by setting hadoop.security.authorization to true in the same file. You may configure Access Control Lists (ACLs) in the hadoop-policy.xml configuration file to control which users and groups have permission to connect to each Hadoop service. Services are defined at the protocol level, so there are ones for MapReduce job submission, namenode communication, and so on. By default, all ACLs are set to *, which means that all users have permission to access each service, but on a real cluster you should lock the ACLs down to only those users and groups that should have access.

The format for an ACL is a comma-separated list of usernames, followed by whitespace, followed by a comma-separated list of group names. For example, the ACL preston,howard directors,inventors would authorize access to users named preston or howard, or in groups directors or inventors.
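
For example, restricting MapReduce job submission to those users and groups might look like the following hadoop-policy.xml entry (the property name shown is the job submission ACL; check hadoop-policy.xml for the exact names in your version):

    <property>
      <name>security.job.submission.protocol.acl</name>
      <value>preston,howard directors,inventors</value>
    </property>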

With Kerberos authentication turned on, let’s see what happens when we try to copy a local file to HDFS:
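
For example, the attempted copy might be an ordinary hadoop fs -put (the filename here is illustrative):

    % hadoop fs -put example.txt .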

    10/07/03 15:44:58 WARN ipc.Client: Exception encountered while connecting to the server: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:8020 failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

The operation fails, since we don’t have a Kerberos ticket. We can get one by authenticating to the KDC, using kinit.

Note: To use Kerberos authentication with Hadoop, you need to install, configure, and run a KDC (Hadoop does not come with one). Your organization may already have a KDC you can use (an Active Directory installation, for example); if not, you can set up an MIT Kerberos 5 KDC using the instructions in the Linux Security Cookbook (O’Reilly, 2003). For getting started with Hadoop security, consider using Yahoo!’s Hadoop 0.20.S Virtual Machine Appliance, which includes a local KDC as well as a pseudo-distributed Hadoop cluster.
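
Assuming the KDC administrator has created a principal for us, the sign-on and retry might look like this sketch (the principal, realm, and filename are illustrative):

    % kinit
    Password for hdfs-user@EXAMPLE.COM:
    % hadoop fs -put example.txt .
    % hadoop fs -ls .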

And we see that the file is successfully written to HDFS. Notice that even though we carried out two filesystem commands, we only needed to call kinit once, since the Kerberos ticket is valid for 10 hours (use the klist command to see the expiry time of your tickets and kdestroy to invalidate your tickets). After we get a ticket, everything works just as normal.

Delegation Tokens

In a distributed system like HDFS or MapReduce, there are many client-server interactions, each of which must be authenticated. For example, an HDFS read operation will involve multiple calls to the namenode and calls to one or more datanodes. Instead of using the three-step Kerberos ticket exchange protocol to authenticate each call, which would present a high load on the KDC on a busy cluster, Hadoop uses delegation tokens to allow later authenticated access without having to contact the KDC again.

Delegation tokens are created and used transparently by Hadoop on behalf of users, so there’s no action you need to take as a user beyond using kinit to sign in; however, it’s useful to have a basic idea of how they are used.

A delegation token is generated by the server (the namenode in this case), and can be thought of as a shared secret between the client and the server. On the first RPC call to the namenode, the client has no delegation token, so it uses Kerberos to authenticate, and as a part of the response it gets a delegation token from the namenode. In subsequent calls, it presents the delegation token, which the namenode can verify (since it generated it using a secret key), and hence the client is authenticated to the server.

When it wants to perform operations on HDFS blocks, the client uses a special kind of delegation token, called a block access token, that the namenode passes to the client in response to a metadata request. The client uses the block access token to authenticate itself to datanodes. This is possible only because the namenode shares its secret key used to generate the block access token with datanodes (which it sends in heartbeat messages), so that they can verify block access tokens. Thus, an HDFS block may only be accessed by a client with a valid block access token from a namenode. This closes the security hole in unsecured Hadoop where only the block ID was needed to gain access to a block. This property is enabled by setting dfs.block.access.token.enable to true.
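
A minimal sketch of the corresponding hdfs-site.xml entry:

    <property>
      <name>dfs.block.access.token.enable</name>
      <value>true</value>
    </property>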

In MapReduce, job resources and metadata (such as JAR files, input splits, configuration files) are shared in HDFS for the jobtracker to access, and user code runs on the tasktrackers and accesses files on HDFS (the process is explained in “Anatomy of a MapReduce Job Run” ). Delegation tokens are used by the jobtracker and tasktrackers to access HDFS during the course of the job. When the job has finished, the delegation tokens are invalidated.

Delegation tokens are automatically obtained for the default HDFS instance, but if your job needs to access other HDFS clusters, then you can have the delegation tokens for these loaded by setting the mapreduce.job.hdfs-servers job property to a comma-separated list of HDFS URIs.
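
For example, assuming the job’s driver uses ToolRunner (so that generic -D options are picked up), the property could be supplied at submission time like this (the jar, class, and URIs are placeholders):

    % hadoop jar my-job.jar com.example.MyDriver \
        -D mapreduce.job.hdfs-servers=hdfs://namenode1.example.com/,hdfs://namenode2.example.com/ \
        input output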

Other Security Enhancements

Security has been tightened throughout HDFS and MapReduce to protect against unauthorized access to resources. (At the time of writing, other projects like HBase and Hive had not been integrated with this security model.) The more notable changes are listed here:

  • Tasks can be run using the operating system account for the user who submitted the job, rather than the user running the tasktracker. This means that the operating system is used to isolate running tasks, so they can’t send signals to each other (to kill another user’s tasks, for example), and so local information, such as task data, is kept private via local file system permissions. This feature is enabled by setting mapred.task.tracker.task-controller to org.apache.hadoop.mapred.LinuxTaskController (see the configuration sketch at the end of this section). LinuxTaskController uses a setuid executable called task-controller, found in the bin directory; you should ensure that this binary is owned by root and has the setuid bit set (with chmod +s). In addition, administrators need to ensure that each user is given an account on every node in the cluster (typically using LDAP).

  • When tasks are run as the user who submitted the job, the distributed cache (“Distributed Cache” ) is secure: files that are world-readable are put in a shared cache (the insecure default), otherwise they go in a private cache, only readable by the owner.

  • Users can view and modify only their own jobs, not those of other users. This is enabled by setting mapred.acls.enabled to true. There are two job configuration properties, mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job, which may be set to a comma-separated list of users to control who may view or modify a particular job.

  • The shuffle is secure, preventing a malicious user from requesting another user’s map outputs.

  • When appropriately configured, it’s no longer possible for a malicious user to run a rogue secondary namenode, datanode, or tasktracker that can join the cluster and potentially compromise data stored in the cluster. This is enforced by requiring daemons to authenticate with the master node they are connecting to. To enable this feature, you first need to configure Hadoop to use a keytab previously generated with the ktutil command. For a datanode, for example, you would set the dfs.datanode.keytab.file property to the keytab filename and dfs.datanode.kerberos.principal to the username to use for the datanode. Finally, the ACL for the DataNodeProtocol (which is used by datanodes to communicate with the namenode) must be set in hadoop-policy.xml, by restricting security.datanode.protocol.acl to the datanode’s username.

  • A datanode may be run on a privileged port (one lower than 1024), so a client may be reasonably sure that it was started securely.

  • A task may only communicate with its parent tasktracker, thus preventing an attacker from obtaining MapReduce data from another user’s job.

One area that hasn’t yet been addressed in the security work is encryption: neither RPC nor block transfers are encrypted. HDFS blocks are not stored in an encrypted form either. These features are planned for a future release; in fact, encrypting the data stored in HDFS could be carried out in existing versions of Hadoop by the application itself (by writing an encryption CompressionCodec, for example).
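
To pull together the properties mentioned in the list above, a configuration sketch might look like the following; the property names are the ones cited in this section, while the keytab path, principal, and usernames are placeholders:

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.task.tracker.task-controller</name>
      <value>org.apache.hadoop.mapred.LinuxTaskController</value>
    </property>
    <property>
      <name>mapred.acls.enabled</name>
      <value>true</value>
    </property>

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.datanode.keytab.file</name>
      <value>/etc/hadoop/conf/datanode.keytab</value>
    </property>
    <property>
      <name>dfs.datanode.kerberos.principal</name>
      <value>datanode@EXAMPLE.COM</value> <!-- placeholder principal -->
    </property>

    <!-- hadoop-policy.xml -->
    <property>
      <name>security.datanode.protocol.acl</name>
      <value>datanode</value> <!-- placeholder datanode username -->
    </property>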

