Up and Running with Secure Hadoop

July 21, 2010
By Gary Helmling

The development of security controls on top of Hadoop (see HADOOP-4487) is opening up new possibilities for the full software stack, including HBase as it’s adapted to take advantage of them. As the first step towards a full “secure HBase” implementation with role-based access control, I’ve been familiarizing myself with the security changes, and patching HBase to run with them. This first post in a series will focus on the fundamentals of installing and configuring a secure Hadoop test instance. The next post, to follow soon after, will take a closer look at the HBase changes required and plans for the full secure HBase implementation.

While security development has targeted the trunk of Hadoop, most people tend to be running the Hadoop 0.20 releases on any cluster they actually care about, which makes these features a bit hard to test out. Enter the recent release of a 0.20-based security branch by Yahoo!, which greatly eases the pain of setting up a running test environment. Along with the work by Cloudera to incorporate security into their supported distribution (CDH3), this goes a long way towards getting security up on production clusters. In this post I’ll run through setting up the prerequisites and performing the configuration needed for a secure Hadoop environment, to illustrate the changes. If you don’t care about the details and just want to try it out, Yahoo! has also made available a self-contained VMware image with secure Hadoop already installed and pre-configured. For the gory details, read on.

First of all, for full disclosure, I’m not affiliated with Yahoo! and have not participated in developing the secure Hadoop features. These notes are based on my own experience getting secure Hadoop up and running, but any inaccuracies are purely my own.

Why would you want to run secure Hadoop? In Hadoop 0.20 and prior, a client’s identity was passed as additional fields in the RPC calls to Hadoop services. Since the RPC calls themselves were not authenticated, any client with RPC access to the cluster could easily impersonate any user by either crafting a specific request or simply creating a local user account with the same name. Secure Hadoop uses Kerberos to enforce authentication against a trusted central user database, and to allow clients and servers to mutually authenticate their remote endpoints. A shared secret (based on the user password) authenticates clients to the central server, which then issues a temporary session key (contained in the “ticket-granting ticket” or TGT). The TGT and session key are then used by the client to obtain additional tickets (authentication tokens) for accessing specific services. The service tickets allow both the client and the server to verify their endpoints against the Kerberos server, preventing impersonation. In addition to authentication, secure Hadoop negotiates RPC connections via SASL, which allows plugging in alternate authentication mechanisms — currently Kerberos (GSSAPI) or DIGEST-MD5 (for either “delegation” tokens or MapReduce job tokens) — and encryption of communications for confidentiality, though this is not enabled by default.
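Once the Kerberos environment described below is in place, you can watch that exchange happen by hand with the standard MIT client tools (kvno requests and caches a service ticket for the named principal); the principal names here simply mirror the ones created later in this post:

$ kinit hclient                   # password login: obtains and caches the TGT
$ kvno hadoop/`hostname --fqdn`   # uses the TGT to fetch a service ticket
$ klist                           # the cache now holds the TGT plus the service ticket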

Setting up Kerberos

Due to the Kerberos requirement, the first step is, naturally, setting up a local Kerberos KDC (key distribution center) for testing. The KDC maintains a database of user credentials, grants temporary cryptographic tokens (“ticket-granting tickets”) to clients upon successful authentication, and acts as a trusted third party in obtaining and verifying the service tickets passed between clients and servers. If you already have a functioning Kerberos environment, feel free to skip this section and make use of your own principals and/or domain in the following sections.

These instructions will be Debian/Ubuntu specific, but should be easy to adapt for other distributions. First, install the required packages:

sudo apt-get install krb5-admin-server

This should pull in the krb5kdc daemon (which grants tickets) and kadmind (which maintains the principal database), along with the Kerberos client utilities and libraries.

Next you’ll need to modify the default configuration files. Configure the following files with the contents shown:

/etc/krb5kdc/kdc.conf

[kdcdefaults]
    kdc_ports = 750,88

[realms]
    HADOOP.LOCALDOMAIN = {
        database_name = /var/lib/krb5kdc/principal
        admin_keytab = FILE:/etc/krb5kdc/kadm5.keytab
        acl_file = /etc/krb5kdc/kadm5.acl
        key_stash_file = /etc/krb5kdc/stash
        kdc_ports = 750,88
        max_life = 1d 0h 0m 0s
        max_renewable_life = 7d 0h 0m 0s
        master_key_type = des3-hmac-sha1
        supported_enctypes = des3-hmac-sha1:normal des-cbc-crc:normal des:normal des:v4 des:norealm des:onlyrealm
        default_principal_flags = +preauth
    }

/etc/krb5kdc/kadm5.acl

*/admin@HADOOP.LOCALDOMAIN    *

/etc/krb5.conf

[libdefaults]
    default_realm = HADOOP.LOCALDOMAIN
    ccache_type = 4
    forwardable = true
    proxiable = true
    renew_lifetime = 1d

[realms]
    HADOOP.LOCALDOMAIN = {
        kdc = localhost
        admin_server = localhost
        default_domain = localdomain
     }

[domain_realm]
    .localdomain = HADOOP.LOCALDOMAIN
    localdomain = HADOOP.LOCALDOMAIN
    .local = HADOOP.LOCALDOMAIN
    local = HADOOP.LOCALDOMAIN

[login]
krb4_convert = true
krb4_get_tickets = false

Now that we have the configuration set up, we need to initialize the Kerberos database. On Debian-based systems, run:

sudo krb5_newrealm

On other distributions, just run the individual steps:

sudo kdb5_util create -s
sudo /etc/init.d/krb5kdc start
sudo /etc/init.d/krb5-admin-server start
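To make sure the new realm is actually usable, you can list the principals that the database creation step generated (the exact set varies a little by Kerberos version, but you should at least see the krbtgt/HADOOP.LOCALDOMAIN@HADOOP.LOCALDOMAIN principal along with the built-in kadmin entries):

sudo kadmin.local -q "listprincs"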

Building and Installing Secure Hadoop

I’ve made available a pre-built copy of the Yahoo! secure Hadoop branch (0.20.104.2), which you can download. This version was built with the instructions below, except the Eclipse plugin was excluded due to compilation errors. If you would prefer to build secure Hadoop from source, follow the instructions below.

Secure Hadoop is available from the Yahoo! repository on github, which also includes extensive instructions on building and configuring secure Hadoop. This is a rebasing of the security features on Yahoo’s distribution of Hadoop 0.20. For HBase purposes, however, this is missing the critical data durability improvements back-ported in the Hadoop 0.20-append branch. For a version combining both of these changes, check out Andrew Purtell’s merged yahoo-hadoop-0.20.104-append branch (be warned this is an experimental branch and not well tested!). Both features are also being combined by Cloudera’s Todd Lipcon and slated for full support in CDH3.
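If you do want to build from source, grabbing the branch looks something like the following; the repository path and branch name here are my best guess based on the links above, so double-check them against the github pages before relying on them:

# repository path and branch name are assumptions -- verify on github first
git clone git://github.com/yahoo/hadoop-common.git
cd hadoop-common
git checkout -b secure-0.20 origin/yahoo-hadoop-0.20.104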

The build steps don’t seem to have changed from previous Hadoop 0.20 releases, which basically consist of (from the Hadoop wiki):

  1. Install 32 and 64 bit versions of JDK 1.6
  2. Install JDK 1.5
  3. Install a recent Eclipse for the Eclipse plugin to build
  4. Install Xerces-C 2.8
  5. Build as follows
export JAVA_HOME=/path/to/32bit/jdk
export CFLAGS=-m32
export CXXFLAGS=-m32
ant -Dcompile.native=true -Dcompile.c++=true -Dlibhdfs=1 \
    -Dlibrecordio=true -Dxercescroot=/usr/lib/libxerces28 \
    -Declipse.home=/usr/lib/eclipse -Dforrest.home=/usr/local/forrest \
    -Djava5.home=/usr/local/jdk1.5 \
    clean api-report tar test-c++-libhdfs
export JAVA_HOME=/path/to/64bit/jdk
export CFLAGS=-m64
export CXXFLAGS=-m64
ant -Dcompile.native=true -Dcompile.c++=true compile-core-native compile-c++ tar

Note for Ubuntu 10.04

When building on Ubuntu 10.04 (the same may apply to other recent Debian-based distros), I had to manually apply the fixes for HADOOP-5611 and MAPREDUCE-1251 in order for the C++ libraries to compile.
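Applying those fixes is just a matter of downloading the patch attachments from the respective JIRA issues and applying them from the source root; the patch filenames below are placeholders for whatever the current attachments are called:

cd /path/to/hadoop-source
# filenames are placeholders; use the latest attachments from the
# HADOOP-5611 and MAPREDUCE-1251 JIRA issues
patch -p0 < ~/HADOOP-5611.patch
patch -p0 < ~/MAPREDUCE-1251.patch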

Finally extract the generated tarball (build/hadoop-0.20.104.2-SNAPSHOT.tar.gz) to a location of your choosing (hereafter referred to as HADOOP_HOME).

Creating Service Principals

Now that we have a functioning Kerberos environment, let’s create some users for running the Hadoop processes. The Hadoop processes will read their credentials from a keytab file when starting up, so we’ll also need to export the principals there. For simplicity I’m using a single keytab file for all processes. For enhanced security, you would want to a) use separate principals for the HDFS and MapReduce processes, and b) give each process access only to the credentials it needs, in its own keytab file, instead of using a shared file.

$ sudo su
# adduser hadoop
# adduser hclient --ingroup hadoop
# myhost=`hostname --fqdn`
# HADOOP_HOME=<path to hadoop>
# kadmin.local <<EOF
addprinc -randkey host/$myhost
addprinc -randkey hadoop/$myhost
ktadd -k $HADOOP_HOME/conf/nn.keytab host/$myhost
ktadd -k $HADOOP_HOME/conf/nn.keytab hadoop/$myhost
quit
EOF
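Before moving on, it’s worth double-checking that both principals actually landed in the keytab; klist can read keytab files directly:

# klist -k -t $HADOOP_HOME/conf/nn.keytab

You should see entries for both host/<your fqdn>@HADOOP.LOCALDOMAIN and hadoop/<your fqdn>@HADOOP.LOCALDOMAIN, one per supported encryption type.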

We’ll also add a couple of principals with manually set passwords for testing (remember these!):

# kadmin.local
kadmin.local:  addprinc hadoop
kadmin.local:  addprinc hclient
kadmin.local:  quit

Finally, change the keytab file to be readable and writable only by the hadoop user:

$ sudo chown hadoop:hadoop $HADOOP_HOME/conf/nn.keytab
$ sudo chmod 600 $HADOOP_HOME/conf/nn.keytab

Testing Kerberos Authentication

Now that our Kerberos environment is (hopefully!) fully functional, let’s test it out. First obtain a ticket as the hclient user, using the password you entered earlier:

$ kinit hclient
Password for hclient@HADOOP.LOCALDOMAIN:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: hclient@HADOOP.LOCALDOMAIN

Valid starting     Expires            Service principal
07/19/10 10:13:57  07/19/10 20:13:57  krbtgt/HADOOP.LOCALDOMAIN@HADOOP.LOCALDOMAIN
 renew until 07/20/10 10:13:56

You should also be able to obtain a Kerberos ticket from the keytab file we generated:

$ sudo su hadoop
$ myhost=`hostname --fqdn`
$ kinit -k -t conf/nn.keytab hadoop/$myhost
$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: hadoop/...@HADOOP.LOCALDOMAIN

Valid starting     Expires            Service principal
07/19/10 10:19:07  07/19/10 20:19:07  krbtgt/HADOOP.LOCALDOMAIN@HADOOP.LOCALDOMAIN
 renew until 07/20/10 10:19:08

Note for MIT Kerberos 1.8.1+

The latest version of MIT Kerberos (included in the latest Debian/Ubuntu packages) contains a change that breaks Java’s parsing of the per-user credential cache file (/tmp/krb5cc_<uid>). The change adds an API for storing configuration data in the file under magic principal names. The latest JDK I’ve tested (1.6.0_20) is not aware of the configuration principal names, and so it chokes when parsing the config entries. As a workaround, you can renew your Kerberos ticket (kinit -R), which rewrites the credential cache without the config entry:

$ kinit hclient
$ strings /tmp/krb5cc_`id -u` | grep "X-CACHECONF"
X-CACHECONF:
$ kinit -R
$ strings /tmp/krb5cc_`id -u` | grep "X-CACHECONF"
$

If you’re using an older version of Kerberos, you can ignore this step in the later tests (though it shouldn’t hurt either). In addition, if you’re working with a full Kerberos setup using pam_krb5 for login integration, it does not seem to generate the config entries in its credential cache files. If anyone has an alternate configuration for krb5.conf or another workaround, please let me know!

Configuring Secure Hadoop

Okay, it’s been a long road getting our Kerberos environment set up, but we’re finally ready to configure our Hadoop installation and start it up. Secure Hadoop adds a number of new configuration options to enable the security controls and integrate the Hadoop processes into the Kerberos environment.

conf/core-site.xml

First we need to enable the use of Kerberos authentication and access control checks, by adding the following to core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://<your host name>:9000/</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
  • hadoop.security.authentication – valid entries are “simple” or “kerberos”. “simple” means the pre-existing functionality of assuming the identity of whatever user account is running the process locally (meaning any client with local superuser privileges can impersonate any user). “kerberos” enables Kerberos authentication along with SASL negotiation of remote connections.
  • hadoop.security.authorization – “true” or “false”. Enables use of the hadoop-policy.xml file to constrain access to Hadoop protocols/operations. Should be set to “true” for secure operation; otherwise any authenticated client will have full access to the cluster.

conf/hdfs-site.xml

Since we’re running everything on a single host (pseudo-distributed mode), we’ll limit HDFS to a single replica of data blocks (for more detail on basic configuration see the Getting Started and Quick Start guides):

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

This is also where we actually secure the Hadoop processes by specifying the keytab files used to obtain the process credentials, and the Kerberos principals that each process should use to authenticate. Similar entries are used for each of the Hadoop processes (NameNode, SecondaryNameNode, DataNode):

<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<!-- NameNode security config -->
<property>
  <name>dfs.https.port</name>
  <value>50475</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>HADOOP_HOME/conf/nn.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>dfs.namenode.kerberos.https.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<!-- SecondaryNameNode security config -->
<property>
  <name>dfs.secondary.https.port</name>
  <value>50495</value>
</property>
<property>
  <name>dfs.secondarynamenode.keytab.file</name>
  <value>HADOOP_HOME/conf/nn.keytab</value>
</property>
<property>
  <name>dfs.secondarynamenode.kerberos.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>dfs.secondarynamenode.kerberos.https.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<!-- DataNode security config -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
 <name>dfs.datanode.http.address</name>
 <value>0.0.0.0:1006</value>
</property>
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>HADOOP_HOME/conf/nn.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>dfs.datanode.kerberos.https.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
  • dfs.block.access.token.enable – “true” or “false”. To streamline block I/O between clients and DataNodes, secure Hadoop adds in the use of “block access tokens” in RPC requests between clients and DataNodes. These tokens encapsulate the file access authorizations granted to the client by the NameNode process (the only process with a view of the filesystem metadata), applied to each block composing the requested file. Block access tokens can be independently validated by DataNodes, keeping the block I/O operations distributed. This property must be enabled for secure HDFS operation.
  • dfs.(name|secondaryname|data)node.keytab.file – Path to the keytab file containing the credentials the named process will run as
  • dfs.(name|secondaryname|data)node.kerberos.principal – Name of the principal that the process will authenticate as, using the specified keytab file. In all of the principal names, the “_HOST” string placeholder will be automatically replaced by the fully qualified hostname as the configuration is read at runtime.
  • dfs.(name|secondaryname|data)node.kerberos.https.principal – Name of the principal that the process’ web UI will authenticate as (again from the specified keytab file)

conf/mapred-site.xml

Like the HDFS processes, the JobTracker and TaskTracker map reduce processes must be configured with the credentials under which they will run. In addition, job and queue access permissions are configurable.

<property>
  <name>mapreduce.jobtracker.keytab.file</name>
  <value>HADOOP_HOME/conf/nn.keytab</value>
</property>
<property>
  <name>mapreduce.jobtracker.kerberos.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>mapreduce.jobtracker.kerberos.https.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>mapreduce.tasktracker.keytab.file</name>
  <value>HADOOP_HOME/conf/nn.keytab</value>
</property>
<property>
  <name>mapreduce.tasktracker.kerberos.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>mapreduce.tasktracker.kerberos.https.principal</name>
  <value>hadoop/_HOST@HADOOP.LOCALDOMAIN</value>
</property>

These configuration entries should now be easily recognizable based on the hdfs-site.xml entries above. For a more secure installation, we could split out the service principal names used for the map reduce processes (say “mapred/_HOST”), but here I’ve opted to re-use the same principals used for the HDFS processes for simplicity.

<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value></value>
</property>
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value></value>
</property>
  • mapred.acls.enabled – “true” or “false”. Enables authorization checks for operations accessing or manipulating map reduce jobs.
  • mapreduce.job.acl-modify-job – list of users and groups in the format “user1,user2 group1,group2” — a comma-separated list of usernames, followed by a space, followed by a comma-separated list of group names. In addition, the special values “*” (allow all) and “” (allow none) are recognized. This entry defines the users/groups allowed to kill a job, kill a job task, and change job priorities. In addition to this list, the job owner and any superusers always have these permissions.
  • mapreduce.job.acl-view-job – list of users and groups in the same format as above. This property controls access to potentially sensitive job data like counters and logs.
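As a purely illustrative example, an entry like the following would let the hclient user and anyone in a hypothetical “ops” group view job details, while leaving job modification restricted to the job owner and superusers:

<property>
  <name>mapreduce.job.acl-view-job</name>
  <value>hclient ops</value>
</property>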

Configuring Secure DataNode

Securing the DataNode process now requires that it be started as the “root” user in order to bind to privileged ports (<1024). This provides a basic safeguard against unprivileged user code hijacking the data pipeline. The DataNode process is launched using the Apache jsvc native wrapper, which allows it to initialize the privileged resources and then drop down to run as a configured, unprivileged user account.

Install jsvc, and link it to HADOOP_HOME/bin, where the startup scripts expect to find it:

$ sudo apt-get install jsvc
# symlink into HADOOP_HOME/bin where hadoop scripts expect it
$ sudo ln -s /usr/bin/jsvc HADOOP_HOME/bin

Add the following line to conf/hadoop-env.sh to configure the user account used to run the non-privileged code:

export HADOOP_SECURE_DN_USER=hadoop

Starting Up

You should now be almost ready to start up the Hadoop processes. As a final step create a directory for logging and make it group writable so both the hadoop and hbase users can write to it:

$ cd HADOOP_HOME
$ mkdir logs
$ sudo chown hadoop:hadoop logs
$ sudo chmod 775 logs

Format HDFS:

$ sudo su hadoop
$ ./bin/hadoop namenode -format

Start up the HDFS processes:

$ sudo su hadoop   # if not already hadoop user
$ ./bin/hadoop-daemon.sh start namenode
$ ./bin/hadoop-daemon.sh start secondarynamenode
$ exit
$ sudo ./bin/hadoop-daemon.sh start datanode

Start up MapReduce processes:

$ sudo su hadoop
$ ./bin/hadoop-daemon.sh start jobtracker
$ ./bin/hadoop-daemon.sh start tasktracker
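Before moving on, a quick sanity check doesn’t hurt. As the hadoop user (with a ticket from the keytab, as before), jps should list the NameNode, SecondaryNameNode, JobTracker, and TaskTracker, and a dfsadmin report should show one live DataNode once it has registered with the NameNode (the DataNode itself runs under jsvc, so it may not show up in jps):

$ sudo su hadoop
$ cd HADOOP_HOME
$ kinit -k -t conf/nn.keytab hadoop/`hostname --fqdn`
$ jps
$ ./bin/hadoop dfsadmin -report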

A Simple Test

To verify that everything is now working as expected, we can run a simple MapReduce test as the non-privileged hclient user. Since the hclient user does not have write permission to the HDFS root directory, we need to first prepare the directory structure:

$ cd HADOOP_HOME
$ sudo su hadoop
$ myhost=`hostname --fqdn`
$ kinit -k -t conf/nn.keytab hadoop/$myhost
$ ./bin/hadoop fs -mkdir /benchmarks
$ ./bin/hadoop fs -mkdir /user/hclient
$ ./bin/hadoop fs -chown hclient /benchmarks
$ ./bin/hadoop fs -chown hclient /user/hclient
$ exit
$ sudo su hclient
$ kinit
$ kinit -R    # rewrite the krb ccache without the config entries (see the MIT Kerberos 1.8.1 note above)
$ ./bin/hadoop jar hadoop-test-*.jar TestDFSIO -write -nrFiles 10 -fileSize 100
$ ./bin/hadoop jar hadoop-test-*.jar TestDFSIO -read -nrFiles 10 -fileSize 100
$ cat TestDFSIO_results.log 

----- TestDFSIO ----- : write
           Date & time: ....
       Number of files: 10
Total MBytes processed: 1000
...
----- TestDFSIO ----- : read
           Date & time: ....
       Number of files: 10
Total MBytes processed: 1000
...

Assuming everything completes successfully, this simply verifies that we’re able to read and write normally to HDFS as a non-privileged user. Now let’s test accessing a file where we do not have permission:

$ sudo su hadoop
$ kinit && kinit -R
$ ./bin/hadoop fs -mkdir /user/hadoop
$ ./bin/hadoop fs -put hadoop-examples-*.jar /user/hadoop/hadoop-examples.jar
$ ./bin/hadoop fs -chmod 600 /user/hadoop/hadoop-examples.jar
$ exit
$ sudo su hclient
$ kinit && kinit -R
$ ./bin/hadoop fs -get /user/hadoop/hadoop-examples.jar .
get: org.apache.hadoop.security.AccessControlException: Permission denied: user=hclient,
 access=READ, inode="hadoop-examples.jar":hadoop:supergroup:rw-------

As expected, we’re correctly denied access to the file.

Congratulations! If you’ve made it this far, you should now have a functioning secure Hadoop instance running locally! Go ahead and run some tests of your own to kick the tires. And if I’ve missed anything or you run into problems, please report back.

Further Discussion

I’ve only really touched on the access control list configurations in the descriptions above. You can configure more granular access policies by editing the conf/hadoop-policy.xml and conf/mapred-queue-acls.xml files. After you edit the hadoop-policy.xml file, you can reload the configuration on a running cluster with the “dfsadmin” command:

./bin/hadoop dfsadmin -refreshServiceAcl
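For example, the stock hadoop-policy.xml contains a security.client.protocol.acl entry that controls who may talk to HDFS as a client at all; the user and group names below are just illustrative:

<property>
  <name>security.client.protocol.acl</name>
  <value>hclient,hadoop hadoop</value>
</property>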

As also alluded to above, I’ve omitted some potential configurations for the sake of simplicity. For enhanced security, you would want to split the HDFS processes and MapReduce processes to use separate principals and to read from individual keytab files (say nn.keytab, dn.keytab, jt.keytab, tt.keytab), so that each process only has access to its own credentials.
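Creating those extra principals and keytabs follows the same kadmin pattern used for nn.keytab earlier; a rough sketch for a dedicated MapReduce principal might look like this, with $myhost and $HADOOP_HOME set as in the earlier session (the principal and file names are just one possible layout, and on a real multi-node cluster each host would get its own host-qualified principals):

sudo kadmin.local <<EOF
addprinc -randkey mapred/$myhost
ktadd -k $HADOOP_HOME/conf/mapred.keytab mapred/$myhost
quit
EOF
sudo chown hadoop:hadoop $HADOOP_HOME/conf/mapred.keytab
sudo chmod 600 $HADOOP_HOME/conf/mapred.keytab

The mapreduce.jobtracker.* and mapreduce.tasktracker.* keytab and principal properties in mapred-site.xml would then point at mapred.keytab and mapred/_HOST@HADOOP.LOCALDOMAIN instead.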

For a much better explanation of the design of the security implementation, see the excellent Hadoop security design document.

Access events get written to the logs/SecureAuth.audit file in the default configuration. For multi-user setups, this means that all test users must have write permission to the file. One way to accommodate this is to make all users members of the same group, and make the directory and file group-writable, with a user umask of 002. A better approach, however, would be to modify the log4j.properties file to use either syslog or an alternate appender for the audit log.
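As a sketch of that last option, something along these lines in conf/log4j.properties would push the security audit messages to the local syslog daemon instead of a shared file. I’m assuming here that the audit messages go through a “SecurityLogger” category, as in the stock log4j.properties shipped with the branch, so check the existing file for the exact logger name before copying this:

# route security audit events to syslog instead of the shared audit file
# (assumes the SecurityLogger category used by the stock log4j.properties)
log4j.category.SecurityLogger=INFO,SYSLOG
log4j.additivity.SecurityLogger=false
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.SyslogHost=localhost
log4j.appender.SYSLOG.Facility=AUTH
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n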

What this Means for HBase

As I mentioned initially, the ultimate goal is to build on top of secure Hadoop’s foundation and implement a “secure HBase” (see HBASE-1697), featuring role-based access control, audit logs, and data isolation. We’ve made some progress on the initial step of running HBase as an authenticated client of secure Hadoop, and authenticating HBase’s own RPC calls — watch for more details on this soon! The first phase will likely involve HBase interacting with HDFS as its own trusted principal and mediating authorization at the table or column family level, based on its own access-control lists. But the main implementation effort is just getting started and contributors are always welcome! If you have particular needs or considerations, please take part in the discussions on either the JIRA issue above or the HBase mailing lists.
