Running Spark on Kubernetes with Remote Hive Metastore and S3 Warehouse
Introduction
In today’s data-driven world, efficient and reliable data processing frameworks like Apache Spark are in high demand. When combined with the power of Kubernetes, a leading container orchestration platform, and the versatility of AWS S3 for data storage, Spark becomes an even more powerful tool.
In this tutorial, we’ll explore how to run Spark on Kubernetes, using a remote Hive Metastore for managing metadata and S3 as our data warehouse.
Prerequisites
This tutorial assumes that you have a basic understanding of Apache Spark, Hive, Kubernetes, and AWS S3. You should also have access to a Kubernetes cluster and an AWS account with the necessary permissions to create and access S3 buckets.
The Setup
Our setup consists of a Spark application running on a Kubernetes cluster, a Hive Metastore also deployed on the same cluster, and an S3 bucket for storing our data. The Spark application interacts with the Hive Metastore for metadata management and directly with S3 for data read/write operations.
Setting up Hive Metastore on Kubernetes
To set up the Hive Metastore, we’ll need a Docker image that runs the Metastore service and a Kubernetes deployment configuration. The Metastore service connects to a relational database for storing metadata. Here’s the simple Kubernetes configuration for our Hive Metastore:
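Below is a sketch of such a Pod manifest, assuming the JAR versions (`hadoop-aws:3.1.0` and its matching `aws-java-sdk-bundle:1.11.271`), the Maven Central URLs, the `/opt/hive` install path inside the image, and the credential placeholders; adjust these to your environment, and prefer a Kubernetes Secret over inline credentials:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: metastore-standalone
spec:
  initContainers:
    # Downloads the AWS JARs that the apache/hive image does not ship with.
    - name: download-aws-jars
      image: busybox:1.28
      command: ["sh", "-c"]
      args:
        - wget -P /jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.0/hadoop-aws-3.1.0.jar &&
          wget -P /jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
      volumeMounts:
        - name: jars
          mountPath: /jars
  containers:
    - name: metastore
      image: apache/hive:3.1.3
      env:
        - name: SERVICE_NAME
          value: metastore
        - name: AWS_ACCESS_KEY_ID
          value: <your-access-key>       # placeholder; use a Secret in practice
        - name: AWS_SECRET_ACCESS_KEY
          value: <your-secret-key>       # placeholder; use a Secret in practice
      # Overrides the image entrypoint: stage the JARs, init the Derby
      # schema, then start the Metastore service.
      command: ["sh", "-c"]
      args:
        - mv /jars/*.jar /opt/hadoop/share/hadoop/common/ &&
          /opt/hive/bin/schematool -dbType derby -initSchema &&
          /opt/hive/bin/hive --service metastore
      ports:
        - containerPort: 9083   # default Metastore Thrift port
      volumeMounts:
        - name: jars
          mountPath: /jars
  volumes:
    - name: jars
      emptyDir: {}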
This YAML file describes a Kubernetes Pod for `metastore-standalone`.
- Init Containers: This container is responsible for downloading the necessary dependencies for our application. It uses the `busybox:1.28` image and runs a shell command to download the `hadoop-aws` and `aws-java-sdk-bundle` JAR files from a Maven repository.
- Main Container: The main container runs the `apache/hive:3.1.3` image. It is configured to run the Hive Metastore service, which manages metadata for Hive tables and partitions.
- Environment Variables: We’ve set several environment variables for the Hive Metastore service. `SERVICE_NAME` is set to `metastore`. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are also set, which will be used for accessing AWS S3.
- Command: The main command for the container first moves the downloaded JAR files from the `/jars` directory to the `/opt/hadoop/share/hadoop/common` directory. It then initializes the Metastore schema using Derby as the database, and finally runs the Hive Metastore service.
Note: The Hive Metastore service in this setup uses a local/embedded Metastore database (Derby). This is fine for testing or development purposes; for production use, however, a remote Metastore database is recommended. The Metastore service supports several types of databases, including MySQL, Postgres, Oracle, and MS SQL Server. You would need to adjust the `schematool` command and provide additional configurations to use a remote database.
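As an illustration, switching to a remote MySQL Metastore database would involve JDBC connection properties along these lines in `hive-site.xml` (the host, database name, and credentials here are placeholders, and the MySQL JDBC driver JAR would also need to be on the classpath):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive-password</value>
</property>
```

The schema initialization would then become `schematool -dbType mysql -initSchema`.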
Configuring Spark to Use the Remote Hive Metastore and S3 as a Warehouse
To configure Spark to use our remote Hive Metastore, we need to provide certain configurations when submitting our Spark application. Specifically, we need to point Spark at our Hive Metastore service and provide the necessary credentials. Additionally, to use S3 as our data warehouse, we need to supply Spark with our AWS credentials and the S3 bucket name, again via Spark configurations. We also need to include the `hadoop-aws` package, which allows Spark to interact with S3.
Here’s an example of a `spark-submit` command:
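The following is a sketch of such a command; the Kubernetes API server address, bucket name, image name, Metastore service host, Parquet settings, and credential placeholders are assumptions to adapt to your cluster:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-hive-s3 \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.access.key=<access-key> \
  --conf spark.hadoop.fs.s3a.secret.key=<secret-key> \
  --conf spark.hadoop.fs.s3a.endpoint=s3.amazonaws.com \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.hadoop.fs.s3a.committer.name=directory \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://metastore-standalone:9083 \
  --conf spark.sql.warehouse.dir=s3a://<bucket>/warehouse \
  --conf spark.sql.parquet.filterPushdown=true \
  --conf spark.sql.parquet.mergeSchema=false \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.file.upload.path=s3a://<bucket>/uploads \
  --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=executor-template.yaml \
  local:///opt/spark/app/my_app.py
```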
This `spark-submit` command runs a Spark application in a Kubernetes cluster.
- S3 Access: The configurations `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` provide the access keys needed to interact with AWS S3. The `spark.hadoop.fs.s3a.endpoint` configuration sets the endpoint used to access the S3 service.
- Performance Optimizations: This command contains several recommended optimizations for S3, including enabling fast upload (`spark.hadoop.fs.s3a.fast.upload`), using the `directory` committer (`spark.hadoop.fs.s3a.committer.name`), and more.
- Hive Metastore: The `spark.sql.catalogImplementation` configuration is set to `hive`, which means the application will use Hive's catalog implementation. The `spark.hadoop.hive.metastore.uris` configuration sets the URI for connecting to the Hive Metastore.
- Parquet Configurations: There are several configurations related to Parquet, a columnar storage file format. These tweak the behavior of the Spark application when it reads or writes Parquet files.
- Kubernetes-specific Configurations: The configurations starting with `spark.kubernetes` are specific to running Spark on Kubernetes. They specify the Docker image for the Spark application, the path to upload application files, the Pod template files for the driver and executor Pods, and more.
Leverage Spark API with Hive Tables Stored in S3
One of the key advantages of this setup is the ability to utilize the general Spark API with Hive tables, which are stored in S3, without needing additional configuration within your code. This provides the flexibility and ease of using Spark’s powerful transformations and actions directly on data stored in Hive tables.
Consider the following scenario: you have a DataFrame that you want to write into a Hive partitioned table. With Spark, Hive, and S3 integrated, you can use the DataFrame API to accomplish this:
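A minimal sketch of such a write, assuming an existing SparkSession with Hive support, a DataFrame `df`, and hypothetical table and column names:

```python
# Write the DataFrame as a Hive table partitioned by "partitionColumn".
# Because spark.sql.warehouse.dir points at S3, the Parquet files land
# in the S3 bucket while the table metadata goes to the Hive Metastore.
df.write \
    .partitionBy("partitionColumn") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("my_database.my_table")
```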
In this case, the partitioned column `partitionColumn` will be moved to the end of the schema. The data gets stored in an S3 bucket acting as the warehouse. You can then seamlessly read from this Hive table stored in S3 using Spark's SQL interface:
This straightforward interaction between Spark, Hive, and S3 simplifies your data processing tasks and enhances your ability to manage and analyze big data. It’s a testament to the power of combining these technologies in a unified data platform.
Common Issues and Troubleshooting
When deploying Apache Hive on Kubernetes, you might run into some challenges. In this section, we will address two main issues that are commonly encountered when using the `apache/hive:3.1.3` Docker image for the Hive Metastore service, and provide solutions to overcome them.
Issue 1: Bug in the Docker Image Entrypoint
The first issue lies in the entrypoint of the Docker image. The default command specified in the image may not work as expected due to a bug.
Solution: The most effective workaround is to overwrite the default command with a custom one. We specify a new command in the `containers` section of our Kubernetes configuration. This command initializes the schema for the Derby database (which we use as the Metastore database in our case) and then starts the Hive Metastore service.
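In the container spec, the override might look roughly like this, assuming `/opt/hive` is the Hive home inside the image:

```yaml
command: ["sh", "-c"]
args:
  - /opt/hive/bin/schematool -dbType derby -initSchema &&
    /opt/hive/bin/hive --service metastore
```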
Issue 2: Missing AWS JARs and Download Utilities
The second issue is that the default Docker image for Hive does not include AWS-related JAR files which are crucial for connecting the Metastore service to an S3 bucket. Moreover, the image does not contain utilities to download these JARs.
Solution: To solve this problem, we add an init container in our Kubernetes configuration. This init container is based on a `busybox` image, which contains utilities for downloading files. We use this container to download the necessary AWS JAR files from a Maven repository and store them in a shared volume. We then add these JARs to the classpath of our main Hive Metastore container.
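The init container portion of the spec can be sketched as follows, with the JAR versions and Maven URLs as assumptions (the `jars` volume is an `emptyDir` shared with the main container):

```yaml
initContainers:
  - name: download-aws-jars
    image: busybox:1.28
    command: ["sh", "-c"]
    args:
      - wget -P /jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.0/hadoop-aws-3.1.0.jar &&
        wget -P /jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
    volumeMounts:
      - name: jars
        mountPath: /jars
```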
These solutions help us establish a Hive Metastore service on Kubernetes that can connect to an S3 bucket as the warehouse location.
Conclusion
In this tutorial, we’ve explored how to run Spark on Kubernetes with a remote Hive Metastore and S3 as our data warehouse. This setup provides a scalable and efficient solution for running big data applications. With the power of Kubernetes, we can easily scale our Spark application to meet our needs. And with the versatility of S3, we can store large amounts of data at a low cost.
I hope you found this tutorial helpful. If you have any questions or comments, please feel free to leave them below. I’d love to hear about your experience with running Spark on Kubernetes, or any insights you might have about working with Hive and S3.
References and Further Reading
For more in-depth information about the topics covered in this tutorial, I recommend the following resources:
- Running Spark on Kubernetes
- Integration with Cloud Infrastructures
- Hadoop-AWS module: Integration with Amazon Web Services
- AdminManual Metastore 3.0 Administration
- Run Apache Hive inside docker container in pseudo-distributed mode
Stay tuned for more posts on Big Data, Spark, Kubernetes, and more!