Running Spark on Kubernetes with Remote Hive Metastore and S3 Warehouse
Introduction
In today’s data-driven world, efficient and reliable data processing frameworks like Apache Spark are in high demand. When combined with the power of Kubernetes, a leading container orchestration platform, and the versatility of AWS S3 for data storage, Spark becomes an even more powerful tool.
In this tutorial, we’ll explore how to run Spark on Kubernetes, using a remote Hive Metastore for managing metadata and S3 as our data warehouse.
Prerequisites
This tutorial assumes that you have a basic understanding of Apache Spark, Hive, Kubernetes, and AWS S3. You should also have access to a Kubernetes cluster and an AWS account with the necessary permissions to create and access S3 buckets.
The Setup
Our setup consists of a Spark application running on a Kubernetes cluster, a Hive Metastore also deployed on the same cluster, and an S3 bucket for storing our data. The Spark application interacts with the Hive Metastore for metadata management and directly with S3 for data read/write operations.
Setting up Hive Metastore on Kubernetes
To set up the Hive Metastore, we’ll need a Docker image that runs the Metastore service and a Kubernetes deployment configuration. The Metastore service connects to a relational database for storing metadata. Here’s the simple Kubernetes configuration for our Hive Metastore:
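Below is a sketch of such a Pod manifest, assuming the JAR versions (`hadoop-aws:3.1.0` and its matching `aws-java-sdk-bundle:1.11.271`), the Maven Central URLs, the `/opt/hive` install path inside the image, and the credential placeholders; adjust these to your environment, and prefer a Kubernetes Secret over inline credentials:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: metastore-standalone
spec:
  initContainers:
    # Downloads the AWS JARs that the apache/hive image does not ship with.
    - name: download-aws-jars
      image: busybox:1.28
      command: ["sh", "-c"]
      args:
        - wget -P /jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.0/hadoop-aws-3.1.0.jar &&
          wget -P /jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
      volumeMounts:
        - name: jars
          mountPath: /jars
  containers:
    - name: metastore
      image: apache/hive:3.1.3
      env:
        - name: SERVICE_NAME
          value: metastore
        - name: AWS_ACCESS_KEY_ID
          value: <your-access-key>       # placeholder; use a Secret in practice
        - name: AWS_SECRET_ACCESS_KEY
          value: <your-secret-key>       # placeholder; use a Secret in practice
      # Overrides the image entrypoint: stage the JARs, init the Derby
      # schema, then start the Metastore service.
      command: ["sh", "-c"]
      args:
        - mv /jars/*.jar /opt/hadoop/share/hadoop/common/ &&
          /opt/hive/bin/schematool -dbType derby -initSchema &&
          /opt/hive/bin/hive --service metastore
      ports:
        - containerPort: 9083   # default Metastore Thrift port
      volumeMounts:
        - name: jars
          mountPath: /jars
  volumes:
    - name: jars
      emptyDir: {}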
This YAML file describes a Kubernetes Pod for `metastore-standalone`.
- Init Containers: This container is responsible for downloading the necessary dependencies for our application. It uses the `busybox:1.28` image and runs a shell command to download the `hadoop-aws` and `aws-java-sdk-bundle` JAR files from a Maven repository.
- Main Container: The main container runs the `apache/hive:3.1.3` image. It is configured to run the Hive Metastore service, which manages metadata for Hive tables and partitions.
- Environment Variables: We’ve set several environment variables for the Hive Metastore service. `SERVICE_NAME` is set to `metastore`. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are also set, which will be used for accessing AWS S3.
- Command: The main command for the container first moves the downloaded JAR files from the `/jars` directory to the `/opt/hadoop/share/hadoop/common` directory. It then initializes the Metastore schema using Derby as the database, and finally runs the Hive Metastore service.
Note: The Hive Metastore service in this setup uses a local/embedded Metastore database (Derby). This is fine for testing or development purposes; for production use, however, a remote Metastore database is recommended. The Metastore service supports several types of databases, including MySQL, Postgres, Oracle, and MS SQL Server. You would need to adjust the `schematool` command and provide additional configurations to use a remote database.
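As an illustration, switching to a remote MySQL Metastore database would involve JDBC connection properties along these lines in `hive-site.xml` (the host, database name, and credentials here are placeholders, and the MySQL JDBC driver JAR would also need to be on the classpath):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive-password</value>
</property>
```

The schema initialization would then become `schematool -dbType mysql -initSchema`.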
Configuring Spark to Use the Remote Hive Metastore and S3 as a Warehouse
To configure Spark to use our remote Hive Metastore, we need to provide certain configurations when submitting our Spark application. Specifically, we need to point Spark at our Hive Metastore service and provide the necessary credentials. Additionally, to use S3 as our data warehouse, we need to supply Spark with our AWS credentials and the S3 bucket name, again via Spark configurations. We also need to include the `hadoop-aws` package, which allows Spark to interact with S3.
Here’s an example of a `spark-submit` command:
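The following is a sketch of such a command; the Kubernetes API server address, bucket name, image name, Metastore service host, Parquet settings, and credential placeholders are assumptions to adapt to your cluster:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-hive-s3 \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.access.key=<access-key> \
  --conf spark.hadoop.fs.s3a.secret.key=<secret-key> \
  --conf spark.hadoop.fs.s3a.endpoint=s3.amazonaws.com \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.hadoop.fs.s3a.committer.name=directory \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://metastore-standalone:9083 \
  --conf spark.sql.warehouse.dir=s3a://<bucket>/warehouse \
  --conf spark.sql.parquet.filterPushdown=true \
  --conf spark.sql.parquet.mergeSchema=false \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.file.upload.path=s3a://<bucket>/uploads \
  --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=executor-template.yaml \
  local:///opt/spark/app/my_app.py
```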
This `spark-submit` command runs a Spark application in a Kubernetes cluster.
- S3 Access: The configurations `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` provide the access keys needed to interact with AWS S3. The `spark.hadoop.fs.s3a.endpoint` configuration sets the endpoint used to access the S3 service.
- Performance Optimizations: This command contains several recommended optimizations for S3, including enabling fast upload (`spark.hadoop.fs.s3a.fast.upload`), using the `directory` committer (`spark.hadoop.fs.s3a.committer.name`), and more.
- Hive Metastore: The `spark.sql.catalogImplementation` configuration is set to `hive`, which means the application will use Hive's catalog implementation. The `spark.hadoop.hive.metastore.uris` configuration sets the URI for connecting to the Hive Metastore.
- Parquet Configurations: There are several configurations related to Parquet, a columnar storage file format. These tweak the behavior of the Spark application when it reads or writes Parquet files.
- Kubernetes-specific Configurations: The configurations starting with `spark.kubernetes` are specific to running Spark on Kubernetes. They specify the Docker image for the Spark application, the path to upload application files, the Pod template files for the driver and executor Pods, and more.
Leverage Spark API with Hive Tables Stored in S3
One of the key advantages of this setup is the ability to utilize the general Spark API with Hive tables, which are stored in S3, without needing additional configuration within your code. This provides the flexibility and ease of using Spark’s powerful transformations and actions directly on data stored in Hive tables.
Consider the following scenario: you have a DataFrame that you want to write into a Hive partitioned table. With Spark, Hive, and S3 integrated, you can use the DataFrame API to accomplish this:
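A minimal sketch of such a write, assuming an existing SparkSession with Hive support, a DataFrame `df`, and hypothetical table and column names:

```python
# Write the DataFrame as a Hive table partitioned by "partitionColumn".
# Because spark.sql.warehouse.dir points at S3, the Parquet files land
# in the S3 bucket while the table metadata goes to the Hive Metastore.
df.write \
    .partitionBy("partitionColumn") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("my_database.my_table")
```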
In this case, the partitioned column `partitionColumn` will be moved to the end of the schema. The data gets stored in an S3 bucket acting as the warehouse. You can then seamlessly read from this Hive table stored in S3 using Spark's SQL interface:
This straightforward interaction between Spark, Hive, and S3 simplifies your data processing tasks and enhances your ability to manage and analyze big data. It’s a testament to the power of combining these technologies in a unified data platform.
Common Issues and Troubleshooting
When deploying Apache Hive on Kubernetes, you might run into some challenges. In this section, we will address two main issues that are commonly encountered when using the `apache/hive:3.1.3` Docker image for the Hive Metastore service, and provide solutions to overcome them.
Issue 1: Bug in the Docker Image Entrypoint
The first issue lies in the entrypoint of the Docker image. The default command specified in the image may not work as expected due to a bug.
Solution: The most effective workaround is to overwrite the default command with a custom one. We specify a new command in the `containers` section of our Kubernetes configuration. This command initializes the schema for the Derby database (which we use as the Metastore database in our case) and then starts the Hive Metastore service.
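In the container spec, the override might look roughly like this, assuming `/opt/hive` is the Hive home inside the image:

```yaml
command: ["sh", "-c"]
args:
  - /opt/hive/bin/schematool -dbType derby -initSchema &&
    /opt/hive/bin/hive --service metastore
```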
Issue 2: Missing AWS JARs and Download Utilities
The second issue is that the default Docker image for Hive does not include AWS-related JAR files which are crucial for connecting the Metastore service to an S3 bucket. Moreover, the image does not contain utilities to download these JARs.
Solution: To solve this problem, we add an init container in our Kubernetes configuration. This init container is based on a `busybox` image, which contains utilities for downloading files. We use this container to download the necessary AWS JAR files from a Maven repository and store them in a shared volume. We then add these JARs to the classpath of our main Hive Metastore container.
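The init container portion of the spec can be sketched as follows, with the JAR versions and Maven URLs as assumptions (the `jars` volume is an `emptyDir` shared with the main container):

```yaml
initContainers:
  - name: download-aws-jars
    image: busybox:1.28
    command: ["sh", "-c"]
    args:
      - wget -P /jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.0/hadoop-aws-3.1.0.jar &&
        wget -P /jars https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
    volumeMounts:
      - name: jars
        mountPath: /jars
```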
These solutions help us establish a Hive Metastore service on Kubernetes that can connect to an S3 bucket as the warehouse location.
Conclusion
In this tutorial, we’ve explored how to run Spark on Kubernetes with a remote Hive Metastore and S3 as our data warehouse. This setup provides a scalable and efficient solution for running big data applications. With the power of Kubernetes, we can easily scale our Spark application to meet our needs. And with the versatility of S3, we can store large amounts of data at a low cost.
I hope you found this tutorial helpful. If you have any questions or comments, please feel free to leave them below. I’d love to hear about your experience with running Spark on Kubernetes, or any insights you might have about working with Hive and S3.
References and Further Reading
For more in-depth information about the topics covered in this tutorial, I recommend the following resources:
- Running Spark on Kubernetes
- Integration with Cloud Infrastructures
- Hadoop-AWS module: Integration with Amazon Web Services
- AdminManual Metastore 3.0 Administration
- Run Apache Hive inside docker container in pseudo-distributed mode
Stay tuned for more posts on Big Data, Spark, Kubernetes, and more!