Friday, February 28, 2025

Design patterns for implementing Hive Metastore for Amazon EMR on EKS


In modern data architectures, the need to manage and query vast datasets efficiently, consistently, and accurately is paramount. For organizations that deal with large-scale data processing, managing metadata becomes a critical concern. This is where Hive Metastore (HMS) can serve as a central metadata store, playing a vital role in these modern data architectures.

HMS is a central repository of metadata for Apache Hive tables and other data lake table formats (for example, Apache Iceberg), providing clients (such as Apache Hive, Apache Spark, and Trino) access to this information through the Metastore Service API. Over time, HMS has become a foundational component for data lakes, integrating with a diverse ecosystem of open source and proprietary tools.

In non-containerized environments, there was typically only one approach to implementing HMS: running it as a service in an Apache Hadoop cluster. With the advent of containerization in data lakes through technologies such as Docker and Kubernetes, multiple options for implementing HMS have emerged. These options offer greater flexibility, allowing organizations to tailor the HMS deployment to their specific needs and infrastructure.

In this post, we explore the architecture patterns and demonstrate their implementation using Amazon EMR on EKS with the Spark Operator job submission type, walking through the trade-offs to help you choose the best approach for your use case.

Solution overview

Prior to Hive 3.0, HMS was tightly integrated with Hive and other Hadoop ecosystem components. Hive 3.0 introduced a standalone Hive Metastore. This new version of HMS functions as an independent service, decoupled from other Hive and Hadoop components such as HiveServer2. This separation allows various applications, such as Apache Spark, to interact directly with HMS without requiring a full Hive and Hadoop environment installation. You can learn more about the other components of Apache Hive on the Apache Hive Design page.

In this post, we use a standalone Hive Metastore to illustrate the architecture and implementation details of the various design patterns. Any reference to HMS refers to a standalone Hive Metastore.

HMS broadly consists of two main components:

  • Backend database: A persistent data store that holds all the metadata, such as table schemas, partitions, and data locations.
  • Metastore service API: A stateless service that implements the core functionality of HMS, handling read and write operations against the backend database.
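To make these two components concrete, the following sketch shows how they typically meet in the standalone metastore's configuration: a JDBC connection for the backend database and a Thrift port for the service API. This is illustrative only; the endpoint, bucket, and database names are placeholders, and the deployment later in this post supplies equivalent settings through container environment variables rather than a metastore-site.xml.

```yaml
# Illustrative only: core standalone HMS settings, shown as a Kubernetes
# ConfigMap wrapping a metastore-site.xml. Endpoint, bucket, and database
# names are placeholders, not values from this post's deployment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hms-site-example
data:
  metastore-site.xml: |
    <configuration>
      <!-- Backend database: where all metadata is persisted -->
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://example-db-endpoint:5432/hivemetastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
      </property>
      <!-- Metastore service API: the stateless Thrift endpoint clients call -->
      <property>
        <name>metastore.thrift.port</name>
        <value>9083</value>
      </property>
      <property>
        <name>metastore.warehouse.dir</name>
        <value>s3a://example-bucket/warehouse</value>
      </property>
    </configuration>
```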

Containerization and Kubernetes offer several architecture and implementation options for HMS, including running:

  • HMS as a sidecar container within the data processing application pod
  • HMS as a dedicated deployment within the same data processing EKS cluster
  • HMS as an external service in a separate EKS cluster

In this post, we use Apache Spark as the data processing framework to demonstrate these three architectural patterns. However, these patterns aren't restricted to Spark and can be applied to any data processing framework, such as Hive or Trino, that relies on HMS for managing metadata and accessing catalog information.

Note that in a Spark application, the driver is responsible for querying the metastore to fetch table schemas and locations, and then distributes this information to the executors. Executors process the data using the locations provided by the driver and never need to query the metastore directly. Hence, in the three patterns described in the following sections, only the driver communicates with HMS, not the executors.

HMS as a sidecar container

In this pattern, HMS runs as a sidecar container within the same pod as the data processing framework, such as Apache Spark. This approach uses the Kubernetes multi-container pod capability, allowing both HMS and the data processing framework to operate together in the same pod. The following figure illustrates this architecture, where the HMS container is part of the Spark driver pod.

HMS as sidecar container

This pattern is suited to small-scale deployments where simplicity is the priority. Because HMS is co-located with the Spark driver, it reduces network overhead and provides a straightforward setup. However, it's important to note that in this approach HMS operates exclusively within the scope of the parent application and isn't accessible by other applications. Additionally, row conflicts might arise when multiple jobs attempt to insert data into the same table concurrently. To address this, you should make sure that no two jobs write to the same table at the same time.

Consider this approach if you need a basic architecture. It's ideal for organizations where a single team manages both the data processing framework (for example, Apache Spark) and HMS, and there's no need for other applications to use HMS.

Cluster-dedicated HMS

In this pattern, HMS runs in one or more pods managed through a Kubernetes Deployment, typically within a dedicated namespace in the same data processing EKS cluster. The following figure illustrates this setup, with HMS decoupled from Spark driver pods and other workloads.

Cluster dedicated HMS

This pattern works well for medium-scale deployments where moderate isolation is sufficient and compute and data needs can be handled within one or more clusters. It offers a balance between resource efficiency and isolation, making it ideal for use cases where scaling metadata services independently is important but full decoupling isn't necessary. Additionally, this pattern works well when a single team manages both the data processing frameworks and HMS, ensuring streamlined operations and alignment with organizational responsibilities.

By decoupling HMS from Spark driver pods, it can serve multiple clients, such as Apache Spark and Trino, while sharing cluster resources. However, this approach might lead to resource contention during periods of high demand, which can be mitigated by implementing tenant isolation on the HMS pods.
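As a rough sketch of this pattern, the following manifest pairs an HMS Deployment with a ClusterIP Service in a dedicated namespace. The image URI, replica count, and resource values are illustrative assumptions, not the exact manifests deployed by the scripts in this post; the namespace, Service name, and port are chosen to produce the in-cluster DNS name that the Spark jobs in this walkthrough point at.

```yaml
# Illustrative sketch: an HMS Deployment plus ClusterIP Service in a
# dedicated namespace. Image, replicas, and resources are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: hive-metastore
spec:
  replicas: 2                       # HMS is stateless; scale independently of Spark
  selector:
    matchLabels:
      app: hive-metastore
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      containers:
        - name: hive-metastore
          image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/hive-metastore:latest
          ports:
            - containerPort: 9083   # Thrift endpoint
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
---
# The Service name and namespace yield the in-cluster DNS name
# hive-metastore-svc.hive-metastore.svc.cluster.local used by Spark jobs.
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore-svc
  namespace: hive-metastore
spec:
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083
```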

External HMS

In this architecture pattern, HMS is deployed in its own EKS cluster using a Kubernetes Deployment and exposed as a Kubernetes Service through the AWS Load Balancer Controller, separate from the data processing clusters. The following figure illustrates this setup, where HMS is configured as an external service, separate from the data processing clusters.

External HMS

This pattern suits scenarios where you want a centralized metastore service shared across multiple data processing clusters. HMS allows different data teams to manage their own data processing clusters while relying on the shared metastore for metadata management. By deploying HMS in a dedicated EKS cluster, this pattern provides maximum isolation, independent scaling, and the flexibility to operate and manage HMS as its own independent service.

While this approach offers clear separation of concerns and the ability to scale independently, it also introduces higher operational complexity and potentially increased costs because of the need to manage an additional cluster. Consider this pattern when you have strict compliance requirements, need to ensure full isolation for metadata services, or want to provide a unified metadata catalog service for multiple data teams. It works well in organizations where different teams manage their own data processing frameworks and rely on a shared metadata store for their data processing needs. Additionally, the separation allows specialized teams to focus on their respective areas.
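As an illustrative sketch, exposing HMS to other clusters can be done with a Service of type LoadBalancer managed by the AWS Load Balancer Controller. The annotations below request an internal Network Load Balancer; the annotation values and Service name here are assumptions, not the exact manifests the setup scripts in this post apply.

```yaml
# Illustrative sketch: exposing the HMS Thrift endpoint outside its EKS
# cluster through the AWS Load Balancer Controller. Annotation values and
# the Service name are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore-external
  namespace: hive-metastore
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
spec:
  type: LoadBalancer
  selector:
    app: hive-metastore
  ports:
    - port: 9083          # Thrift port reachable by other clusters' Spark drivers
      targetPort: 9083
```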

Deploy the solution

In the remainder of this post, you'll explore the implementation details for each of the three architecture patterns, using EMR on EKS with the Spark Operator job submission type as an example. Note that this implementation hasn't been tested with other EMR on EKS Spark job submission types. You'll begin by deploying the common components that serve as the foundation for all the architecture patterns. Next, you'll deploy the components specific to each pattern. Finally, you'll run Spark jobs that connect to the HMS implementation unique to each pattern and verify the successful execution and retrieval of data and metadata.

To streamline the setup process, we've automated the deployment of the common infrastructure components so you can focus on the essential aspects of each HMS architecture. We provide detailed information to help you understand each step, simplifying the setup while preserving the learning experience.

Scenario

To showcase the patterns, you'll create three clusters:

  • Two EMR on EKS clusters: analytics-cluster and datascience-cluster
  • An EKS cluster: hivemetastore-cluster

Both analytics-cluster and datascience-cluster serve as data processing clusters that run Spark workloads, while hivemetastore-cluster hosts the HMS.

You'll use analytics-cluster to illustrate the HMS as sidecar and cluster-dedicated patterns. You'll use all three clusters to demonstrate the external HMS pattern.

Source code

You can find the codebase in the AWS Samples GitHub repository.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Set up common infrastructure

Begin by setting up the infrastructure components that are common to all three architectures.

  1. Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.
git clone https://github.com/aws-samples/sample-emr-eks-hive-metastore-patterns.git
cd sample-emr-eks-hive-metastore-patterns
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>

  2. Execute the following script to create the shared infrastructure.
cd ${REPO_DIR}/setup
./setup.sh

  3. To verify successful infrastructure deployment, navigate to the AWS CloudFormation console, select your stack, and check the Events, Resources, and Outputs tabs for completion status, details, and the list of resources created.

You have completed the setup of the common components that serve as the foundation for all three architectures. You'll now deploy the components specific to each architecture and run Apache Spark jobs to validate the implementation.

HMS as a sidecar container

To implement HMS using the sidecar container pattern, the Spark application requires setting both sidecar and catalog properties in the job configuration file.

  1. Execute the following script to configure analytics-cluster for the sidecar pattern. For this post, we stored the HMS database credentials in a Kubernetes Secret object. We recommend using the Kubernetes External Secrets Operator to fetch HMS database credentials from AWS Secrets Manager.
cd ${REPO_DIR}/hms-sidecar
./configure-hms-sidecar.sh analytics-cluster
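If you follow the External Secrets Operator recommendation above, the wiring might look like the following sketch, which syncs the database password from AWS Secrets Manager into the hms-rds-password Kubernetes Secret that the job manifest consumes. The SecretStore name, Secrets Manager key, and refresh interval are assumptions for illustration.

```yaml
# Illustrative sketch: syncing the HMS database password from AWS Secrets
# Manager via the External Secrets Operator. The store name and the
# Secrets Manager key are assumptions, not values from this post's setup.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: hms-rds-password
  namespace: emr
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: hms-rds-password         # Secret referenced by the sidecar manifest
  data:
    - secretKey: HIVE_METASTORE_DB_PASSWORD
      remoteRef:
        key: hms/rds/password      # assumed Secrets Manager secret name
```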

  2. Review the Spark job manifest file spark-hms-sidecar-job.yaml. This file was created by substituting variables in the spark-hms-sidecar-job.tpl template in the previous step. The following samples highlight key sections of the manifest file.
spec:
  driver:
  ...
    sidecars:
      # Hive Metastore sidecar container
      - name: hive-metastore
        image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/hive-metastore:latest
        env:
          # These settings configure the metastore to use Amazon RDS for PostgreSQL as the backend database, connecting through the RDS Proxy endpoint via the JDBC URL
          - name: HIVE_METASTORE_DB_TYPE
            value: postgres
          - name: HIVE_METASTORE_DB_DRIVER
            value: org.postgresql.Driver
          - name: HIVE_METASTORE_DB_URL
            value: jdbc:postgresql://${HMS_RDS_PROXY_ENDPOINT}:5432/hivemetastore
          # The warehouse location is specified as an S3 bucket
          - name: HIVE_METASTORE_WAREHOUSE_LOC
            value: s3a://${S3_BUCKET_NAME}/warehouse
          - name: AWS_REGION
            value: ${AWS_REGION}
          # The database username and password are passed via environment variables. The password is retrieved from a Kubernetes Secret
          - name: HIVE_METASTORE_DB_USER
            value: hive_metastore_user
          - name: HIVE_METASTORE_DB_PASSWORD
            valueFrom:
              secretKeyRef:
                name: hms-rds-password
                key: HIVE_METASTORE_DB_PASSWORD

Spark job configuration

spec:
  sparkConf:
    # Hive catalog properties
    # Sets Spark to use Hive as the catalog for SQL operations
    spark.sql.catalogImplementation: "hive"
    # HMS is exposed on localhost:9083 because the sidecar runs in the same pod; Spark connects to the sidecar via this URI
    spark.hadoop.hive.metastore.uris: "thrift://localhost:9083"
    # The data location is set to an S3 bucket
    spark.sql.warehouse.dir: "s3a://${S3_BUCKET_NAME}/warehouse"
    # Retrieve temporary AWS credentials by assuming a role using a web identity token
    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # Configure Spark to use the S3A filesystem
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"

Submit the Spark job and verify the HMS as sidecar container setup

In this pattern, you'll submit Spark jobs in analytics-cluster. The Spark jobs connect to the HMS service running as a sidecar container in the driver pod.

  1. Run the Spark job to verify that the setup was successful.
kubectl apply -f spark-hms-sidecar-job.yaml

  2. Describe the sparkapplication object.
kubectl get sparkapplication -n emr
kubectl describe sparkapplication spark-hms-sidecar-job --namespace emr

  3. List the pods and observe the number of containers attached to the driver pod. Wait until the status changes from ContainerCreating to Running (this should take only a few seconds).
  4. View the driver logs to validate the output.
kubectl logs spark-hms-sidecar-job-driver --namespace emr

  5. If you encounter the following error, wait a few minutes and rerun the previous command.
Error from server (BadRequest): container "spark-kubernetes-driver" in pod "spark-hms-sidecar-driver" is waiting to start: ContainerCreating

  6. After successful completion of the job, you should see the following message in the logs. The tabular output validates the setup of HMS as a sidecar container.
...
24/09/17 21:44:00 INFO metastore: Trying to connect to metastore with URI thrift://localhost:9083
24/09/17 21:44:00 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/17 21:44:00 INFO metastore: Connected to metastore.
...
...
+-------+---+-------+---+
|pattern|id |name   |age|
+-------+---+-------+---+
|Sidecar|1  |Alice  |30 |
|Sidecar|2  |Bob    |25 |
|Sidecar|3  |Charlie|35 |
+-------+---+-------+---+

Cluster-dedicated HMS

To implement the cluster-dedicated HMS pattern, the Spark application requires setting the HMS URI and catalog properties in the job configuration file.

  1. Execute the following script to configure analytics-cluster for the cluster-dedicated pattern.
cd ${REPO_DIR}/hms-cluster-dedicated
./configure-hms-cluster-dedicated.sh analytics-cluster

  2. Verify the HMS deployment by listing the pods and viewing the logs. The absence of Java exceptions in the logs confirms that the Hive Metastore service is running successfully.
kubectl get pods --namespace hive-metastore
kubectl logs <HMS-PODNAME> --namespace hive-metastore

  3. Review the Spark job manifest file spark-hms-cluster-dedicated-job.yaml. This file is created by substituting variables in the spark-hms-cluster-dedicated-job.tpl template in the previous step. The following sample highlights key sections of the manifest file.
spec:
  sparkConf:
    # Hive catalog properties
    # Sets Spark to use Hive as the catalog for SQL operations
    spark.sql.catalogImplementation: "hive"
    # HMS is running in a pod in the same EKS cluster; Spark connects to it via this URI
    spark.hadoop.hive.metastore.uris: "thrift://hive-metastore-svc.hive-metastore.svc.cluster.local:9083"
    # The data location is set to an S3 bucket
    spark.sql.warehouse.dir: "s3a://${S3_BUCKET_NAME}/warehouse"
    # Retrieve temporary AWS credentials by assuming a role using a web identity token
    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # Configure Spark to use the S3A filesystem
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"

Submit the Spark job and verify the cluster-dedicated HMS setup

In this pattern, you'll submit Spark jobs in analytics-cluster. The Spark jobs connect to the HMS service in the same data processing EKS cluster.

  1. Submit the job.
kubectl apply -f spark-hms-cluster-dedicated-job.yaml -n emr

  2. Verify the status.
kubectl get sparkapplication -n emr
kubectl describe sparkapplication spark-hms-cluster-dedicated-job --namespace emr

  3. Describe the driver pod and observe the number of containers attached to it. Wait until the status changes from ContainerCreating to Running (this should take only a few seconds).
  4. View the driver logs to validate the output.
kubectl logs spark-hms-cluster-dedicated-job-driver --namespace emr

  5. After successful completion of the job, you should see the following message in the logs. The tabular output validates the setup of the cluster-dedicated HMS.
...
24/09/17 21:44:00 INFO metastore: Trying to connect to metastore with URI thrift://hive-metastore-svc.hive-metastore.svc.cluster.local:9083
24/09/17 21:44:00 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/17 21:44:00 INFO metastore: Connected to metastore.
...
...
+-----------------+---+-------+---+
|pattern          |id |name   |age|
+-----------------+---+-------+---+
|Cluster Dedicated|1  |Alice  |30 |
|Cluster Dedicated|2  |Bob    |25 |
|Cluster Dedicated|3  |Charlie|35 |
+-----------------+---+-------+---+

External HMS

To implement the external HMS pattern, the Spark application requires setting the HMS URI to the service endpoint exposed by hivemetastore-cluster.

  1. Execute the following script to configure hivemetastore-cluster for the external HMS pattern.
cd ${REPO_DIR}/hms-external
./configure-hms-external.sh

  2. Review the Spark job manifest file spark-hms-external-job.yaml. This file is created by substituting variables in the spark-hms-external-job.tpl template during the setup process. The following sample highlights key sections of the manifest file.
spec:
  sparkConf:
    # Hive catalog properties
    # Sets Spark to use Hive as the catalog for SQL operations
    spark.sql.catalogImplementation: "hive"
    # HMS is running in a separate cluster; Spark connects to it via this URI
    spark.hadoop.hive.metastore.uris: "thrift://${HMS_URI_ENDPOINT}:9083"
    # The data location is set to an S3 bucket
    spark.sql.warehouse.dir: "s3a://${S3_BUCKET_NAME}/warehouse"
    # Retrieve temporary AWS credentials by assuming a role using a web identity token
    spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # Configure Spark to use the S3A filesystem
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"

Submit the Spark job and verify the HMS in a separate EKS cluster setup

To verify the setup, submit Spark jobs in analytics-cluster and datascience-cluster. The Spark jobs connect to the HMS service in hivemetastore-cluster.

Use the following steps first for analytics-cluster and then for datascience-cluster to verify that both clusters can connect to the HMS on hivemetastore-cluster.

  1. Run the Spark job to test the setup. Replace <CONTEXT_NAME> with the Kubernetes context for analytics-cluster and then for datascience-cluster.
kubectl config use-context <CONTEXT_NAME>
kubectl apply -f spark-hms-external-job.yaml -n emr

  2. Describe the sparkapplication object.
kubectl get sparkapplication spark-hms-external-job -n emr
kubectl describe sparkapplication spark-hms-external-job --namespace emr

  3. List the pods and observe the number of containers attached to the driver pod. Wait until the status changes from ContainerCreating to Running (this should take only a few seconds).
  4. View the driver logs to validate the output on the data processing cluster.
kubectl logs spark-hms-external-job-driver --namespace emr

  5. After successful completion of the job, you should see the following message in the logs. The tabular output validates the setup of HMS in a separate EKS cluster.
...
24/09/17 21:44:00 INFO metastore: Trying to connect to metastore with URI thrift://k8s-hivemeta-hmsexter-xxxxxx.elb.us-east-1.amazonaws.com:9083
24/09/17 21:44:00 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/17 21:44:00 INFO metastore: Connected to metastore.
...
...
+--------+---+-------+---+
|pattern |id |name   |age|
+--------+---+-------+---+
|External|1  |Alice  |30 |
|External|2  |Bob    |25 |
|External|3  |Charlie|35 |
+--------+---+-------+---+

Clean up

To avoid incurring future charges from the resources created in this tutorial, clean up your environment after you've completed the steps. You can do this by running the cleanup.sh script, which safely removes all the resources provisioned during the setup.

cd ${REPO_DIR}/setup
./cleanup.sh

Conclusion

In this post, we explored design patterns for implementing the Hive Metastore (HMS) with EMR on EKS and the Spark Operator, each offering distinct advantages depending on your requirements. Whether you choose to deploy HMS as a sidecar container within the Apache Spark driver pod, as a Kubernetes Deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.

We encourage you to experiment with these patterns in your own environments, adapting them to fit your unique workloads and operational needs. By understanding and applying these design patterns, you can optimize your Hive Metastore deployments for performance, scalability, and security in your EMR on EKS environments. Explore further by deploying the solution in your AWS account, and share your experiences and insights with the community.


About the Authors

Avinash Desireddy is a Cloud Infrastructure Architect at AWS, passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers containerize applications, streamline deployments, and optimize cloud-native environments.

Suvojit Dasgupta is a Principal Data Architect at AWS. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.
