
Batch data ingestion into Amazon OpenSearch Service using AWS Glue


Organizations continuously work to process and analyze vast volumes of data to derive actionable insights. Effective data ingestion and search capabilities have become essential for use cases like log analytics, application search, and enterprise search. These use cases demand a robust pipeline that can handle high data volumes and enable efficient data exploration.

Apache Spark, an open source powerhouse for large-scale data processing, is widely recognized for its speed, scalability, and ease of use. Its ability to process and transform massive datasets has made it an indispensable tool in modern data engineering. Amazon OpenSearch Service, a community-driven search and analytics solution, empowers organizations to search, aggregate, visualize, and analyze data seamlessly. Together, Spark and OpenSearch Service offer a compelling solution for building powerful data pipelines. However, ingesting data from Spark into OpenSearch Service can present challenges, especially with diverse data sources.

This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.

Overview of solution

AWS Glue is a serverless data integration service that simplifies data preparation and integration tasks for analytics, machine learning, and application development. In this post, we focus on batch data ingestion into OpenSearch Service using Spark on AWS Glue.

AWS Glue offers several integration options with OpenSearch Service using various open source and AWS managed libraries, including:

    • The OpenSearch Spark library
    • The Elasticsearch Hadoop library
    • The AWS Glue OpenSearch Service connection

In the following sections, we explore each integration method in detail, guiding you through the setup and implementation. As we progress, we incrementally build the architecture diagram shown in the following figure, providing a clear path for creating robust data pipelines on AWS. Each implementation is independent of the others. We chose to showcase them individually, because in a real-world scenario, only one of the three integration methods is likely to be used.

Image showing the high level architecture diagram

You can find the code base in the accompanying GitHub repo. In the following sections, we walk through the steps to implement the solution.

Prerequisites

Before you deploy this solution, make sure that the necessary prerequisites are in place.

Clone the repository to your local machine

Clone the repository to your local machine and set the BLOG_DIR environment variable. All the relative paths assume BLOG_DIR is set to the repository location on your machine. If BLOG_DIR is not being used, adjust the path accordingly.

git clone git@github.com:aws-samples/opensearch-glue-integration-patterns.git
cd opensearch-glue-integration-patterns
export BLOG_DIR=$(pwd)

Deploy the AWS CloudFormation template to create the required infrastructure

The primary focus of this post is to demonstrate how to use the mentioned libraries in Spark on AWS Glue to ingest data into OpenSearch Service. Although we center on this core topic, several key AWS components need to be pre-provisioned for the integration examples, such as an Amazon Virtual Private Cloud (Amazon VPC), multiple subnets, an AWS Key Management Service (AWS KMS) key, an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue role, and an OpenSearch Service cluster with domains for OpenSearch Service and Elasticsearch. To simplify the setup, we have automated the provisioning of this core infrastructure using the cloudformation/opensearch-glue-infrastructure.yaml AWS CloudFormation template.

  1. Run the following commands.

The CloudFormation template will deploy the required networking components (such as the VPC and subnets), Amazon CloudWatch logging, an AWS Glue role, and the OpenSearch Service and Elasticsearch domains required to implement the proposed architecture. Use a strong password (8–128 characters, containing three of the following: lowercase letters, uppercase letters, numbers, or special characters, and no /, ", or spaces) and adhere to your organization's security standards for ESMasterUserPassword and OSMasterUserPassword in the following command:

cd ${BLOG_DIR}/cloudformation/
aws cloudformation deploy \
--template-file ${BLOG_DIR}/cloudformation/opensearch-glue-infrastructure.yaml \
--stack-name GlueOpenSearchStack \
--capabilities CAPABILITY_NAMED_IAM \
--region <AWS_REGION> \
--parameter-overrides \
ESMasterUserPassword=<ES_MASTER_USER_PASSWORD> \
OSMasterUserPassword=<OS_MASTER_USER_PASSWORD>

You should see a success message such as "Successfully created/updated stack - GlueOpenSearchStack" after the resources have been provisioned successfully. Provisioning this CloudFormation stack typically takes approximately 30 minutes to complete.

  2. On the AWS CloudFormation console, locate the GlueOpenSearchStack stack, and confirm that its status is CREATE_COMPLETE.

Image showing the "CREATE_COMPLETE" status of cloudformation template

You can review the deployed resources on the Resources tab, as shown in the following screenshot. The screenshot doesn't show all the created resources.

Image showing the "Resources" tab of cloudformation template

Additional setup steps

In this section, we collect essential information, including the S3 bucket name and the OpenSearch Service and Elasticsearch domain endpoints. These details are required for executing the code in subsequent sections.

Capture the details of the provisioned resources

Use the following AWS CLI command to extract and save the output values from the CloudFormation stack to a file named GlueOpenSearchStack_outputs.txt. We refer to the values in this file in upcoming steps.

aws cloudformation describe-stacks \
--stack-name GlueOpenSearchStack \
--query 'sort_by(Stacks[0].Outputs[], &OutputKey)[].{Key:OutputKey,Value:OutputValue}' \
--output table \
--no-cli-pager \
--region <AWS_REGION> > ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Download the NY Green Taxi December 2022 dataset and copy it to the S3 bucket

The goal of this post is to demonstrate the technical implementation of ingesting data into OpenSearch Service using AWS Glue. Understanding the dataset itself is not essential, apart from its data format, which we discuss in the AWS Glue notebooks in later sections. To learn more about the dataset, you can find additional information on the NYC Taxi and Limousine Commission website.

We specifically request that you download the December 2022 dataset, because we have tested the solution using this particular dataset:

S3_BUCKET_NAME=$(awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt)
mkdir -p ${BLOG_DIR}/datasets && cd ${BLOG_DIR}/datasets
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-12.parquet
aws s3 cp green_tripdata_2022-12.parquet s3://${S3_BUCKET_NAME}/datasets/green_tripdata_2022-12.parquet

Download the required JARs from the Maven repository and copy them to the S3 bucket

We have specified particular JAR file versions to ensure a stable deployment experience. However, we recommend adhering to your organization's security best practices and reviewing any known vulnerabilities in the versions of the JAR files before deployment. AWS doesn't guarantee the security of any open source code used here. Additionally, verify each downloaded JAR file's checksum against the published value to confirm its integrity and authenticity.

mkdir -p ${BLOG_DIR}/jars && cd ${BLOG_DIR}/jars
# OpenSearch Service jar
curl -O https://repo1.maven.org/maven2/org/opensearch/client/opensearch-spark-30_2.12/1.0.1/opensearch-spark-30_2.12-1.0.1.jar
aws s3 cp opensearch-spark-30_2.12-1.0.1.jar s3://${S3_BUCKET_NAME}/jars/opensearch-spark-30_2.12-1.0.1.jar
# Elasticsearch jar
curl -O https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/7.17.23/elasticsearch-spark-30_2.12-7.17.23.jar
aws s3 cp elasticsearch-spark-30_2.12-7.17.23.jar s3://${S3_BUCKET_NAME}/jars/elasticsearch-spark-30_2.12-7.17.23.jar

In the following sections, we implement the individual data ingestion methods as outlined in the architecture diagram.

Ingest data into OpenSearch Service using the OpenSearch Spark library

In this section, we load an OpenSearch Service index using Spark and the OpenSearch Spark library. We demonstrate this implementation by using AWS Glue notebooks, employing basic authentication with a user name and password.

To demonstrate the ingestion mechanisms, we have provided the Spark-and-OpenSearch-Code-Steps.ipynb notebook with detailed instructions. Follow the steps in this section in conjunction with the instructions in the notebook.
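The notebook contains the full, tested flow, but for orientation the core write call looks roughly like the following. This is a minimal sketch, assuming a Glue notebook session where spark is already defined, the opensearch-spark JAR is attached, and os_domain_endpoint, os_user, os_password, and s3_bucket are illustrative variable names you populate from the stack outputs (they are not names defined by this post):

# Minimal sketch (not the exact notebook code): write a DataFrame to an
# OpenSearch Service index with the OpenSearch Spark library.
df = spark.read.parquet(f"s3://{s3_bucket}/datasets/green_tripdata_2022-12.parquet")

(df.write
    .format("opensearch")                               # data source registered by the JAR
    .option("opensearch.nodes", os_domain_endpoint)     # domain endpoint, without https://
    .option("opensearch.port", "443")
    .option("opensearch.net.ssl", "true")
    .option("opensearch.nodes.wan.only", "true")        # required for a managed domain
    .option("opensearch.net.http.auth.user", os_user)   # basic authentication
    .option("opensearch.net.http.auth.pass", os_password)
    .mode("append")
    .save("green_taxi"))                                # target index name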

Set up the AWS Glue Studio notebook

Complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

  3. Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-OpenSearch-Code-Steps.ipynb.
  4. For IAM role, choose the AWS Glue job IAM role that starts with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

  5. Enter a name for the notebook (for example, Spark-and-OpenSearch-Code-Steps) and choose Save.

Image showing AWS Glue OpenSearch Notebook

Replace the placeholder values in the notebook

Complete the following steps to update the placeholders in the notebook:

  1. In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:
cd ${BLOG_DIR}
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

  2. In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:
awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

  3. In Step 4 in the notebook, replace <OPEN-SEARCH-DOMAIN-WITHOUT-HTTPS> with the OpenSearch Service domain name. You can get the domain name by executing the following command:
awk -F '|' '$2 ~ /OpenSearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell of the notebook to load data into the OpenSearch Service domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.
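If you want a quick sanity check outside the notebook's own verification cells, a read-back can be as simple as the following sketch, reusing the same illustrative connection variables as the write example above:

# Sketch: count the documents in the index to confirm the load succeeded.
doc_count = (spark.read
    .format("opensearch")
    .option("opensearch.nodes", os_domain_endpoint)
    .option("opensearch.port", "443")
    .option("opensearch.net.ssl", "true")
    .option("opensearch.nodes.wan.only", "true")
    .option("opensearch.net.http.auth.user", os_user)
    .option("opensearch.net.http.auth.pass", os_password)
    .load("green_taxi")
    .count())
print(f"Documents in green_taxi: {doc_count}")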

Spark write modes (append vs. overwrite)

We recommend writing data incrementally into OpenSearch Service indexes using the append mode, as demonstrated in Step 8 in the notebook. However, in certain cases, you may need to refresh the entire dataset in the OpenSearch Service index. In these scenarios, you can use the overwrite mode, though it is not advised for large indexes. When using overwrite mode, the Spark library deletes rows from the OpenSearch Service index one by one and then rewrites the data, which can be inefficient for large datasets. To avoid this, you can implement a preprocessing step in Spark to identify insertions and updates, and then write the data into OpenSearch Service using append mode.
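The following sketch contrasts the two modes. Here, df_increment and df_full are placeholder DataFrames, and the connection options repeat the illustrative values used earlier:

# Shared connection options (illustrative values, as before).
os_options = {
    "opensearch.nodes": os_domain_endpoint,
    "opensearch.port": "443",
    "opensearch.net.ssl": "true",
    "opensearch.nodes.wan.only": "true",
    "opensearch.net.http.auth.user": os_user,
    "opensearch.net.http.auth.pass": os_password,
}

# Incremental load (recommended): appends documents to the index.
df_increment.write.format("opensearch").options(**os_options).mode("append").save("green_taxi")

# Full refresh (avoid for large indexes): existing documents are removed
# before the data is rewritten.
df_full.write.format("opensearch").options(**os_options).mode("overwrite").save("green_taxi")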

Ingest data into Elasticsearch using the Elasticsearch Hadoop library

In this section, we load an Elasticsearch index using Spark and the Elasticsearch Hadoop library. We demonstrate this implementation by using AWS Glue as the engine for Spark.
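The write path mirrors the OpenSearch Spark example, with es.-prefixed options and the es data source alias instead. A minimal sketch, again assuming a Glue notebook session and illustrative variables (es_domain_endpoint, es_user, es_password, s3_bucket):

# Minimal sketch (not the exact notebook code): write a DataFrame to an
# Elasticsearch index with the Elasticsearch Hadoop library.
df = spark.read.parquet(f"s3://{s3_bucket}/datasets/green_tripdata_2022-12.parquet")

(df.write
    .format("es")                                # alias for org.elasticsearch.spark.sql
    .option("es.nodes", es_domain_endpoint)      # domain endpoint, without https://
    .option("es.port", "443")
    .option("es.net.ssl", "true")
    .option("es.nodes.wan.only", "true")
    .option("es.net.http.auth.user", es_user)
    .option("es.net.http.auth.pass", es_password)
    .mode("append")
    .save("green_taxi"))                         # target index name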

Set up the AWS Glue Studio notebook

Complete the following steps to set up the notebook:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

  3. Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-Elasticsearch-Code-Steps.ipynb.
  4. For IAM role, choose the AWS Glue job IAM role that starts with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

  5. Enter a name for the notebook (for example, Spark-and-ElasticSearch-Code-Steps) and choose Save.

Image showing AWS Glue Elasticsearch Notebook

Replace the placeholder values in the notebook

Complete the following steps:

  1. In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

  2. In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:
awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

  3. In Step 4 in the notebook, replace <ELASTIC-SEARCH-DOMAIN-WITHOUT-HTTPS> with the Elasticsearch domain name. You can get the domain name by executing the following command:
awk -F '|' '$2 ~ /ElasticsearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell in the notebook to load data to the Elasticsearch domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Ingest data into OpenSearch Service using the AWS Glue OpenSearch Service connection

In this section, we load an OpenSearch Service index using Spark and the AWS Glue OpenSearch Service connection.

Create the AWS Glue job

Complete the following steps to create an AWS Glue Visual ETL job:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Create job, choose Visual ETL.

This will open the AWS Glue job visual editor.

Image showing AWS console page for AWS Glue to open Visual ETL

  3. Choose the plus sign, and under Sources, choose Amazon S3.

Image showing AWS console page for AWS Glue Visual Editor

  4. In the visual editor, choose the Data Source – S3 bucket node.
  5. In the Data source properties – S3 pane, configure the data source as follows:
    • For S3 source type, select S3 location.
    • For S3 URL, choose Browse S3, and choose the green_tripdata_2022-12.parquet file from the designated S3 bucket.
    • For Data format, choose Parquet.
  6. Choose Infer schema to let AWS Glue detect the schema of the data.

This will set up your data source from the specified S3 bucket.

Image showing AWS console page for AWS Glue Visual Editor

  7. Choose the plus sign again to add a new node.
  8. For Transforms, choose Drop Fields to include this transformation step.

This will let you remove any unnecessary fields from your dataset before loading it into OpenSearch Service.

Image showing AWS console page for AWS Glue Visual Editor

  9. Choose the Drop Fields transform node, then select the following fields to drop from the dataset:
    • payment_type
    • trip_type
    • congestion_surcharge

This will remove these fields from the data before it's loaded into OpenSearch Service.

Image showing AWS console page for AWS Glue Visual Editor

  10. Choose the plus sign again to add a new node.
  11. For Targets, choose Amazon OpenSearch Service.

This will configure OpenSearch Service as the destination for the data being processed.

Image showing AWS console page for AWS Glue Visual Editor

  12. Choose the Data target – Amazon OpenSearch Service node and configure it as follows:
    • For Amazon OpenSearch Service connection, choose the connection GlueOpenSearchServiceConnec-* from the drop-down.
    • For Index, enter green_taxi. The green_taxi index was created earlier in the "Ingest data into OpenSearch Service using the OpenSearch Spark library" section.

This configures OpenSearch Service to write the processed data to the specified index.

Image showing AWS console page for AWS Glue Visual Editor

  13. On the Job details tab, update the job details as follows:
    • For Name, enter a name (for example, Spark-and-Glue-OpenSearch-Connection).
    • For Description, enter an optional description (for example, AWS Glue job using the Glue OpenSearch connection to load data into Amazon OpenSearch Service).
    • For IAM Role, choose the role starting with GlueOpenSearchStack-GlueRole-*.
    • For Glue version, choose Glue 4.0 – Supports Spark 3.3, Scala 2, Python 3.
    • Leave the rest of the fields as default.
    • Choose Save to save the changes.

Image showing AWS console page for AWS Glue Visual Editor

  14. To run the AWS Glue job Spark-and-Glue-OpenSearch-Connector, choose Run.

This will initiate the job execution.

Image showing AWS console page for AWS Glue Visual Editor

  15. Choose the Runs tab and wait for the AWS Glue job to complete successfully.

You will see the status change to Succeeded when the job is complete.

Image showing AWS console page for AWS Glue job run status
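For reference, the visual job above maps to a Glue script along the following lines. This is a sketch rather than the exact code AWS Glue generates; the S3 path and connection name are placeholders, and the opensearch connection option keys should be verified against the AWS Glue documentation for your Glue version:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the Parquet file uploaded earlier (bucket name is a placeholder).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<S3-BUCKET-NAME>/datasets/"]},
    format="parquet",
)

# Transform: drop the same three fields as the visual Drop Fields node.
trimmed = DropFields.apply(
    frame=source,
    paths=["payment_type", "trip_type", "congestion_surcharge"],
)

# Target: write through the AWS Glue OpenSearch Service connection.
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="opensearch",
    connection_options={
        "connectionName": "<GLUE-OPENSEARCH-CONNECTION-NAME>",
        "opensearch.resource": "green_taxi",
    },
)

job.commit()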

Clean up

To clean up your resources, complete the following steps:

  1. Delete the CloudFormation stack:
aws cloudformation delete-stack \
--stack-name GlueOpenSearchStack \
--region <AWS_REGION>

  2. Delete the AWS Glue jobs:
    • On the AWS Glue console, under ETL jobs in the navigation pane, choose Visual ETL.
    • Select the jobs you created (Spark-and-Glue-OpenSearch-Connector, Spark-and-ElasticSearch-Code-Steps, and Spark-and-OpenSearch-Code-Steps) and on the Actions menu, choose Delete.

Conclusion

In this post, we explored multiple ways to ingest data into OpenSearch Service using Spark on AWS Glue. We demonstrated the use of three key libraries: the AWS Glue OpenSearch Service connection, the OpenSearch Spark library, and the Elasticsearch Hadoop library. The methods outlined in this post can help you streamline your data ingestion into OpenSearch Service.

If you're interested in learning more and getting hands-on experience, we've created a workshop that walks you through the entire process in detail. You can explore the full setup for ingesting data into OpenSearch Service, handling both batch and real-time streams, and building dashboards. Check out the workshop Unified Real-Time Data Processing and Analytics Using Amazon OpenSearch and Apache Spark to deepen your understanding and apply these methods step by step.


About the Authors

Ravikiran Rao is a Data Architect at Amazon Web Services and is passionate about solving complex data challenges for various customers. Outside of work, he is a theater enthusiast and amateur tennis player.

Vishwa Gupta is a Senior Data Architect with the AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new foods.

Suvojit Dasgupta is a Principal Data Architect at Amazon Web Services. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.
