On this put up, we present methods to use Amazon Kinesis Information Streams to buffer and combination real-time streaming information for supply into Amazon OpenSearch Service domains and collections utilizing Amazon OpenSearch Ingestion. You need to use this method for quite a lot of use instances, from real-time log analytics to integrating software messaging information for real-time search. On this put up, we give attention to the use case for centralizing log aggregation for a corporation that has a compliance must archive and retain its log information.
Kinesis Information Streams is a completely managed, serverless information streaming service that shops and ingests numerous streaming information in actual time at any scale. For log analytics use instances, Kinesis Information Streams enhances log aggregation by decoupling producer and shopper functions, and offering a resilient, scalable buffer to seize and serve log information. This decoupling offers benefits over conventional architectures. As log producers scale up and down, Kinesis Information Streams may be scaled dynamically to persistently buffer log information. This prevents load modifications from impacting an OpenSearch Service area, and offers a resilient retailer of log information for consumption. It additionally permits for a number of customers to course of log information in actual time, offering a persistent retailer of real-time information for functions to devour. This permits the log analytics pipeline to satisfy Properly-Architected finest practices for resilience (REL04-BP02) and price (COST09-BP02).
OpenSearch Ingestion is a serverless pipeline that gives highly effective instruments for extracting, reworking, and loading information into an OpenSearch Service area. OpenSearch Ingestion integrates with many AWS companies, and offers ready-made blueprints to speed up ingesting information for quite a lot of analytics use instances into OpenSearch Service domains. When paired with Kinesis Information Streams, OpenSearch Ingestion permits for classy real-time analytics of knowledge, and helps scale back the undifferentiated heavy lifting of making a real-time search and analytics structure.
Resolution overview
On this resolution, we contemplate a typical use case for centralized log aggregation for a corporation. Organizations would possibly contemplate a centralized log aggregation method for quite a lot of causes. Many organizations have compliance and governance necessities which have stipulations for what information must be logged, and the way lengthy log information have to be retained and stay searchable for investigations. Different organizations search to consolidate software and safety operations, and supply widespread observability toolsets and capabilities throughout their groups.
To satisfy such necessities, you have to acquire information from log sources (producers) in a scalable, resilient, and cost-effective method. Log sources might differ between software and infrastructure use instances and configurations, as illustrated within the following desk.
Log Producer | Instance | Instance Producer Log Configuration |
Software Logs | AWS Lambda | Amazon CloudWatch Logs |
Software Brokers | FluentBit | Amazon OpenSearch Ingestion |
AWS Service Logs | Amazon Net Software Firewall | Amazon S3 |
The next diagram illustrates an instance structure.
You need to use Kinesis Information Streams for quite a lot of these use instances. You may configure Amazon CloudWatch logs to ship information to Kinesis Information Streams utilizing a subscription filter (see Actual-time processing of log information with subscriptions). In case you ship information with Kinesis Information Streams for analytics use instances, you should utilize OpenSearch Ingestion to create a scalable, extensible pipeline to devour your streaming information and write it to OpenSearch Service indexes. Kinesis Information Streams offers a buffer that may help a number of customers, configurable retention, and built-in integration with quite a lot of AWS companies. For different use instances the place information is saved in Amazon Easy Storage Service (Amazon S3), or the place an agent writes information resembling FluentBit, an agent can write information on to OpenSearch Ingestion with out an intermediate buffer due to OpenSearch Ingestion’s built-in persistent buffers and computerized scaling.
Standardizing logging approaches reduces improvement and operational overhead for organizations. For instance, you would possibly standardize on all functions logging to CloudWatch logs when possible, and in addition deal with Amazon S3 logs the place CloudWatch logs are unsupported. This reduces the variety of use instances {that a} centralized staff must deal with of their log aggregation method, and reduces the complexity of the log aggregation resolution. For extra subtle improvement groups, you would possibly standardize on utilizing FluentBit brokers to put in writing information on to OpenSearch Ingestion to decrease price when log information doesn’t must be saved in CloudWatch.
This resolution focuses on utilizing CloudWatch logs as an information supply for log aggregation. For the Amazon S3 log use case, see Utilizing an OpenSearch Ingestion pipeline with Amazon S3. For agent-based options, see the agent-specific documentation for integration with OpenSearch Ingestion, resembling Utilizing an OpenSearch Ingestion pipeline with Fluent Bit.
Stipulations
A number of key items of infrastructure used on this resolution are required to ingest information into OpenSearch Service with OpenSearch Ingestion:
- A Kinesis information stream to combination the log information from CloudWatch
- An OpenSearch area to retailer the log information
When creating the Kinesis information stream, we advocate beginning with On-Demand mode. It will permit Kinesis Information Streams to robotically scale the variety of shards wanted on your log throughput. After you establish the regular state workload on your log aggregation use case, we advocate transferring to Provisioned mode, utilizing the variety of shards recognized in On-Demand mode. This may also help you optimize long-term price for high-throughput use instances.
Basically, we advocate utilizing one Kinesis information stream on your log aggregation workload. OpenSearch Ingestion helps as much as 96 OCUs per pipeline, and 24,000 characters per pipeline definition file (see OpenSearch Ingestion quotas). Which means every pipeline can help a Kinesis information stream with as much as 96 shards, as a result of every OCU processes one shard. Utilizing one Kinesis information stream simplifies the general course of to combination log information into OpenSearch Service, and simplifies the method for creating and managing subscription filters for log teams.
Relying on the dimensions of your log workloads, and the complexity of your OpenSearch Ingestion pipeline logic, it’s possible you’ll contemplate extra Kinesis information streams on your use case. For instance, it’s possible you’ll contemplate one stream for every main log kind in your manufacturing workload. Having log information for various use instances separated into totally different streams may also help scale back the operational complexity of managing OpenSearch Ingestion pipelines, and lets you scale and deploy modifications to every log use case individually when required.
To create a Kinesis Information Stream, see Create an information stream.
To create an OpenSearch area, see Creating and managing Amazon OpenSearch domains.
Configure log subscription filters
You may implement CloudWatch log group subscription filters on the account degree or log group degree. In each instances, we advocate making a subscription filter with a random distribution methodology to verify log information is evenly distributed throughout Kinesis information stream shards.
Account-level subscription filters are utilized to all log teams in an account, and can be utilized to subscribe all log information to a single vacation spot. This works properly if you wish to retailer all of your log information in OpenSearch Service utilizing Kinesis Information Streams. There’s a restrict of 1 account-level subscription filter per account. Utilizing Kinesis Information Streams because the vacation spot additionally lets you have a number of log customers to course of the account log information when related. To create an account-level subscription filter, see Account-level subscription filters.
Log group-level subscription filters are utilized on every log group. This method works properly if you wish to retailer a subset of your log information in OpenSearch Service utilizing Kinesis Information Streams, and if you wish to use a number of totally different information streams to retailer and course of a number of log varieties. There’s a restrict of two log group-level subscription filters per log group. To create a log group-level subscription filter, see Log group-level subscription filters.
After you create your subscription filter, confirm that log information is being despatched to your Kinesis information stream. On the Kinesis Information Streams console, select the hyperlink on your stream identify.
Select a shard with Beginning place set as Trim horizon, and select Get information.
It’s best to see information with a singular Partition key column worth and binary Information column. It’s because CloudWatch sends information in .gzip format to compress log information.
Configure an OpenSearch Ingestion pipeline
Now that you’ve a Kinesis information stream and CloudWatch subscription filters to ship information to the information stream, you possibly can configure your OpenSearch Ingestion pipeline to course of your log information. To start, you create an AWS Id and Entry Administration (IAM) function that enables learn entry to the Kinesis information stream and browse/write entry to the OpenSearch area. To create your pipeline, your supervisor function that’s used to create the pipeline would require iam:PassRole
permissions to the pipeline function created on this step.
- Create an IAM function with the next permissions to learn out of your Kinesis information stream and entry your OpenSearch area:
- Give your function a belief coverage that enables entry from osis-pipelines.amazonaws.com:
For a pipeline to put in writing information to a website, the area should have a domain-level entry coverage that enables the pipeline function to entry it, and in case your area makes use of fine-grained entry management, then the IAM function must be mapped to a backend function within the OpenSearch Service safety plugin that enables entry to create and write to indexes.
- After you create your pipeline function, on the OpenSearch Service console, select Pipelines underneath Ingestion within the navigation pane.
- Select Create pipeline.
- Seek for Kinesis within the blueprints, choose the Kinesis Information Streams blueprint, and select Choose blueprint.
- Below Pipeline settings, enter a reputation on your pipeline, and set Max capability for the pipeline to be equal to the variety of shards in your Kinesis information stream.
In case you’re utilizing On-Demand mode for the information stream, select a capability equal to the present variety of shards within the stream. This use case doesn’t require a persistent buffer, as a result of Kinesis Information Streams offers a persistent buffer for the log information, and OpenSearch Ingestion tracks its place within the Kinesis information stream over time, stopping information loss on restarts.
- Below Pipeline configuration, replace the pipeline supply settings to make use of your Kinesis information stream identify and pipeline IAM function Amazon Useful resource Title (ARN).
For full configuration data, see . For many configurations, you should utilize the default values. By default, the pipeline will write batches of 100 paperwork each 1 second, and can subscribe to the Kinesis information stream from the newest place within the stream utilizing enhanced fan-out, checkpointing its place within the stream each 2 minutes. You may alter this habits as desired to tune how continuously the patron checkpoints, the place it begins within the stream, and use polling to cut back prices from enhanced fan-out.
- Replace the pipeline sink settings to incorporate your OpenSearch area endpoint URL and pipeline IAM function ARN.
The IAM function ARN have to be the identical for each the OpenSearch Servicer sink definition and the Kinesis Information Streams supply definition. You may management what information will get listed in several indexes utilizing the index definition within the sink. For instance, you should utilize metadata concerning the Kinesis information stream identify to index by information stream (${getMetadata("kinesis_stream_name")
), or you should utilize doc fields to index information relying on the CloudWatch log group or different doc information (${path/to/discipline/in/doc}
). On this instance, we use three document-level fields (data_stream.kind
, data_stream.dataset
, and data_stream.namespace
) to index our paperwork, and create these fields in our pipeline processor logic within the subsequent part:
Lastly, you possibly can replace the pipeline configuration to incorporate processor definitions to rework your log information earlier than writing paperwork to the OpenSearch area. For instance, this use case adopts Easy Schema for Observability (SS4O) and makes use of the OpenSearch Ingestion pipeline to create the specified schema for SS4O. This consists of including widespread fields to affiliate metadata with the listed paperwork, in addition to parsing the log information to make information extra searchable. This use case additionally makes use of the log group identify to establish totally different log varieties as datasets, and makes use of this data to put in writing paperwork to totally different indexes relying on their use instances.
- Rename the CloudWatch occasion timestamp to mark the noticed timestamp when the log was generated utilizing the rename_keys processor, and add the present timestamp because the processed timestamp when OpenSearch Ingestion dealt with the report utilizing the date processor:
- Use the add_entries processor to incorporate metadata concerning the processed doc, together with the log group, log stream, account ID, AWS Area, Kinesis information stream data, and dataset metadata:
- Use conditional expression syntax to replace the
data_stream.dataset
fields relying on the log supply, to regulate what index the doc is written to, and use the delete_entries processor to delete the unique CloudWatch doc fields that have been renamed: - Parse the log message fields to permit structured and JSON information to be extra searchable within the OpenSearch indexes utilizing the grok and parse_json
Grok processors use sample matching to parse information from structured textual content fields. For examples of built-in Grok patterns, see java-grok patterns and dataprepper grok patterns.
When it’s all put collectively, your pipeline configuration will seem like the next code:
- When your configuration is full, select Validate pipeline to examine your pipeline syntax for errors.
- Within the Pipeline function part, optionally enter a suffix to create a singular service function that shall be used to start out your pipeline run.
- Within the Community part, choose VPC entry.
For a Kinesis Information Streams supply, you don’t want to pick a digital non-public cloud (VPC), subnets, or safety teams. OpenSearch Ingestion solely requires these attributes for HTTP information sources which might be situated inside a VPC. For Kinesis Information Streams, OpenSearch Ingestion makes use of AWS PrivateLink to learn from Kinesis Information Streams and write to OpenSearch domains or serverless collections.
- Optionally, allow CloudWatch logging on your pipeline.
- Select Subsequent to overview and create your pipeline.
In case you’re utilizing account-level subscription filters for CloudWatch logs within the account the place OpenSearch Ingestion is operating, this log group must be excluded from the account-level subscription. It’s because OpenSearch Ingestion pipeline logs might trigger a recursive loop with the subscription filter that would result in excessive volumes of log information ingestion and price.
- Within the Overview and create part, select Create pipeline.
When your pipeline enters the Energetic state, you’ll see logs start to populate in your OpenSearch area or serverless assortment.
Monitor the answer
To keep up the well being of the log ingestion pipeline, there are a number of key areas to observe:
- Kinesis Information Streams metrics – It’s best to monitor the next metrics:
- FailedRecords – Signifies a problem in CloudWatch subscription filters writing to the Kinesis information stream. Attain out to AWS Assist if this metric stays at a non-zero degree for a sustained interval.
- ThrottledRecords – Signifies your Kinesis information stream wants extra shards to accommodate the log quantity from CloudWatch.
- ReadProvisionedThroughputExceeded – Signifies your Kinesis information stream has extra customers consuming learn throughput than provided by the shard limits, and it’s possible you’ll want to maneuver to an enhanced fan-out shopper technique.
- WriteProvisionedThroughputExceeded – Signifies your Kinesis information stream wants extra shards to accommodate the log quantity from CloudWatch, or that your log quantity is being inconsistently distributed to your shards. Ensure that the subscription filter distribution technique is about to random, and contemplate enabling enhanced shard-level monitoring on the information stream to establish sizzling shards.
- RateExceeded – Signifies {that a} shopper is incorrectly configured for the stream, and there could also be a problem in your OpenSearch Ingestion pipeline inflicting it to subscribe too typically. Examine your shopper technique for the Kinesis information stream.
- MillisBehindLatest – Signifies the improved fan-out shopper isn’t maintaining with the load within the information stream. Examine the OpenSearch Ingestion pipeline OCU configuration and ensure there are enough OCUs to accommodate the Kinesis information stream shards.
- IteratorAgeMilliseconds – Signifies the polling shopper isn’t maintaining with the load within the information stream. Examine the OpenSearch Ingestion pipeline OCU configuration and ensure there are enough OCUs to accommodate the Kinesis information stream shards, and examine the polling technique for the patron.
- CloudWatch subscription filter metrics – It’s best to monitor the next metrics:
- DeliveryErrors – Signifies a problem in CloudWatch subscription filter delivering information to the Kinesis information stream. Examine information stream metrics.
- DeliveryThrottling – Signifies inadequate capability within the Kinesis information stream. Examine information stream metrics.
- OpenSearch Ingestion metrics – For really useful monitoring for OpenSearch Ingestion, see Beneficial CloudWatch alarms.
- OpenSearch Service metrics – For really useful monitoring for OpenSearch Service, see Beneficial CloudWatch alarms for Amazon OpenSearch Service.
Clear up
Be sure to clear up undesirable AWS assets created whereas following this put up in an effort to stop extra billing for these assets. Comply with these steps to wash up your AWS account:
- Delete your Kinesis information stream.
- Delete your OpenSearch Service area.
- Use the DeleteAccountPolicy API to take away your account-level CloudWatch subscription filter.
- Delete your log group-level CloudWatch subscription filter:
- On the CloudWatch console, choose the specified log group.
- On the Actions menu, select Subscription Filters and Delete all subscription filter(s).
- Delete the OpenSearch Ingestion pipeline.
Conclusion
On this put up, you discovered methods to create a serverless ingestion pipeline to ship CloudWatch logs in actual time to an OpenSearch area or serverless assortment utilizing OpenSearch Ingestion. You need to use this method for quite a lot of real-time information ingestion use instances, and add it to present workloads that use Kinesis Information Streams for real-time information analytics.
For different use instances for OpenSearch Ingestion and Kinesis Information Streams, contemplate the next:
To proceed bettering your log analytics use instances in OpenSearch, think about using a number of the pre-built dashboards out there in Integrations in OpenSearch Dashboards.
Concerning the authors
M Mehrtens has been working in distributed programs engineering all through their profession, working as a Software program Engineer, Architect, and Information Engineer. Previously, M has supported and constructed programs to course of terrabytes of streaming information at low latency, run enterprise Machine Studying pipelines, and created programs to share information throughout groups seamlessly with various information toolsets and software program stacks. At AWS, they’re a Sr. Options Architect supporting US Federal Monetary clients.
Arjun Nambiar is a Product Supervisor with Amazon OpenSearch Service. He focuses on ingestion applied sciences that allow ingesting information from all kinds of sources into Amazon OpenSearch Service at scale. Arjun is all for large-scale distributed programs and cloud-centered applied sciences, and is predicated out of Seattle, Washington.
Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search functions and options. Muthu is within the matters of networking and safety, and is predicated out of Austin, Texas.