Design Patterns for Large Data Pipelines with ELK Stack

The popularity of data pipelines has grown with the increased use of advanced IoT (Internet of Things) device networks. IoT devices can produce anything from user data to location information. Today, more companies are capturing this data to determine what insights it can reveal about their business that they may not have known previously, and how those discoveries can affect their bottom line.

This is where data pipelines play a fundamental role in turning raw data into actionable analytics for your business. Combining several key technologies can accomplish the goal of scalable, expandable data collection. I will cover two of the building blocks of data pipelines and discuss how we can leverage them. The diagram below is a very high-level view of a simple IoT design.

Building Block 1: Kafka as a First-Level Data Repository

Apache Kafka provides one of the key building blocks for constructing flexible pipelines that can be put into practice in the real world. Kafka is considered the next generation of message queuing. Here is what makes this technology key to building your pipeline.

Kafka messaging adds the power of distributed computing and data replication for data resiliency. In enterprise Kafka deployments, your topics can be partitioned across several servers (brokers), and your data can also be replicated to ensure resiliency.
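To make this concrete, here is a minimal sketch using Kafka's Java AdminClient that creates a partitioned, replicated topic. The broker addresses and the iot-events topic name are hypothetical placeholders, not part of any specific deployment.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateIotTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker addresses; replace with your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions spread the data across brokers; a replication
            // factor of three keeps copies on three brokers for resiliency.
            NewTopic topic = new NewTopic("iot-events", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```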

Three Main Elements of Pipeline Flexibility

In many cases, once you have an active data pipeline, it becomes a living, breathing thing, and change becomes the norm when terabytes of data are constantly flowing into your data centers. Active pipelines are designed to run 24/7 with continuous data ingestion, and in many cases this data is actively monitored in real time through data visualization. Real-time monitoring is common for larger IoT networks, so robust data decoupling becomes a must. To address these concerns, pipeline flexibility plays a key role.

The first element of flexibility is to decouple the consumption of data from its production at the previous stage. With a simple configuration change, it is possible to reprocess the backlog of data that has already been ingested while continuing to process new data as it becomes available from a source system.

Restartability is another element of flexibility. If a load fails, data processing should not have to start over from scratch; it is very helpful to be able to restart the processing from the exact place at which it left off.
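Kafka makes this possible because it tracks a committed offset per consumer group. Below is a minimal sketch of a restartable loader, assuming a hypothetical iot-events topic, a pipeline-loader group, and a load() helper that stands in for the next pipeline stage; because offsets are committed only after a successful load, a restarted consumer resumes exactly where the last commit left off.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RestartableLoader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pipeline-loader");   // offsets are tracked per group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // commit only after a successful load
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("iot-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    load(record.value()); // hypothetical load step; if it throws, nothing is committed
                }
                consumer.commitSync();    // on restart, consumption resumes from the last committed offset
            }
        }
    }

    private static void load(String value) { /* hand the record to the next pipeline stage */ }
}
```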

Because the consumer is decoupled, the ability to stop and restart the process while having it resume ingesting data from the point at which it was stopped is invaluable. The data lands in Kafka, which effectively becomes the staging area, taking advantage of the "durable buffer" concept that Kafka provides. This also allows you to set up consumers of the data and add new consumers at any point you want, while the original consumers remain untouched.

The ability to repeatedly replay the ingest phase of a pipeline into multiple different consumers, with no change to the source configuration, is invaluable. One great thing about Kafka is its incredibly low latency, which means you can catch up on past records, replay them, and stream live ones with real-time results.

The best part is that you can add new consumers of the data in Kafka whenever you want; you don't even need to know about them at the time you extract the source data. Each consumer can pick up from the end of the feed or replay the entire backlog.
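The sketch below illustrates that choice: each new consumer group decides for itself, via auto.offset.reset, whether to replay the retained backlog or only tail new records. The group names and topic are hypothetical examples, not names from any particular deployment.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NewConsumerGroup {
    // Builds a consumer for a brand-new group. Because the group has no
    // committed offsets yet, auto.offset.reset decides where it starts:
    // "earliest" replays the entire retained backlog, "latest" picks up
    // only new records from the end of the feed.
    static KafkaConsumer<String, String> newGroup(String groupId, String startFrom) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, startFrom);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("iot-events"));
        return consumer;
    }

    public static void main(String[] args) {
        // Hypothetical groups: one replays everything, one only tails new data.
        KafkaConsumer<String, String> replayAll = newGroup("geoip-enricher", "earliest");
        KafkaConsumer<String, String> tailOnly  = newGroup("dashboard-feed", "latest");
        // Poll each consumer independently; the original consumers are untouched.
    }
}
```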

There are several situations in which this capability can be used effectively. For example, you can refine the data processing by adding a GeoIP lookup step using Kafka's Streams API (Kafka Streams). In each case, the original consumer remains untouched and running.
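A minimal Kafka Streams sketch of such an enrichment step is shown below. The topic names and the addGeoIp() helper are hypothetical stand-ins (in practice you would call a real GeoIP database such as MaxMind), and the enriched events are written to a new topic so existing consumers keep running unchanged.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class GeoIpEnricher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "geoip-enricher"); // gets its own consumer group and offsets
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("iot-events");
        // Enrich each event with a GeoIP lookup and write to a new topic;
        // the original iot-events topic and its existing consumers are untouched.
        raw.mapValues(GeoIpEnricher::addGeoIp).to("iot-events-enriched");

        new KafkaStreams(builder.build(), props).start();
    }

    // Hypothetical enrichment: in practice this would resolve the device's IP
    // against a GeoIP database and append the location to the event payload.
    private static String addGeoIp(String event) {
        return event + ",geoip=unknown";
    }
}
```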

Kafka enables you to batch or stream data from the source once and consume it with multiple heterogeneous applications at different times. It also provides offset tracking that is distinct for each consuming application. Processing can also be re-run, which turns out to be extremely useful for both production data and the development process.
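As a small illustration of re-running, an existing consumer group can be rewound to the start of a topic, and the retained history will be re-delivered on subsequent polls. The group and topic names below are the same hypothetical placeholders used earlier.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RerunProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pipeline-loader");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("iot-events"));
            consumer.poll(Duration.ofSeconds(1));            // join the group and receive partition assignments
            consumer.seekToBeginning(consumer.assignment()); // rewind this group to the start of the topic
            // Subsequent poll() calls re-deliver the retained history for reprocessing.
        }
    }
}
```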

Building Block 2: ELK Cluster for Heavy Data Ingestion (One Terabyte Plus)

As IoT networks grow, they can produce terabytes of data per day. This is where Elasticsearch cluster design comes into play. Building out an Elasticsearch cluster for these use cases is nontrivial, and a lot of thought and planning should go into it. In my next article, I will go deeper into terabyte-capable Elasticsearch cluster designs and share some of the insights I have gained standing up a 500 TB Elasticsearch cluster.

With Logstash's Kafka input plugin, you can configure Logstash to connect to your data topics on the Kafka brokers and forward the messages it receives to Elasticsearch. This is a very common ingestion pattern in the ELK world, but for data sets in the terabyte range we have used the 24 data node Ingestion Pattern (by Douglas Miller, Weblink Technology).
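In production this leg is handled by Logstash's Kafka input and Elasticsearch output plugins; purely to make the mechanics concrete, here is a hedged Java sketch of the same Kafka-to-Elasticsearch flow using the Kafka consumer and the Elasticsearch low-level REST client. The broker, node, index, and topic names are hypothetical, and each record value is assumed to already be a JSON document.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class KafkaToElasticsearch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "es-ingest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestClient es = RestClient.builder(new HttpHost("es-node1", 9200, "http")).build()) {

            consumer.subscribe(Collections.singleton("iot-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                // Build one _bulk request per poll, assuming each record value is a JSON document.
                StringBuilder bulk = new StringBuilder();
                for (ConsumerRecord<String, String> record : records) {
                    bulk.append("{\"index\":{\"_index\":\"iot-events\"}}\n")
                        .append(record.value()).append('\n');
                }
                Request request = new Request("POST", "/_bulk");
                request.setJsonEntity(bulk.toString());
                es.performRequest(request);

                consumer.commitSync(); // commit offsets only after Elasticsearch accepted the batch
            }
        }
    }
}
```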

In a full-blown system there are obviously many more moving parts than in this simple example, but the point remains the same.

____________________________________________________________________________

WOULD YOU LIKE TO LEARN MORE?

If you're interested in learning more, send us an email at douglas@weblinktechs.com, and get in touch with Weblink if you have tough Elasticsearch problems you need help with!

Expert Elasticsearch Consulting and Implementation Services

Weblink Technologies, a leader in Elasticsearch products, provides solutions built solely on Elasticsearch. As an Elastic partner and reseller, we have worked with many customers across the globe to provide expert consulting and implementation for Elasticsearch, Logstash, Kibana (ELK), and Beats. Whether you are using Elasticsearch for a web-facing application, your corporate intranet, or a search-powered big data analytics platform, our Elasticsearch experts bring end-to-end services that support your search and analytics infrastructure, enabling you to maximize ROI.

  • Elasticsearch consulting and strategy planning
  • Search application assessment
  • Elasticsearch, Logstash, Kibana, and Beats (Elastic Stack) implementation
  • Search relevancy review and improvement
  • Full support and managed services (on-site and remote)

Contact us at sales@weblinktechs.com to learn more about how we can help you leverage Elastic products for high-performing, easy-to-maintain, and scalable search and analytics solutions.