OpenSearch Shard Optimization


Published September 2022 Optimizing OpenSearch for shard size is an important component for achieving maximum performance from your cluster. OpenSearch shards enable parallelization of data processing across both a single node and multiple OpenSearch nodes. OpenSearch automatically manages the allocation of shards within the nodes. However, choosing the number of shards needed is up to … Continue reading OpenSearch Shard Optimization

OpenSearch Terms and Definitions


Published September 2022 In this post we round up the most searched for OpenSearch terms and definitions. OpenSearch Node An OpenSearch node is a single OpenSearch process, and the minimum number of nodes for a highly available OpenSearch cluster is three.  OpenSearch Cluster An OpenSearch cluster is one or more OpenSearch nodes with the same … Continue reading OpenSearch Terms and Definitions

Apache Pulsar Support FAQ


Published September 2022 There are common questions that new clients have about Apache Pulsar support services. Below are a few of the most common questions new clients ask when they reach out. Part A: Technical Questions We are encountering Pulsar scaling issues. Can you help? We routinely scale and optimize Pulsar for … Continue reading Apache Pulsar Support FAQ

Vector Search for OpenSearch


Published August 2022 OpenSearch includes a plugin for vector search.  In this post, we introduce vector search and compare the different methods available. We will also point you in the right direction for example code.   For  personalized help, contact us to learn more about our OpenSearch support services. What is vector search? Here’s the … Continue reading Vector Search for OpenSearch

Kafka vs Pulsar


Published August 2022 Pulsar and Kafka achieve the same result. They both guarantee messages reach their intended destination(s). Yet, there are important differences between the two message queues. These differences can make one of the technologies a better fit, depending on your use case. In this post we cover 8 ways in which Apache Kafka … Continue reading Kafka vs Pulsar

Preparing for a Cloud Outage


Published August 2022 Nearly all of our clients and a majority of companies are using the cloud for at least a portion of their infrastructure.  It’s important for companies to plan for cloud outages to minimize the damage caused by them. In this post we will cover how to minimize damage and recover quickly after … Continue reading Preparing for a Cloud Outage

OpenSearch vs. Elasticsearch


Published August 2022 With OpenSearch originating as a fork of Elasticsearch, the two search engines can appear nearly identical to the unacquainted. However, they are distinct, and becoming more so with each new update. Here we will discuss how the two search engines compare when it comes to security, licensing, core features, documentation, community support, dashboards, … Continue reading OpenSearch vs. Elasticsearch

Elasticsearch Support Services FAQ


Published July 2022 Our team of engineers has been architecting, optimizing, and managing Elasticsearch for over 6 years. We’ve found that there are common questions that new clients have about Elasticsearch support services. Below are a few of the most common questions new clients ask when they reach out. Let us … Continue reading Elasticsearch Support Services FAQ

How to Prevent a Kafka Outage


Published June 2022 Apache Kafka is a highly reliable tool when configured correctly for your use case.  It should be the piece of your data architecture that you can be sure will remain online.   Here we put together eight important best practices to help shore up your Kafka implementation. 8 Tips to Prevent Kafka Downtime … Continue reading How to Prevent a Kafka Outage

Data Engineering Study


Published June 27, 2022 Data engineering is the field dedicated to building data infrastructure to ingest, process, and store large amounts of data.  This is a quickly growing field, with both the number of jobs in data engineering and the number of tools on the market steadily increasing.  Despite the popularity of data engineering as … Continue reading Data Engineering Study

What is a Virtual CIO?


Published June 2022 Virtual CIOs provide the leadership and expertise to build, grow, and maintain reliable data architecture.  They are often hired by midsized companies that are looking for a trusted authority to drive data architecture and the supporting team. Virtual CIOs are also referred to as vCIOs, fractional CIOs, part-time CIOs, and CIOs for … Continue reading What is a Virtual CIO?

What is OpenSearch?


Updated May 2022 OpenSearch is open source search and analytics software. It’s a community-led project with Amazon Web Services (AWS) leading the development. It was first created as a fork of Elasticsearch 7.10.2 and Kibana 7.10.2 in 2021. The OpenSearch search engine is simply referred to as OpenSearch, and the dashboard tool is … Continue reading What is OpenSearch?

Kafka on Kubernetes


Updated August 2022 More and more companies are coming to us specifically for assistance with deploying and managing Apache Kafka on Kubernetes.  With many teams already familiar with Kubernetes and its ability to orchestrate infrastructure, it can sometimes be the best choice to spin up Kafka servers on Kubernetes alongside their other applications. Kafka on … Continue reading Kafka on Kubernetes

Elasticsearch Basics: What it is, Licensing, Languages, and Getting Help


Updated July 2022 Elasticsearch is a distributed search and analytics engine.  It is built on top of Apache Lucene.   Elasticsearch was first released in 2010 by the company now known as Elastic.  It was originally completely open source, but recent license changes have limited its usage. More on that below. Elasticsearch is part of a … Continue reading Elasticsearch Basics: What it is, Licensing, Languages, and Getting Help

Hosted Apache Pulsar: Why managed Pulsar on your environment is a better choice


Updated June 2022 One of the attractions of hosted Apache Pulsar is the peace of mind that a third party is responsible for ensuring uptime.  However, that conclusion doesn’t consider what a company loses by using a third party hosted service.  Fully managed Pulsar services, hosted directly on your internal environment (cloud or on-prem), still … Continue reading Hosted Apache Pulsar: Why managed Pulsar on your environment is a better choice

What is Kafka Connect?


Updated March 2022 Kafka Connect is a free tool for efficiently moving data into and out of Apache Kafka.  Kafka Connect simplifies streaming data while also improving scalability and reliability. Features of Kafka Connect Standardizes integrations with Kafka.  Kafka Connect provides a shared framework for all Kafka connectors, which improves efficiency for connector development and … Continue reading What is Kafka Connect?

BookKeeper for Pulsar


Updated June 2022 As discussed in a previous article, “What is Apache Pulsar?”, Pulsar is a two-layer system with Pulsar brokers acting as the serving layer and Apache BookKeeper bookies providing the persistent storage layer.  In this post we will review BookKeeper’s role, important terminology, and an introduction to configuring Ledgers. Apache BookKeeper Basics BookKeeper … Continue reading BookKeeper for Pulsar

Subscription Types in Apache Pulsar


Updated July 2022 Apache Pulsar is a publish-subscribe distributed messaging system.  When consumers subscribe to topics in Pulsar, there are four different types to choose from:  Exclusive, Failover, Shared, and Key_Shared.  In this article we will review the different subscription types and what factors to consider when choosing between them. If you are interested in … Continue reading Subscription Types in Apache Pulsar

What is Apache Pulsar?


Updated July 2022 Apache Pulsar is an open source, publish-subscribe messaging system.  It’s unique because of its two-layer system where the serving and storage layers are separated. Pulsar runs with two supporting technologies, Apache BookKeeper and Apache ZooKeeper.  The three technologies together provide a high throughput, low latency distributed messaging system. Pulsar Broker – Serving … Continue reading What is Apache Pulsar?

Solr vs Elasticsearch


Updated September 2021 Both Apache Solr and Elasticsearch are popular open source* search engines built on top of Lucene.  This article is intended to help readers learn more about the technologies in relation to one another to guide technology decisions. * Check out this article for information about recent Elasticsearch licensing changes.  Elasticsearch is no … Continue reading Solr vs Elasticsearch

How to Index Elasticsearch


Updated January 2021 An Index in Elasticsearch is used to both organize and distribute data within a cluster. In this post we will define both components of an Index and then outline how to create, add to, delete, and reindex Indices in Elasticsearch. We will also touch on querying, but querying will be covered in … Continue reading How to Index Elasticsearch

Kafka Uses Consumer Groups for Scaling Event Streaming


Updated July 2022 Apache Kafka is a distributed messaging system that implements pieces of the two traditional messaging models, Shared Message Queues and Publish-Subscribe.  Both Shared Message Queues and Publish-Subscribe models present limitations for handling high throughput use cases.   Apache Kafka provides fault tolerant, high throughput stream processing that can handle even the most complicated … Continue reading Kafka Uses Consumer Groups for Scaling Event Streaming

Kafka Case Studies


Updated February 2022 Apache Kafka’s high throughput and high availability make its applications vast. Here we dive into eight Kafka case studies. These accounts are taken from work our Kafka solutions architects / Kafka consultants have done in the field with our clients. Medical Manufacturing Company automating the drug manufacturing process with multiple machines needs … Continue reading Kafka Case Studies

Elasticsearch Definitions


Updated September 2022 Taking a break from Elasticsearch optimization posts to get back to the basics and define fundamental Elasticsearch concepts. Elasticsearch Terms and Definitions Elasticsearch Node. An Elasticsearch node is a single Elasticsearch process, and the minimum number of nodes for a highly available Elasticsearch cluster is three. Continue reading about Elasticsearch Nodes. Elasticsearch … Continue reading Elasticsearch Definitions

Kafka Definitions


Updated July 2022 Taking a break from Kafka optimization posts to get back to the basics of Apache Kafka and define fundamental Kafka concepts. Kafka Definitions:  A Primer for Apache Kafka Fundamentals Kafka Producer.  A Kafka producer is a standalone application, or addition to your application, that sends data to Kafka broker(s). Kafka Broker.  A … Continue reading Kafka Definitions

Kafka Consumer Optimization


Updated December 2021 Kafka Consumer’s Role. The role of the Kafka consumer is to read data from Kafka.  Kafka consumer optimization can help avoid errors and increase performance of your application.   While the focus of this blog post is on the consumer, we will also review several broker configurations which affect the performance of consumers. Top … Continue reading Kafka Consumer Optimization

What is a Kafka Topic?


Updated April 2022 Kafka topics are the categories used to organize messages. Each topic has a name that is unique across the entire Kafka cluster. Messages are sent to and read from specific topics.  In other words, producers write data to topics, and consumers read data from topics. Kafka topics are multi-subscriber.  This means that … Continue reading What is a Kafka Topic?

Open Source Monitoring for Kafka


Updated December 2021 Monitoring is a critical component of ensuring Kafka uptime and maintaining peak performance. Open source monitoring of disk performance, memory usage, CPU, network traffic, and load allows you to identify abnormal metrics in real time and address potential issues before a performance dip or outage occurs. In other words, monitoring Apache Kafka … Continue reading Open Source Monitoring for Kafka

Load Balancing With Kafka


Updated July 2022 What is Kafka load balancing? Load balancing with Kafka is a straightforward process and is handled by the Kafka producers by default. While it isn’t traditional load balancing, it does spread out the message load between partitions while preserving message ordering. Round-robin approach: By default, producers choose the partition assignment for each … Continue reading Load Balancing With Kafka

Kafka Use Cases


Updated April 2021 Apache Kafka is a high-throughput, open source message queue used by Fortune 100 companies, government entities, and startups alike. Part of Kafka’s appeal is its wide array of use cases. In this post we will outline several of Kafka’s use cases from event sourcing to tracking web activities to metrics and more. … Continue reading Kafka Use Cases

Performance Tuning for Apache Kafka


For Apache Kafka performance tuning, measure latency and throughput for your Kafka implementation. Latency is the measure of how long it takes Kafka to process a single event. Throughput is the measure of how many events arrive within a particular period of time.
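
As a rough illustration of the two measurements, here is a minimal Python sketch that computes them from recorded timestamps. It does not talk to Kafka at all: the `EventTiming` record and both helper functions are hypothetical names introduced for this example, and it assumes you have already captured, for each event, when it arrived and when processing finished.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventTiming:
    """Hypothetical record of one event's timestamps, in seconds."""
    arrived_at: float    # when the event arrived
    processed_at: float  # when processing of the event finished


def average_latency(events: List[EventTiming]) -> float:
    """Mean time taken to process a single event, in seconds."""
    if not events:
        raise ValueError("need at least one event to compute latency")
    total = sum(e.processed_at - e.arrived_at for e in events)
    return total / len(events)


def throughput(events: List[EventTiming], window_start: float, window_end: float) -> float:
    """Events arriving per second within [window_start, window_end)."""
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    arrived = sum(1 for e in events if window_start <= e.arrived_at < window_end)
    return arrived / (window_end - window_start)


# Illustrative timings only, not real measurements.
sample = [
    EventTiming(arrived_at=0.0, processed_at=0.5),
    EventTiming(arrived_at=1.0, processed_at=1.25),
    EventTiming(arrived_at=2.5, processed_at=2.75),
]
latency_seconds = average_latency(sample)
events_per_second = throughput(sample, window_start=0.0, window_end=2.0)
```

Tracking both numbers together matters because a change that improves one can worsen the other, so a tuning change is easier to judge when the two are compared before and after.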

Elasticsearch Shards — Definitions, Sizes, Optimizations, and More


Updated September 2022 Optimizing Elasticsearch for shard size is an important component for achieving maximum performance from your cluster. To get started let’s review a few definitions that are an important part of the Elasticsearch jargon. If you are already familiar with Elasticsearch, you can continue straight to the next section. Defining Elasticsearch Jargon:  Cluster, … Continue reading Elasticsearch Shards — Definitions, Sizes, Optimizations, and More

Elasticsearch Optimization for Small, Medium, and Large Clusters


Updated July 2022 The way nodes are organized in an Elasticsearch cluster changes depending on the size of the cluster. For small, medium, and large Elasticsearch clusters there will be different approaches for optimization. Dattell’s team of engineers are experts at designing, optimizing, and maintaining Elasticsearch implementations and supporting technologies. Find out more about our … Continue reading Elasticsearch Optimization for Small, Medium, and Large Clusters