Data Engineering Study Report

Data engineering is the field dedicated to building data infrastructure to ingest, process, and store large amounts of data. This is a quickly growing field, with both the number of jobs in data engineering and the number of tools on the market steadily increasing.

Despite the popularity of data engineering as a field and the expanding market for paid and free data engineering tools, we couldn’t find a comprehensive, research-based report on the widespread popularity of different tools, what technologies companies are investing in the most, and which skills employers are seeking.

We evaluated the popularity of 59 data engineering tools using 3.5 billion data points to answer the question:

“What are the most popular data engineering tools?”

We did this by looking at tools that fell into five categories: Data Orchestration, Data Processing, Data Storage, Visualization, and Languages & Libraries.

What follows is a data-backed report on the most popular data engineering tools and what they tell us about the field at large.

Summary of key findings.

5 most popular data engineering tools. The most popular technologies are MongoDB, Tableau, Kubernetes, PostgreSQL, and Ansible.

340,000 job openings. There are over 340,000 open data engineering jobs.

Kubernetes highest paying specialty. Salaries for data engineering jobs span a 3-fold range. Kubernetes jobs pay the most, with salaries reaching $180,000, and Tableau jobs pay the least.

Companies opting for free data processing tools. Free data processing tools are preferred over paid alternatives 62% of the time.

Industry split on whether to pay for data storage. Data engineers are split on whether to pay for data storage tools, with the top four tools being an equal mix of paid and free tools.

Python most important language. Python is the most popular programming language, followed by Java and SQL.

Clear winners in the data orchestration space. Kubernetes and Ansible control 65% of the data orchestration space.

Companies preferring to pay for data visualization tools. The top 10 data visualization tools are all paid tools.

5 most popular technologies.

The five most popular data engineering tools include two data storage technologies (MongoDB and PostgreSQL), two data orchestration tools (Kubernetes and Ansible), and one data visualization tool (Tableau).

Taking a step back to look at the top 20 data engineering tools:

7 are data storage: MongoDB, PostgreSQL, Elasticsearch, Apache Hadoop, Splunk, Amazon Redshift, and OpenSearch.
5 are data orchestration: Kubernetes, Ansible, Terraform, Chef, and Puppet.
4 are data processing: Apache Spark, Apache Kafka, Segment, and Hive.
4 are for visualization: Tableau, Microsoft Power BI, Grafana, and Kibana.

What this breakdown tells us is that there is no one segment of data engineering that dominates the rest. All are important, demanding attention and resources.

Finally, a quarter of these data engineering tools are completely free to use: Hadoop, Kafka, Kubernetes, PostgreSQL, and Spark. Throughout the report, we will look more closely at where engineers prefer free versus paid tools.
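The category counts and the free-tool share can be checked with a short tally. The sketch below is only an illustration built from the tool lists in this section. The names `TOP_20`, `FREE_TOOLS`, `category_counts`, and `free_share` are hypothetical, and the "Apache" prefixes on Hadoop, Kafka, and Spark are added so the free-tool names match the category lists.

```python
# Top-20 tools grouped by category, copied from the lists in this section.
TOP_20 = {
    "data storage": ["MongoDB", "PostgreSQL", "Elasticsearch", "Apache Hadoop",
                     "Splunk", "Amazon Redshift", "OpenSearch"],
    "data orchestration": ["Kubernetes", "Ansible", "Terraform", "Chef", "Puppet"],
    "data processing": ["Apache Spark", "Apache Kafka", "Segment", "Hive"],
    "visualization": ["Tableau", "Microsoft Power BI", "Grafana", "Kibana"],
}

# Tools the report describes as completely free to use.
FREE_TOOLS = {"Apache Hadoop", "Apache Kafka", "Kubernetes",
              "PostgreSQL", "Apache Spark"}


def category_counts(tools_by_category):
    """Return the number of tools listed under each category."""
    return {category: len(tools) for category, tools in tools_by_category.items()}


def free_share(tools_by_category, free_tools):
    """Return the fraction of all listed tools that appear in free_tools."""
    all_tools = [tool for tools in tools_by_category.values() for tool in tools]
    free = [tool for tool in all_tools if tool in free_tools]
    return len(free) / len(all_tools)


counts = category_counts(TOP_20)
share = free_share(TOP_20, FREE_TOOLS)
print(counts)
print(f"{share:.0%} of the top 20 tools are free")
```

The tally counts tools only. It does not weight them by popularity, which is how the rest of the report ranks them.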

Here is the full breakdown of the popularity for all of the tools we researched, with the exception of the languages and libraries which will be discussed later on.

340,000 job openings for data engineers.

We reviewed 340,000 job postings. Of the top 20 data engineering tools showing up in job openings, 35% were data orchestration, 30% data storage, 29% data visualization, and 6% data processing.

And of the tools, Tableau and Kubernetes were the clear winners for showing up in the most job openings.

We also considered languages and libraries when reviewing job postings, and we included those in a section below.

We kept languages out of this particular chart because knowing languages like Java or Python is critical to using many of the technologies listed below. For instance, a data engineer using Apache Kafka would likely have some familiarity with Java, Scala, and Python.

The top twenty tools that employers are listing on job opening postings are: Tableau, Kubernetes, Ansible, Hadoop, Terraform, Splunk, Power BI, MongoDB, PostgreSQL, Elasticsearch, Puppet, Snowflake, Spark, Looker, Kafka, Redshift, Grafana, Kibana, Presto, and Google BigQuery.

Kubernetes highest paying specialty.

The number of job openings isn’t the only consideration for job seekers and employers looking to hire. Compensation is one of the most important considerations when choosing a field or job, and among the top 10 data engineering tools listed above, salaries range widely, from $60,000 for the lowest paying jobs up to $180,000 for the highest paying ones.

We found that Kubernetes, Elasticsearch, PostgreSQL, and Terraform offer the best compensation ranges with all salaries coming in above $100,000 and a large percentage above $140,000.

And while there is a plethora of job openings for Tableau and Power BI, they are two of the lowest paying specialties, with 79% of job openings offering less than $100,000 for both of them.

In the chart below we show what percent of jobs for each technology fall within each of the listed ranges: $60,000+ (blue), $100,000+ (red), $140,000+ (yellow), and $180,000+ (green).

Kubernetes is the only technology that has job openings with salaries above $180,000.

Companies opting for free data processing tools.

Together, Apache Spark and Apache Kafka dominate the data processing space with over 50% of the popularity of all data processing tools considered. Interestingly, these are both free, open source tools.

Companies and data engineers aren’t seeing a widespread need to pay for data processing tools. In this area, companies are investing in employees and/or consultants that are experts in free, open source technologies.

Industry split on whether to pay for data storage.

Of the 11 data storage tools evaluated, the most popular was MongoDB, a paid tool, followed by PostgreSQL, a free tool. The popularity of completely free and paid tools is fairly evenly split in the area of data storage, with paid tools preferred 59% of the time.

Some tools, like Elasticsearch, have both paid and free options. We bucketed Elasticsearch and others with similar freemium options into paid tools because some functionality isn’t available in the free versions.

One interesting technology to see in the top 20 is OpenSearch, an open source tool created as a fork of Elasticsearch version 7.10. Only a little over a year after its launch, OpenSearch is already competing with some of the more established technologies.

Python is the most popular language.

Python is the most popular language, closely followed by Java and then SQL. The importance of language skills is driven home by open job listings, with Python, SQL, and Java currently listed in 550,000 job openings.

We looked at six languages and libraries used for data engineering work. Python was the most popular, preferred 38% of the time, followed closely by Java (33%) and then SQL (22%).

Unsurprisingly, more niche libraries and languages such as Pandas (4%), Scala (2%), and Julia (1%) are used less often.
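As a quick consistency check on these figures, the six shares can be summed and the combined weight of the top three computed. This is only a sketch using the percentages quoted above. The names `LANGUAGE_SHARE` and `top_n_share` are hypothetical and are not part of the report's methodology.

```python
# Popularity shares (in percent) for the six languages and libraries,
# as quoted in this section.
LANGUAGE_SHARE = {
    "Python": 38,
    "Java": 33,
    "SQL": 22,
    "Pandas": 4,
    "Scala": 2,
    "Julia": 1,
}


def top_n_share(shares, n):
    """Return the combined share of the n most popular entries."""
    ranked = sorted(shares.values(), reverse=True)
    return sum(ranked[:n])


total = sum(LANGUAGE_SHARE.values())
top_three = top_n_share(LANGUAGE_SHARE, 3)
print(f"All six: {total}%, top three: {top_three}%")
```

The six shares add up to 100%, and Python, Java, and SQL together account for 93% of the total.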

Kubernetes and Ansible favorites of the data orchestration space.

Data orchestration tools are quickly becoming a must for stateful applications within data infrastructure. Data engineers are choosing to use data orchestration tools because they can reduce development time, improve scalability, and assist with handling multiple cloud environments.

Of the six data orchestration tools we studied, Kubernetes is currently the clear winner, followed by Ansible.

While both tools fall under our data orchestration category, they are used differently. Kubernetes is used for managing containers and maintaining container health. Ansible is used to deploy changes, apply configuration, and manage updates and deployments.

We also looked at the popularity of Puppet and SaltStack, which came in at 5% and 1%.

Companies preferring to pay for visualization tools.

Companies and their data teams are seeing the value in using paid data visualization tools, with Tableau the clear favorite.

Data visualization is important for pulling insights out of collected and processed data. Companies use the visualized data to identify patterns and trends to assist with decision making. This use of data visualization is often referred to as business intelligence.

Some products, like Kibana, have free versions available for use. However, those versions do not include all of the functionality of the paid product. For this reason we consider Kibana and similar products to be paid products, not free products.

We looked at 13 visualization tools altogether. Six of them didn’t have enough popularity to show on the final chart. Those products include Periscope, IBM ELM, Logilica, Databank, OpenSearch Dashboards, and Allstacks.

Having less broad popularity doesn’t necessarily reflect the quality of a product. Some products, like Tableau, can be applied to many different use cases, whereas niche products like IBM Engineering Lifecycle Management are designed for a specific use.

Summary and conclusion.

Thanks for taking the time to read our report. We hope it was informative. For details on our research methods, check out our methods page linked here.

Now it’s your turn:

What do you think the most popular tools are based on your experience?

Any tools we didn’t include that you would like to see in the next report?

Or maybe you have a question about the results.

Either way, we’d like to hear from you. Leave a comment below or reach us at @dattell_support on Twitter.

