Data Engineering Study Research Methods

This page outlines the research methods for our Data Engineering Study, published on our website in June 2022.

Tool Inclusion

The 59 tools included in the study were chosen by first reviewing the “top data engineering tools” lists found on Google. We included every tool named in the lists that appeared on the first page of search results. Our team then reviewed the resulting list and added tools to it based on our 10+ years of experience in the data engineering field.

Collecting Data

Data was collected from Google, Indeed, LinkedIn, GitHub, and StackShare. All data points were collected between June 7, 2022 and June 13, 2022. For some searches, tool names had to be qualified to return accurate results. For instance, Pandas was searched on Google as “Pandas Library” and Presto as “Presto Software”. Because GitHub is a more specialized search engine, those qualifiers were dropped there.
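The qualification rule above can be sketched as a small lookup. This is only an illustration: the function and dictionary names are hypothetical, the two qualified terms are the ones quoted above, and the sketch assumes qualifiers applied only to Google searches, since we do not list here which other sources needed them.

```python
# Qualified Google search terms quoted in the text above.
# The dictionary and function names are hypothetical, for illustration only.
GOOGLE_QUALIFIED_TERMS = {
    "Pandas": "Pandas Library",
    "Presto": "Presto Software",
}


def search_term(tool: str, source: str) -> str:
    """Return the term to search for a tool on a given source.

    Assumption: qualifiers are applied only on Google; every other
    source (including GitHub) uses the plain tool name.
    """
    if source.lower() == "google":
        return GOOGLE_QUALIFIED_TERMS.get(tool, tool)
    return tool


if __name__ == "__main__":
    print(search_term("Pandas", "Google"))
    print(search_term("Pandas", "GitHub"))
```

A lookup table keeps the exceptions explicit, so any tool without an entry is searched under its plain name.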

On LinkedIn, we collected the number of personal profiles that included a particular search term. On StackShare, we collected the number of stacks and the number of followers for each term. We found that the two metrics trended together, so for evaluation and comparison purposes we used only the number of StackShare followers.

Normalization of Data

To compare the different technologies, we first normalized each individual metric (Google search results, Indeed job openings, etc.) and then added the normalized values together. This was done so that each metric carried the same weight in determining a tool’s popularity. The highest possible score for any technology was 4.
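As a rough illustration of the normalize-then-sum approach, the sketch below scales each metric by its largest value and adds the results. The scaling method is an assumption, since the exact normalization is not specified here; the metric names, tool names, and counts are made up for the example and are not study data.

```python
from typing import Dict


def normalize_by_max(values: Dict[str, float]) -> Dict[str, float]:
    """Scale one metric so its largest value becomes 1.0.

    Assumption: divide-by-maximum scaling; the study's exact
    normalization method is not stated.
    """
    top = max(values.values())
    if top == 0:
        return {tool: 0.0 for tool in values}
    return {tool: value / top for tool, value in values.items()}


def popularity_scores(metrics: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum the normalized metrics so that each metric weighs the same."""
    scores: Dict[str, float] = {}
    for values in metrics.values():
        for tool, normalized in normalize_by_max(values).items():
            scores[tool] = scores.get(tool, 0.0) + normalized
    return scores


# Made-up counts for two hypothetical tools across four hypothetical metrics.
example_metrics = {
    "metric_a": {"tool_x": 200, "tool_y": 100},
    "metric_b": {"tool_x": 50, "tool_y": 25},
    "metric_c": {"tool_x": 10, "tool_y": 40},
    "metric_d": {"tool_x": 8, "tool_y": 2},
}

scores = popularity_scores(example_metrics)
print(scores)
```

With four metrics each scaled to a maximum of 1, no tool can score above 4, and a tool reaches 4 only by leading every metric.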

Free vs. Paid Delineation

Classifying tools as free or paid can be difficult, as many tools have both free and paid versions. For instance, Ansible and Elasticsearch can be used for free, but they are not open source tools, and the free versions are missing functionality that can be important to users. For that reason, paid tools that offer a free version were considered paid rather than free. The tools listed as free are ones that are always free.

Other Questions

If you have any additional questions about data collection, please contact