Kuwala.io

The complexity of data projects

Eight years ago, companies started to look at how to make better business decisions using the data they collect. At that time, I started my career as a data scientist and heard for the first time that there was a shortage of data scientists and software developers. Today, in 2022, that picture has become even more dramatic. Demand for data and engineering talent is twice as high as the available supply. Salaries for data talent exploded, and data science was named the “sexiest job in the world” twice in a row. And the demand for developers continues to grow: new frameworks, new apps, and new ideas require complex solutions.

Two years ago, in Rasa’s office in Kreuzberg, Berlin, one of my favorite data bloggers told me that the “data science fallout winter” was coming. What he meant was that companies were disappointed because data projects rarely ended in success. From my observations, this was due to several reasons:

It took a long time for a manager to align with a data scientist and a software engineer in a way that translated business goals into a data strategy and a data project. The feedback loops were too long.
Expectations were set too high and timelines were planned far too tightly. A data project cannot be treated like a normal software project, which is already fairly complex.
Data collection and cleansing tied up a large share of the resources.

Then, on a VentureBeat panel, the often-quoted figure was put on the table: 85% of data projects fail. The symptoms of the data science fallout winter were showing.

The renaissance of open-source led by the modern data stack

While companies were a bit timid about investing in data science and business intelligence projects, ambitious solutions were emerging among software developers. GitHub was the place where big solutions were born, a safe space for engineers with a vision. On GitHub, some of the hottest frameworks in the data space quietly emerged: Elasticsearch (search and analytics database), Airflow (data pipelines), dbt (transformations), Meltano (data extraction), and most recently Airbyte (data extraction). These tools are often summarized under the buzzword “modern data stack”. They run flexibly on a data warehouse like Snowflake, the setup is relatively simple, and the services can be combined to cover the processing and analysis of data end to end.

The Modern Data Stack Landscape drawn by a16z

But why open-source? Why hasn’t Microsoft or another tech giant owned and branded the modern data stack? The answer is simple, and the realization has been world-changing: data projects are arguably the most complex task in software development. Every data project is an edge case, even when the business question is the same. Data from a wide variety of sources and of varying quality rushes into enterprises at an even wider variety of frequencies. The data must be merged, and the methods of analysis and interpretation are as diverse as flora and fauna. Software this complex can only be built by many developers with different qualifications. So many, in fact, that even a tech giant could not have built the foundation for the modern data stack without neglecting its core business.

What did the modern data stack change for companies?

Meanwhile, on the business side, more and more companies are adopting the modern data stack with a data warehouse at the center of business intelligence (BI). I am very happy to follow this development because it allows companies to work with data in a lean and agile way. But one problem remains: companies still have too few developers, and in data projects, managers and engineers still don’t speak the same language. As a result, the role of the business analyst has become more and more relevant. The BI analyst is the mediator between both parties: someone who understands the fundamentals of data analysis and the objective of increasing the company’s ROI but lacks technical coding skills.

Google Trends Evolution for BI Analysts and No Code Application

Startups emerged in 2021 with the promise of connecting data and creating analytics without coding skills, and of moving companies forward in digital transformation and data literacy. These startups build on the open modern data stack and give it a user-friendly wrapper. Some even advertise prominently that their UI replaces dbt, Airbyte, Snowflake, and Airflow. The truth, however, is that all of these technologies run in the background of the software, which the startups resell under their own license. The modern data stack, which was developed to be freely available, is now being sold to customers as a SaaS tool.

If you add up the market cap of closed SaaS solutions and compare it over time with the market cap of open-source solutions, you can see that the market gains for SaaS are decreasing, while open-source solutions show very steep growth. Early open-source companies like Red Hat, Databricks, and Elastic have grown into multi-billion-dollar businesses. OSS Capital puts it this way: “Open-source is eating software faster than software is eating the world”.

For the vendor, the closed SaaS model has two drawbacks:

Each use case must be built by the vendor’s internal development team. This means that the focus can only be on a few simple, top use cases.
Sales run through sales reps and conventional marketing, so scalability is limited.

→ High costs for sales and product development

For the customer, the drawbacks are:

The tool cannot be adapted to the individual use case.
The lock-in effect is high.
New users need training.

→ Poor cost-benefit ratio for the customer in the long run

Market Cap for OSS and SaaS by OSS Capital

In the end, we see a frustrated customer who uses a tool because it was bought once, even though the results are not satisfying. This issue is well illustrated by the example of Airbyte vs. Fivetran. Fivetran entered the market 10 years ago and helps software engineers extract data from a source and load it into a destination, e.g., a data warehouse. Over time, many new data sources evolved from SaaS tools such as advertising tools (Facebook), CRM, and reporting tools. Today, the number of SaaS tools exceeds 10,000, and with Fivetran’s closed approach, it was only possible to maintain and connect about 1% of these data sources. Building and maintaining connectors internally is heavy lifting. It is also very costly, which is reflected in the subscription prices. As a result, customers were increasingly frustrated, paying a high price for limited usability.

Airbyte recognized this problem 1.5 years ago. Instead of relying on a closed approach, Airbyte adopted an open-source business model. The Airbyte team makes it easy for developers to build connectors for data sources that are still missing. Airbyte is free to use and easy to set up for any engineer who finds the repo on GitHub. Thus, the coverage of the tool continues to grow on its own with each new use case and data connector built by the community. The key capabilities a company needs shift hand in hand with the business model. It is now about serving the contributors in the community, making the tool easily accessible, building out the overarching roadmap, and developing a convenient solution for enterprises that does not feel like a ripoff.
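To make the connector idea more tangible, here is a minimal sketch in Python of how an open connector registry could work. It is not Airbyte’s or Fivetran’s actual API. The names `CONNECTORS`, `register`, and `sync`, the two example sources, and their rows are all made up for illustration, and a plain dictionary stands in for the data warehouse.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
# In this sketch a "connector" is simply a function that yields records
# from one data source.
Connector = Callable[[], Iterable[Record]]

# Hypothetical registry: in an open model, anyone can add an entry.
CONNECTORS: Dict[str, Connector] = {}


def register(name: str) -> Callable[[Connector], Connector]:
    """Decorator that adds a connector to the registry under `name`."""
    def wrap(fn: Connector) -> Connector:
        CONNECTORS[name] = fn
        return fn
    return wrap


@register("crm")
def crm_source() -> Iterator[Record]:
    # Stand-in for an API call to a CRM tool; the rows are made up.
    yield {"id": 1, "account": "Acme"}
    yield {"id": 2, "account": "Globex"}


@register("ads")
def ads_source() -> Iterator[Record]:
    # A connector added later, e.g. by a community contributor.
    yield {"id": 10, "campaign": "spring"}


def sync(source: str, warehouse: Dict[str, List[Record]]) -> int:
    """Extract all records from `source` and load them into `warehouse`."""
    if source not in CONNECTORS:
        raise KeyError(f"no connector for {source!r}; one would have to be built")
    rows = list(CONNECTORS[source]())
    warehouse.setdefault(source, []).extend(rows)
    return len(rows)


# A dictionary of tables stands in for the data warehouse.
warehouse: Dict[str, List[Record]] = {}
loaded = {name: sync(name, warehouse) for name in CONNECTORS}
```

In this sketch, a source without a connector simply fails with a `KeyError`. In a closed model, only the vendor can fill that gap. In an open model, any developer can add the missing function and register it.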

The Rise of Airbyte summarized in Figures

The success proves Airbyte right: within 1.5 years, Airbyte won 16,000 customers, while Fivetran stands at 2,000. Of course, Fivetran’s customers are all paying and Airbyte’s revenue is lower, but given the long-term development and the renaissance of open-source, it is only a matter of time until enough customers pay for Airbyte. The popularity, the developer-friendly features, and the transparent subscription model are too convincing for it to be otherwise.

The remaining problem with open-source

The success story of Airbyte should not hide the problems of open-source. An average contributor works only 3 months on an open-source project before moving on to a new adventure. All the pressure is on the main contributors and initiators of the project. This problem became obvious with the Log4j vulnerability in 2021. Many large companies, including Apple, Microsoft, and Cloudflare, were using the open-source library. When the vulnerability became known, these companies turned to the Log4j team to fix the problem as soon as possible. We are talking about multi-billion-dollar companies here, turning to a group of idealists, some of whom only work on the library in their spare time. Understandably, the situation feels thankless, since most companies neither pay for tech support nor tell the initiators beforehand that they are using the Log4j library. The call grew louder for more transparency and for a system that monetizes and values the work. This is the biggest hurdle open-source software has to overcome. Contributors should feel appreciated and get something in return (money is not always the answer). Systems should be developed that keep contributors on a project longer and value their work adequately.

Some Tweets around log4j vulnerability in 2022

What do we learn from this?

In summary, we are currently experiencing a renaissance of open-source software. Traditional SaaS business models in the data space are difficult to implement because data projects are too complex to solve for customers in a proprietary tool. I used the example of no-code data platforms that rely on SaaS business models but build on open-source without openly admitting it. This can lead to problems with customer satisfaction, as the Log4j case and the Airbyte/Fivetran case study show. Closed SaaS tools are not only expensive to build, they also fail to address the complexity of customer problems. Because closed tools rely heavily on open-source libraries, non-transparent communication also creates major security vulnerabilities. There is a certain frustration and tension on the side of open-source contributors. This holds opportunities to change something and to create completely new tools, business models, and software categories.

Do you have a different perspective? Or would you like to discuss it in more depth? You can easily join our discussion on Slack or just comment below.

The Paradigm Shift of Business Models in the Data Space is Real

Kuwala is open-source and open to contributions