Modern Data Stack and the Data Chasm Part 1: Emergence of Complexity in Data Systems

Emergence of Complexity in Data Systems

State of the Data Market

Today's data teams are overwhelmed by a plethora of poorly integrated technologies, accumulated over the years by buying into hype, investing in half-baked solutions, and succumbing to FOMO on the latest (yet unproven) niche innovations. Moreover, many have relinquished control of their architecture to vendors who prioritize selling “futures” rather than addressing fundamental issues. What's left is a monstrous (modern!?) stack of mismatched parts (or Pokémon [1]): open source and proprietary, new and old, best-of-breed and cheap imitations, all cobbled together with little design or governance in mind.

No one sets out to build such a mess on purpose. Rather, an excess of choice in the data marketplace, rapid hype cycles, fear of lock-in to any single vendor's agenda, decentralized buying across groups, and a willingness to buy into futures over fixing the present lead to an increasingly uncoordinated technology sprawl. Alone, each new tool seems like a reasonable quick fix or experiment, but collectively they form a Gordian knot of complexity. This complexity has its roots in the ever-expanding nature of the data market, a phenomenon we must understand to navigate the challenges it poses.

Lift-off: Multidimensional Expansion

Expansion in the data ecosystem is multidimensional and self-perpetuating. One dimension is the type of data we utilize, which is driven by both supply-side capabilities (we can now capture and provide more data about everything) and demand-side forces (data consumers wanting more from data teams). Structured, unstructured, time-series, geospatial, event and even multi-modal data are all growing exponentially, each with its unique properties and applications. This growing versatility of available data and its potential to generate value across various fields, from business intelligence and marketing analytics to scientific research and urban planning, continuously expands the universe of use cases and applications. As we leverage data in increasingly sophisticated ways, the demand for new and different kinds of data also grows, forming a positive feedback loop that further stimulates the data market's seemingly endless expansion.

In tandem with the expansion in data types, the types of databases used to store, retrieve, and manage data have also diversified. For instance, traditional structured data are best handled by relational databases like PostgreSQL, while unstructured data have found their niche in NoSQL databases like MongoDB. Time-series data, central to IoT applications and financial analyses, led to the rise of specialized databases like InfluxDB. Event data, crucial for real-time analytics and application monitoring, are managed efficiently by message brokers like Kafka. Development of such dedicated technologies underscores the breadth and diversity of data storage and management solutions resulting from the expansion in data types.

The cascading effects of this expansion do not stop at databases. The rapid proliferation of data types and corresponding databases has fostered the evolution of middleware technologies too. These include ETL tools like Talend that integrate disparate databases, data processing frameworks like Apache Spark that handle large-scale data processing separately from databases, and data cataloging tools like Alation that provide a collaborative platform for data discovery. These technologies form a new dimension of growth, the “Middleware Matrix”, encompassing an array of solutions designed to connect, integrate, and manage the myriad databases emerging from the expansion in data types.

Alongside the Middleware Matrix, the growing diversity of data types is fueling the development of an array of orthogonal applications. For example, structured data underpins the creation of CRUD applications vital to web development. Streaming data has led to the emergence of Application Performance Monitoring (APM) applications like Datadog, Splunk and Dynatrace, crucial for real-time system monitoring. More recently, vector databases have begun facilitating Large Language Model (LLM) applications, unlocking new possibilities in natural language processing. Each new application contributes to an “Application Area” dimension, adding another layer of complexity to the expanding data market.

Figure 1: Evolution of Data and AI Space from 2012 to 2023, by Matt Turck

In conclusion, the multi-dimensional expansion of the data market is driven by the growth of data types and infrastructure, the evolution of middleware technologies, and the proliferation of orthogonal applications. Each dimension feeds into and enriches the others, collectively shaping a rapidly expanding and dynamic data market. This expansion presents a rich array of opportunities, but it also poses significant challenges, requiring organizations to adopt a strategic and pragmatic approach to navigate their way in this complex data ecosystem.

Fragmentation: Hitting the Data Chasm

In an effort to serve the ever-growing set of needs, the data market fragments as it expands. However, we are now witnessing this fragmentation curtail the maturity of the market [2][3][4]: value creation stalls as dissemination of the technology plateaus. The key insight is that even though tech market fragmentation is inevitable, it can (and must) be actively managed to prevent stalling maturity. Unchecked, fragmentation can lead to missed opportunities and an inability to fully capitalize on innovations. To illustrate this point, let's draw a parallel with the cybersecurity ecosystem, where tool sprawl is not only a natural occurrence but a necessity.

In cybersecurity, organizations invest in a multitude of tools to mitigate or manage the ever-evolving risks that emerge from their IT investments. Given the segregated duties across dedicated teams and consequential risks of procuring large numbers of security devices from a single vendor, cybersecurity finds strength in its siloed nature. Since a single department or leader often lacks the necessary perspective to optimize at the macro level, the fragmentation of tools seems justified as the landscape continues to complexify.

In the data ecosystem, while one may argue for certain advantages that fragmentation brings, the underlying dynamics are not really analogous to those in cybersecurity. Even though organizations tend to invest in a new type of database for each new type of data, spreading data across regions, devices, and databases, data inherently generates more value when integrated. Many data teams have moved away from data silos for this very reason, in contrast with cybersecurity teams, who see silos as a way of mitigating risk and perceive them as essential. The lens through which data teams look at their systems is predominantly one of engineering costs and business value, rather than risk.

Visualize the data ecosystem's demand side, i.e., organizations utilizing data tooling, as a bell curve. At one extreme, we have small teams using only a handful of tools, such as:

  • A gaming company using Firebase, BigQuery and Looker for their analytics workflows, 
  • A SaaS company using Prometheus and Grafana for observability, 
  • Companies that opt for special-purpose, ready-to-use solutions, like Datadog for APM [5].

As we move to the other extreme, we see large teams with hyper-specialized needs, met by 50+ tools. Tooling fragmentation depends on the maturity of the team: Traditional enterprises tend to use a mixture of core tech/middleware and special-purpose applications, but when we move into the realm of tech giants (e.g. FAANG/MATANA), we start encountering platform engineering teams that deconstruct these applications and reconstruct them in the most cost-effective and customized way for that organization.

However, in the heart of this distribution, the so-called "fat belly of the market," midsize teams find themselves facing a peculiar phenomenon: As their data needs grow, they quickly need to leap from using 2-4 tools to wrestling a staggering 15-20. This sudden expansion triggers a surge in integration and alignment challenges, exacerbated by inadequate budgets and a talent market that falls short of meeting the growing demand. These teams often resort to employing new tools in an attempt to alleviate system complexity, inadvertently plunging into a vicious cycle: In an effort to solve tool complexity problems, they procure more tools, ending up compounding the issue and entrapping themselves in a “data chasm”.

Figure 2: The seemingly paradoxical data chasm challenging mid-size companies.
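
To get a rough feel for why the jump from 2-4 tools to 15-20 is so painful, consider a back-of-the-envelope sketch. It assumes, purely for illustration, a worst case in which any tool may need a point-to-point integration with any other tool; real stacks are rarely fully connected, and the function name `pairwise_integrations` is hypothetical.

```python
from math import comb


def pairwise_integrations(tool_count: int) -> int:
    """Upper bound on point-to-point integrations among `tool_count` tools.

    Illustrative assumption: every pair of tools may need its own integration.
    """
    return comb(tool_count, 2)


# Tool counts taken from the ranges mentioned in the text (2-4 and 15-20).
for tools in (2, 4, 15, 20):
    print(f"{tools:>2} tools -> up to {pairwise_integrations(tools)} pairwise integrations")
```

Under this assumption, the number of potential integration points grows quadratically while the tool count grows only linearly, which is one way to read the surge in integration and alignment challenges that midsize teams experience.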

From the perspective of vendors and venture capitalists, this vicious cycle mirrors a flywheel effect, driving rapid startup growth. However, this viewpoint tends to overlook the escalating problems on the end-user side, where a few individuals are left grappling with the needs of a fairly large company using an unwieldy array of tools.

As the data market relentlessly expands and fragments, the ensuing complexity and cost of managing data intensify for all data teams. Particularly affected are the midsize teams, trapped in a vast data chasm and grappling with the challenge of charting a path out. Nevertheless, certain trailblazing organizations have succeeded in crossing this chasm, extracting value without further deepening their predicaments. In the next part, we will go over some case studies and look at how these pioneering companies worked out lean data methodologies to navigate the data chasm.

References and Remarks

[1] Anyone remember “Is it Pokemon or Big Data?”?

[2] Wes McKinney's blog post looks back on the technical evolution of the big data ecosystem over 15 years. 

[3] Benn Stancil likens this fragmentation to gerrymandering.

[4] This blog post and tweetstorm by Erik Bernhardsson, CEO of Modal, discusses the data ecosystem's natural tendency to fragment and the right level of specialization. 

[5] Such solutions can be thought of as wrapped, turnkey versions of core data technologies and the middleware matrix. An APM solution, for example, could be abstracted as a TSDB coupled with solid data processing capabilities; a CRM solution can be thought of as a SQL DB with workflows.

Mehmet Ozan Kabak

Co-founder and CEO @ Synnada

Sami Can Tandoğdu

Co-founder and COO @ Synnada
