Data is the lifeblood of any organization paying even the slightest attention to the world around it. It drives everything from operational decision-making to advanced analytics and artificial intelligence. Over the years, how we store, manage, and access data has evolved significantly. From traditional databases to emerging paradigms like data meshes, this evolution reflects changing business needs, technological advancements, and shifting data access patterns.
Let’s explore the evolution of data platform architectures, examining the motivations behind each and the use cases they address. We’ll also dare to look at where data platforms might go.
I give myself a few labels, with technologist being one of them. As such, with the rise of generative AI, it’s been an exciting time to see how this revolutionary advancement has amped up the technology community across so many facets, from hardware to software. Equally, I find myself stopping to think about the systems required to support this new surge in technological capability. When it comes to anything related to Artificial Intelligence (AI), I always start with one question: How’s your data?
Show me the fastest car on the planet, and I’ll show you a costly paperweight if you don’t have the fuel (or energy) to power it. Data is that fuel, that energy, for AI. As I reflect on the role of data in AI, I find myself thinking back through the many changes the data platform concept has experienced and the drivers that took us there. To help us understand where we’re going, let’s look back at the evolutionary changes of the data platform architecture. Perhaps by looking backward, we can look forward to where the data platform concept is evolving and why you should pay attention.
In the beginning, data needed to be stored.
One could argue that cave walls were the first known data storage system, where artwork captured narratives and daily lives. That moved to stone carvings, papyrus, paper, and then computers (with many other mediums along the way). However, computers didn’t immediately revolutionize or digitize data storage. In the early days, data lived on paper punch cards. Then operating systems were born, and with them came digital file systems, and the data storage explosion began. First with files and then with organized software layers sitting on top of files to help make data writing and fetching more cohesive and consistent. Enter the database.
Relational databases: The foundation of structured data
Relational databases (RDBs) became prominent in the 1970s, spearheaded by Edgar F. Codd’s relational model. They emerged as a solution for managing structured data consistently and reliably, especially as organizations began digitizing their operations. The SQL (Structured Query Language) standard further solidified their dominance by providing a powerful yet intuitive means of querying and managing data. RDB technology quickly created a new market where commercial solutions offered software developers a robust data storage and management tool. DBAs reigned supreme and often commanded compensation rivaling that of hedge fund managers (okay, maybe not that much, but if you knew an Oracle™ DBA in the 1990s and early 2000s, you know what I mean).
As demand and performance needs grew, so did the creativity in design to accommodate them. For example, if you had a schema with Customer and Order tables, you could place each table’s data on a different hard disk so that queries spanning both tables weren’t bottlenecked by a single disk serving reads for two tables at once.
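To make the relational model concrete, here’s a minimal sketch using Python’s built-in sqlite3 module. The Customer/Order schema and all values are illustrative, and SQLite itself has no notion of per-table disk placement (engines like Oracle accomplish that with tablespaces):

```python
import sqlite3

# Minimal illustration of the relational model: two related tables and a
# declarative SQL join. Schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 42.50), (11, 1, 19.99), (12, 2, 7.25);
""")

# A typical operational (OLTP) lookup: one customer's orders. The query is
# declarative; the engine decides how to fetch the rows, which is what made
# tricks like per-table disk placement possible without rewriting queries.
for name, order_id, total in conn.execute("""
    SELECT c.name, o.order_id, o.total
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE c.customer_id = 1
"""):
    print(name, order_id, total)
```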
Digitizing data relationships was a monumental advancement in converting data to value. Insights could be gleaned at scales that were previously impossible or too costly to obtain manually. But as demand for insights grew, the traditional operational relational structures and database engines were taxed into mediocre performance.
Data warehouses: Optimizing (descriptive) analytics
In the 1980s and 1990s, organizations realized that operational databases were inefficient for running analytical queries. Data warehouses emerged as a solution, providing a centralized repository optimized for Online Analytical Processing (OLAP). These systems enabled businesses to aggregate data from multiple sources and generate insights through complex queries. Business Intelligence, cubes, and dashboards were born (and metrics-driven management exploded, but that’s another article).
Customer buying habits, sales performance, inventory management, payroll, production output, historical trends, and everything in between were all there: one query, histogram, pie chart, line graph, and candle chart away. As its insatiable appetite for analytics grew exponentially, the business cried, “Feed me, Seymour.”
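As a minimal sketch of that “one query away” shape, here’s the kind of aggregate query a warehouse is optimized to serve. A toy fact table stands in for a real star schema, and all names and numbers are made up:

```python
import sqlite3

# A toy fact table and the "slice and dice" query shape that OLAP systems
# optimize for. Schema and figures are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (
        sale_date TEXT,   -- 'YYYY-MM-DD'
        region    TEXT,
        product   TEXT,
        amount    REAL
    );
    INSERT INTO sales_fact VALUES
        ('2024-01-05', 'West', 'widget', 120.0),
        ('2024-01-17', 'East', 'widget',  80.0),
        ('2024-02-03', 'West', 'gadget', 200.0),
        ('2024-02-21', 'West', 'widget',  60.0);
""")

# Unlike an operational lookup, this scans and aggregates many rows at once:
# revenue by region and month, the raw material for a BI dashboard.
for region, month, revenue in conn.execute("""
    SELECT region, substr(sale_date, 1, 7) AS month, SUM(amount)
    FROM sales_fact
    GROUP BY region, month
    ORDER BY region, month
"""):
    print(region, month, revenue)
```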
Now, we have databases with structured data and warehouses with analytics-focused structured data (often massaged but still duplicating the database data). On the plus side, our operational databases have been relieved of the insatiable analytics demand engine. On the downside, the poor data team is continually chasing its tail, hunting down why the sales number in the CEO’s data warehouse BI dashboard doesn’t reflect what’s in the sales database.
Fret not, for while the early adopters faced challenges related to cost and scalability, a revolution was coming—a cloud revolution led by platforms like Amazon Redshift, Snowflake, and Google BigQuery. These platforms offer previously unimaginable horizontal scale, making data warehouses more accessible, elastic, and cost-effective.
Relational databases (even the data warehouse ones) remain a cornerstone of many business applications. They’re fantastic tools when someone has worked to structure the data. There’s just one small problem. If you slice and dice the world’s data into two buckets, structured and unstructured, you’ll find that 80 to 90% of all data falls in the unstructured bucket. Relational databases do many things well – handling unstructured data and large-scale analytics aren’t among them. Thus, other data, dare I say it, “platform” architectures emerged. Necessity is the mother of invention. Let’s go swimming.
Data lakes: Managing unstructured and semi-structured data
The rise of big data in the 2000s highlighted limitations in data warehouses, particularly their inability to handle unstructured and semi-structured data such as logs, images, and social media content (thanks for forever tainting the “thumbs up” gesture). Not to be deterred, the data platform gurus conjured up data lakes as scalable repositories that could store raw data in its native format at scale – very, very big scale. And what happens when you remove constraints from a system? When you lower the cost and effort of storing data? You get Parkinson’s Law for data. Data Lake becomes Data Swamp. But it’s not all mud, mosquitos, and alligators.
Expanding the data platform to store and serve unstructured data exponentially increases the amount of data available for analytics, data science, machine learning, IoT, real-time data processing, and more. There are two temptations to resist:
- Easy-Button Data Architecture: There is a real temptation to bypass data modeling and dump all the data into a data lake because pulling structure out of unstructured data can be challenging, time-consuming, costly, and tedious. However, structured data still plays a role in many use cases.
- Cookie vs. Cake: The value of a data lake isn’t just that it contains unstructured data that can be inspected and leveraged independently for analytics (cookie). It’s that multiple unstructured data sources can be analyzed together in context to extract tremendous value (cake). If I’m a plant operator, perhaps I care about sensor data, reports, and historical maintenance records. Data lakes let you combine those elements to build a layer cake of perspective (see the sketch after this list).
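Continuing the plant-operator example, here’s a minimal data lake sketch using boto3: raw data lands in its native format with no upfront modeling. The bucket name, key prefixes, and report file are hypothetical, and AWS credentials and region are assumed to be configured in your environment:

```python
import json

import boto3

# Hypothetical landing of raw data in a lake. The bucket and keys below are
# invented for illustration; nothing is modeled or transformed up front.
s3 = boto3.client("s3")
BUCKET = "example-plant-data-lake"  # hypothetical bucket name

# A semi-structured sensor reading lands as raw JSON.
reading = {"sensor_id": "pump-7", "temp_c": 81.4, "ts": "2025-01-15T10:32:00Z"}
s3.put_object(
    Bucket=BUCKET,
    Key="raw/sensors/pump-7/2025-01-15.json",
    Body=json.dumps(reading),
)

# An unstructured maintenance report lands beside it in its native format,
# ready to be analyzed later in context with the sensor data (the cake).
with open("pump-7-maintenance.pdf", "rb") as report:  # hypothetical file
    s3.put_object(
        Bucket=BUCKET,
        Key="raw/reports/pump-7-maintenance.pdf",
        Body=report,
    )
```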
Platforms like Apache Hadoop and, later, cloud-native solutions such as Amazon S3 and Azure Data Lake Storage made data lakes practical and scalable. Are you noticing a cloud pattern here? Cloud for the scale win, but is there a “too big” when it comes to data lakes?
Data fabrics: Integrating and orchestrating data
The need for seamless data integration grew as organizations began leveraging multiple data storage technologies, from relational databases to data lakes. Data fabrics emerged in the mid-to-late 2010s as an architectural approach that integrates data across disparate sources and provides real-time access through intelligent automation. Don’t read this as a unification of decentralized data platforms but as a technological solution to unify access to disparate data sources. From structured to unstructured data, data consumers want a unified data access experience; data fabric platform architectures enable that capability. In other words, we learned that not everything goes into one data lake.
Data fabrics integrate data across hybrid and multi-cloud environments, provide real-time operational analytics, and accelerate data availability for AI and ML model building and training. Data fabrics reduce the complexity of managing diverse data platforms by focusing on metadata-driven automation. Technologies such as knowledge graphs, semantic layers, and AI-driven orchestration play a key role in implementing data fabrics.
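As a rough sketch of that idea (every class and name below is invented for illustration; real fabrics layer on knowledge graphs, semantic models, and policy enforcement), the core pattern is a metadata catalog fronting disparate sources behind one access interface:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Source:
    name: str
    kind: str                     # e.g., "rdbms", "warehouse", "lake"
    tags: List[str]               # metadata used for discovery
    fetch: Callable[[str], list]  # connector supplied per source

@dataclass
class DataFabric:
    catalog: Dict[str, Source] = field(default_factory=dict)

    def register(self, source: Source) -> None:
        self.catalog[source.name] = source

    def find(self, tag: str) -> List[str]:
        # Metadata-driven discovery: locate sources by tag, not by system.
        return [s.name for s in self.catalog.values() if tag in s.tags]

    def query(self, source_name: str, expression: str) -> list:
        # One entry point for consumers; the fabric routes the request to
        # the right underlying connector.
        return self.catalog[source_name].fetch(expression)

fabric = DataFabric()
fabric.register(Source("sales_db", "rdbms", ["sales"], lambda q: [("stub-row",)]))
fabric.register(Source("plant_lake", "lake", ["sensors"], lambda q: ["raw-object"]))
print(fabric.find("sensors"))               # -> ['plant_lake']
print(fabric.query("sales_db", "SELECT 1")) # routed through the fabric
```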
Looking back, we can see that increasing demand for scale drives the shift and expansion of data platform architectures from one to many integrated data sources. Where do we go from here? If we follow the scale curve, we’ll cross the line where managing the whole data platform from a centralized location no longer makes sense. The data-hungry organization overwhelms the central data team, and it’s unlikely that team knows the data as well as its stakeholders do. Let’s scale out even further.
Data meshes: Decentralizing data ownership
Traditional centralized data architectures often become bottlenecks as organizations scale. Zhamak Dehghani popularized the concept of data meshes, which address this issue by decentralizing data ownership to domain teams. Each domain is responsible for its data as a product, with well-defined interfaces for sharing and consumption. Within a data mesh, the data platform itself is centrally managed, while data products are owned by decentralized, domain-driven teams.
Data meshes emphasize domain-oriented decentralization, self-serve data infrastructure, and federated governance. Cloud platforms significantly enable these principles by providing the necessary infrastructure-as-code, API-driven tooling, security, and data governance enforcement.
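Here’s a hypothetical sketch of the “data as a product” contract idea. All names are invented, and real meshes enforce contracts through platform tooling rather than a single class, but the shape is the point: an explicit, versioned interface owned by a domain team:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class DataProductContract:
    domain: str         # owning domain team
    name: str
    version: str
    schema: dict        # field name -> type: the published interface
    freshness_sla: str  # e.g., "updated hourly"
    owner_contact: str

@dataclass
class DataProduct:
    contract: DataProductContract
    read: Callable[[], list]  # the self-serve consumption interface

orders_product = DataProduct(
    contract=DataProductContract(
        domain="sales",
        name="orders_curated",
        version="1.2.0",
        schema={"order_id": "int", "customer_id": "int", "total": "float"},
        freshness_sla="updated hourly",
        owner_contact="sales-data@example.com",
    ),
    read=lambda: [{"order_id": 10, "customer_id": 1, "total": 42.5}],  # stub
)

# Consumers in other domains depend on the contract, not on the sales
# team's internal pipelines.
print(orders_product.contract.version, orders_product.read())
```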
When should you data mesh? When your organization has complex, domain-specific needs, when you need to scale your data architecture across large, distributed teams, or when you’re looking to drive insights at scale using domain-driven design. Data meshes are the talk of the town. Still, they can easily overwhelm an organization, leaving a decentralized design that works well on paper but suffers the same bottlenecks as a fully centralized data platform in practice. For additional reading, I have an article on why Data Mesh architectures are as much technology as they are organizational design.
A quick note on evolving data access patterns
The evolution of data architectures is also driven by changes in how data is accessed:
- From batch to real-time: Early systems focused on batch processing, while modern architectures prioritize real-time data access.
- From centralized to decentralized: Traditional architectures centralized data management, but modern approaches like data meshes distribute ownership.
- From structured to multi-structured: Organizations now handle diverse data types, including text, images, and sensor data.
These shifts highlight the importance of flexible, scalable architectures that support emerging use cases. With this insight, we can glimpse where data platform architectures are headed next.
Reading the data platform tea leaves
Let’s revisit the evolution to see where we might head next.
- Files => Databases: Structure drives insights
- Databases => Data Warehouses: Separate operations and analytics
- Data Warehouses => Data Lakes: Widen the net to unstructured data
- Data Lakes => Data Fabric: Unify access patterns to all things data
- Data Fabric => Data Mesh: Decentralize (and de-silo) data products
- Data Mesh => ?: Multimodal Analytics
We’ve already seen the push for multimodal generative AI – models that can interpret and process context seamlessly across text, images, video, and audio. From my perspective, we will move through another Data Fabric-like convergence, where technology emerges that wraps data platforms, enabling analytics regardless of source. We got a hint of this at Matt Garman’s 2024 re:Invent™ Keynote when he announced Amazon S3 Metadata, a service that automatically extracts metadata from objects stored in S3 into Apache Iceberg™ tables that can be queried with Amazon Athena™ (and others).
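As a hedged sketch of what querying that extracted metadata might look like from Python, here’s a boto3 call that submits an Athena query. The database, table, and results-bucket names are placeholders, not the names S3 Metadata actually generates; check your own account’s configuration:

```python
import boto3

# Submit a query against the metadata table via Athena. All names below are
# placeholders invented for this sketch.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT *
        FROM my_metadata_table  -- placeholder for the S3 Metadata Iceberg table
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "my_metadata_db"},      # placeholder
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```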
Imagine a world where unstructured and structured data are streamed into your centrally managed platform with decentralized domain-driven data products comprising sensor data, drone data, satellite imagery, acoustic analysis, weather, and much more, where context is automatically extracted and made available for querying. Imagine writing a single query that inspected multimodal structured and unstructured data types, allowing you to layer data into a unified view with robust insights. We are heading there next, and AI will play a significant role.
Conclusion
From the advent of relational databases to the rise of data meshes, the evolution of data platform architectures reflects the ever-changing landscape of business needs and technological advancements. Each architecture has strengths and limitations, addressing specific use cases and challenges. The cloud has further accelerated this evolution, enabling organizations to adopt sophisticated solutions with unprecedented agility and scale.
As we look to the future, the focus will likely shift toward hybrid models combining the best centralized and decentralized paradigms alongside innovations in AI-driven data orchestration, governance, and automatic multimodal context and metadata extraction. Regardless of the architecture, the goal remains the same: enabling organizations to unlock the full potential of their data.
Regardless of where you are on your data platform journey—from databases to data meshes, from on-prem to the cloud—Pariveda has the experience to help you navigate the sea of data platform architectures and choose the architecture (or architectures) that best suits your business needs. Contact me if you want to explore what’s next in your journey.