Domain-Driven Data

Yeah, it's data mesh

In my software engineering days, I was a big proponent of Domain-Driven Design (DDD), particularly the part that went beyond tactical code design patterns. In brief, DDD is a methodology for building software that emphasizes modeling software after a business domain, expressing business concepts in code using the same nouns and verbs the business uses (the ubiquitous language of a domain). On the other hand, data management focuses on data integration and warehousing, drawing connections between apparently loosely-related subdomains. The goal is a unified view of the enterprise so that business intelligence and data science can surface insights that cut across an entire company. Can data management learn a thing or two from DDD? Is it the other way around? Or is this a case where what is suitable for software and what is good for data are just plain different?

Bounded Contexts and Data Mesh

One huge difference in philosophy between DDD and traditional data warehousing is that the former says it is unrealistic to adopt a common understanding of terms across an organization, while the latter says that conformity is a Good Thing, maybe even the Best Thing. DDD insists on drawing a bounded context around each business subdomain. Each term should have one unambiguous and well-defined meaning within a bounded context. Across subdomains, however, terms can be reused with different meanings.

For example, suppose a company develops an e-commerce solution. In that case, order within the storefront subdomain (a customer’s request for products) can mean something different than order within the workflow subdomain (a sequencing of tasks). Often, team boundaries are drawn along the lines of a bounded context. In the e-commerce example, there might be separate storefront and workflow teams.

Terms are not shared between bounded contexts

The idea of a data mesh has gained significant traction in the data world in recent years. The core idea of data mesh is the decentralized management of data domains. While DDD often leads to different teams focusing on single subdomains, the implication of data mesh is even more explicitly organizational: the team that owns the creation of the data owns the management of that data, and they are responsible for creating data products to expose that data.

This is a significant departure from the traditional data warehousing philosophy, which emphasizes integration between subdomains within the data warehouse. In that case, there would be a single conformed order dimension (probably drawn from the storefront domain), and the workflow’s order concept would be called something like workflow_order, or it may become a property of workflow_task itself. Even in more flexible data modeling practices like Data Vault, it is assumed that there is a single enterprise data model. This is not a bad thing, but it puts the burden of integrating all subdomains on a single, centralized data team. That data team works in the global enterprise domain in which they need to disambiguate all terms and draw all connections.

Integrating the enterprise data model in a data warehouse

I’m sorry, Kimball, but I think the DDD folks are on to something. You can move faster with less risk by constraining the boundary for which any one team is responsible to a single subdomain. Data model changes are far more manageable, and the team changing the model is the same team that owns the data products that may be affected. Additionally, each domain can use the most natural language to describe its entities and relationships.

What Must Be Centralized?

The promises of data mesh sound great, but surely there is still a need to centralize some things, right?

Most sufficiently large organizations have had to deal with master data management (MDM) in some capacity. Organizations that have adopted BI/DW practices have also seen value from using conformed dimensions, which are entities with business-wide definitions. In both cases, some entities are shared among many different subdomains or are made up of data from multiple subdomains. What solutions might DDD bring to the table?

Shared Kernel

In Domain-Driven Design, there are several well-known patterns for integrating services. One typical pattern is the shared kernel, in which multiple services share a common portion of their domain model. Suppose all of the data domains in a business are using a common data platform. Setting aside a single schema with conformed dimensions that any team can consume in their data products is relatively easy to accomplish. This method is simple but requires either central management of the shared data or a high degree of communication between teams that use the shared kernel.

Separate Domains

The other integration patterns from Domain-Driven Design presume that the shared data is owned by some subdomain and consumed by others. In the conformist pattern, the domain that owns the shared data defines the model, and the consumers, well, conform to whatever the owner requires. This model nicely fits the idea that certain models should be conformed across the business.

Two other patterns do not require conformity between the producer/owner and the consumer: anti-corruption layer and open host service. In an anti-corruption layer, the consumer maintains a component that translates the upstream data models into a more convenient model (dbt, anyone?). In an open host service, the owner exposes a very general data model that is extended to meet the needs of new consumers, which follows the spirit of maintaining backward-compatible data contracts.
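An anti-corruption layer can be as small as a translation function at the consumer's boundary. A hypothetical sketch (the upstream field names are invented for illustration): the consumer maps the storefront's raw order payload into its own reporting model, so upstream renames only ever touch this one function:

```python
# Anti-corruption layer: translate the upstream (storefront) order
# payload into the consuming domain's reporting model. Upstream field
# names here are assumptions for the sake of the example.
def translate_order(upstream: dict) -> dict:
    return {
        "order_id": upstream["id"],
        "customer": upstream["cust_ref"],
        # derive a total the reporting domain cares about
        "total": sum(line["price"] * line["qty"] for line in upstream["lines"]),
    }
```

This is essentially what a staging model in a transformation tool like dbt does: isolate local models from the shape of the source.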

In a small company, using the shared kernel pattern may suffice. However, to avoid a centralized data team becoming a bottleneck, common enterprise data models should live in their own bounded context(s). In other words, MDM and other “enterprise-level” data should operate as if they occupied separate bounded contexts.

Centralizing enterprise data

Treating enterprise data as a subdomain

When enterprise data concerns like MDM are treated as just another subdomain, they can be managed like every other subdomain without becoming a bottleneck. To successfully apply this pattern, a robust platform layer of standard data systems - data warehouse, message bus, etc. - is necessary so that subdomains can integrate data in the way that makes the most sense to them. This is similar to how Domain-Driven software systems often use a messaging middleware like Apache Kafka to integrate.
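To make the integration style concrete, here is a deliberately simplified in-memory stand-in for the message bus (a real system would use something like Kafka, as noted above; topic and event names are hypothetical). The MDM subdomain publishes master-data changes, and any other subdomain subscribes and integrates them however it sees fit:

```python
import json
from collections import defaultdict

class Bus:
    """Tiny in-memory publish/subscribe bus, standing in for Kafka."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        payload = json.dumps(event)  # serialized, as it would be on a real bus
        for handler in self.subscribers[topic]:
            handler(json.loads(payload))

# the MDM subdomain publishes; the storefront subdomain consumes
bus = Bus()
received = []
bus.subscribe("mdm.customer_updated", received.append)
bus.publish("mdm.customer_updated", {"customer_key": 42, "name": "Acme"})
```

The point is that the enterprise-data subdomain integrates with its consumers through the same platform layer as everyone else, rather than sitting in the middle of every data flow.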

Integration between bounded contexts in DDD

What’s Next?

At least at a high level, the values behind Domain-Driven Design are shared by data mesh. It will be interesting to see the emerging patterns and tools to support data mesh, particularly in small and medium-sized businesses with limited data staff. Outside of bounded contexts with a ubiquitous language, we could explore other potential insights from DDD - such as how to apply an anti-corruption layer using data transformation tools and data contracts.

In an upcoming post, I also plan to explore the social/organizational implications of adopting DDD-inspired data mesh patterns. Stay tuned!