How Databricks Enables the Development of a Composable Customer Data Platform (CDP)

We analyze why Databricks is the ideal platform on which to build a composable CDP.

December 17, 2025
Modern Data Platform

Introduction

In the marketing industry, making use of data is fundamental. Data shows what works and what doesn’t, and the correct use of that data results in bigger ROI, better engagement, audience growth and wider reach.

Customer data spans web and mobile interactions, purchase histories, app activity, in-person engagements, demographic details like age, income, and gender, as well as outputs from custom data science models.

Data is the foundation on which any future platform can be built. When we say “data”, it includes ad spend and impression data from ad platforms, clickstream data from websites/apps, attribution data from tools like AppsFlyer, and customer data from CRMs or databases.

That is why clean, healthy and correctly collected data is a must. Data is the foundation on which whatever platform or model is needed will be built; if that foundation is shaky, everything else stands on shifting sands.

What is a Composable CDP?

A composable CDP is a customer data platform that enables you to use any data in your organization to power marketing use cases like audience management, journey orchestration, personalization, and data activation directly from your existing data infrastructure.

Why a CDP is necessary: Challenges faced by marketing teams

Marketing teams often find themselves second-guessing their data or unsure about their own results. Even worse, they may attribute results to actions that didn’t perform as well as they thought.


Teams usually deal with: 

  • Data Silos or Fragmented Data: key marketing data lives in disparate systems: ad platform dashboards, web analytics, mobile attribution tools (AppsFlyer, etc.), CRM databases, and so on.
  • Incomplete buyer journeys: Marketers still struggle to measure which touchpoints actually drive results in today’s multichannel environment.  
  • Incomplete customer profiles: Companies with large user bases and owned media channels struggle to drive engagement when they treat all users the same. Generic one-size-fits-all messages (blast emails, push notifications to everyone) result in low engagement and conversion. They cannot answer “which user is likely to respond to which offer on which channel at what time.” 
  • Wrongly attributed wins to marketing actions: Traditional last-click and rule-based attribution over-credit bottom-funnel channels, ignore or fail to measure offline and non-paid efforts (like SEO, email, or discounts), undervalue branding campaigns, and are further weakened by new privacy limits. Inability to measure offline and online together makes it difficult to compare and understand which initiatives are driving results. 
  • Inefficient Budget Allocation: Without a data-driven approach, deciding how to spread marketing budgets across channels and campaigns is often guesswork. Simple forecasting or evenly splitting budgets ignores diminishing returns and saturation points – spending beyond optimal levels yields little gain. Marketers risk overspending on low-impact channels and underspending on high-opportunity ones. Seasonal trends and external factors are hard to incorporate, leading to suboptimal media plans that waste spend or miss opportunities.
  • Inconsistent metrics: Many teams rely on attributed GMV, but this doesn’t reflect true incrementality or profit, since margin differences and returns are rarely considered. On top of that, different attribution models are often used for planning versus daily optimization. 
  • Manual & Reactive Campaign Management: Managing dozens of campaigns across Google, Facebook, and other platforms is overwhelming. Marketing teams set budgets and targets and rely on platform algorithms, but continuous fine-tuning is needed, as platforms don’t guarantee optimal outcomes for every advertiser. Without automation, teams end up reacting slowly.

The solution to these problems is essentially what a CDP does: it centralizes, stores, and syncs data and composes a 360° view of the customer so marketing teams know where they stand and ideally what they should do next.

What is the difference between traditional CDPs and composable CDPs?

A traditional CDP bundles collection, storage, modeling, and activation into a single platform. A composable CDP adopts a modular approach, integrating with existing data infrastructure like cloud data warehouses (e.g., Snowflake, Databricks). It allows businesses to select and combine best-in-class components for data collection, modeling, and activation, providing greater flexibility and scalability.

What are the main features of a composable CDP?

  • Real Time and Multi-Source Data Collection (Customer 360°)
  • Single Source of Truth
  • Customer Consent and Data Privacy
  • Data Governance and Access Control
  • Integrations and Connectors
  • Self-Service Dashboards and Reports
  • Self-service Audience Builder
  • Integration with ML Models
  • Scalability and Maintainability

Composable CDPs enable real-time, multi-source data integration to build holistic Customer 360° profiles, while incorporating identity resolution to unify fragmented views across systems.

A composable CDP acts as a single source of truth, centralizing all customer-related data while respecting data consent and privacy regulations. It empowers marketers through governed, self-service access to segment audiences, build journeys, and run campaigns without waiting on data teams. Advanced access control and approval workflows ensure alignment across business units.

By integrating seamlessly with external platforms for delivery and internal ML models for smarter predictions, composable CDPs not only automate execution but also continuously learn and improve. Their modular, scalable, and maintainable architectures ensure they can handle both the scale and complexity of modern data-driven marketing.

Ultimately, a composable CDP becomes a strategic marketing engine—fueling personalization, optimizing spend, and boosting campaign effectiveness, all while preserving customer trust and ensuring compliance. For businesses aiming to modernize their marketing stack, this flexible, future-ready approach is no longer a nice-to-have—it’s a competitive necessity.

How Databricks enables building a composable CDP

Real Time and Multi-Source Data Collection (Customer 360°)

Centralizing customer information coming from different sources into a single source of truth takes a lot of work from data engineers.
Databricks Lakeflow Declarative Pipelines gives data engineers a set of useful tools to build ETL pipelines that are easy to develop, maintain and scale to match any input volume without extra work. Features like cluster autoscaling, monitoring of pipeline executions, live metrics and alerts, and managed pipeline execution make developers’ lives easier.

Databricks Lakeflow Declarative Pipelines also has native support for streaming pipelines, which means the data can be processed in real time, a very important capability for a CDP, as already explained.
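
As a rough sketch of what such a pipeline can look like (the source path, table names, column names and quality rule below are illustrative assumptions, not taken from a real deployment), a streaming bronze and silver table pair can be declared in Python like this:

```python
# Minimal declarative pipeline sketch: a streaming ingestion table plus a cleaned
# silver table. Paths, table names, columns and the quality rule are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw clickstream events ingested incrementally with Auto Loader")
def clickstream_bronze():
    return (
        spark.readStream.format("cloudFiles")          # `spark` is predefined in pipelines
        .option("cloudFiles.format", "json")
        .load("s3://example-bucket/clickstream/")       # hypothetical landing path
    )

@dlt.table(comment="Cleaned clickstream events")
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")  # drop rows without an ID
def clickstream_silver():
    return (
        dlt.read_stream("clickstream_bronze")
        .withColumn("event_ts", F.to_timestamp("event_time"))
    )
```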

In contrast, without Databricks engineers usually need to implement a lot of these features using a patchwork of tools and frameworks that require a lot of effort and technical expertise to operate and manage. It takes longer to get the job done, and the staffing and infrastructure costs are higher.

Delta Lake, the table format used by Databricks, offers many advantages over formats such as CSV or Parquet: among other things, faster queries and the ability to restore or roll back old versions of data tables, which is very useful to quickly remediate mistakes without having to reprocess the whole history in batch. This feature is fundamental during the development of the data lakehouse tables, because it is an iterative process that sometimes involves trial and error.
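
For illustration (the table name and version number below are hypothetical), an older version of a Delta table can be queried and then restored in place like this:

```python
# Time travel: query the table as of a previous version (table name and version
# number are hypothetical).
spark.sql("SELECT * FROM cdp.silver.customer_profiles VERSION AS OF 42").show(5)

# Roll the table back in place if the latest writes were wrong.
spark.sql("RESTORE TABLE cdp.silver.customer_profiles TO VERSION AS OF 42")
```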

Databricks Auto Loader supports incremental ingestion of real-time events in several formats such as JSON, CSV, Parquet and XML. Auto Loader reads new data as it arrives in cloud storage and loads it into Delta Lake tables without duplication. It can also infer schemas and types by looking at the values, and even evolve that schema on the fly, which is very useful for data coming from unstable third-party APIs. Without Databricks, data engineers must take care of all these subtleties manually: they have to reinvent the wheel for each new project and write and test ingestion code, which is error prone.
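
A minimal Auto Loader sketch, assuming hypothetical storage paths and a hypothetical target table:

```python
# Auto Loader sketch: incrementally ingest JSON files landing in cloud storage
# into a Delta table, with schema inference and evolution. Paths and the table
# name are illustrative assumptions.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/events/")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # evolve schema on the fly
    .load("s3://example-bucket/landing/events/")
    .writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/events/")
    .trigger(availableNow=True)          # process whatever is new, then stop
    .toTable("cdp.bronze.raw_events")
)
```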

Databricks Lakeflow Connect simplifies the ingestion of real-time incremental data from sources like Kafka and traditional OLTP databases without much developer intervention. Without it, engineers need to implement ingestions manually and deploy hard-to-maintain infrastructure such as Apache Kafka and/or Debezium for CDC.

Finally, there is Databricks Delta Sharing, a framework intended to simplify the process of sharing data across business units and organizations in compliance with data governance policies. In the case of a CDP, enriched data coming from third parties can be easily brought into the Databricks workspaces and used immediately; the tables “just appear there”.
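
As a sketch of the open connector path (the profile file, share, schema and table names are hypothetical; Databricks-to-Databricks shares can also surface directly as catalogs in Unity Catalog):

```python
# Delta Sharing sketch: read a table shared by a third party using the open
# delta-sharing client. Profile file and share/schema/table names are hypothetical.
import delta_sharing

table_url = "/dbfs/FileStore/shares/partner.share#enrichment_share.demographics.segments"
enriched = delta_sharing.load_as_spark(table_url)   # or load_as_pandas(table_url)
enriched.show(5)
```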

In summary, Databricks makes the process of building a centralized data lakehouse painless for data engineers, so the CDP can be implemented much faster and the resulting data quality can be much better.

Single Source of Truth

The fundamental idea of a composable CDP is that it is an open solution built from moving parts that can be independently customized and controlled. A company needs a data lakehouse as a fundamental building block of a CDP.

This Data Lakehouse will usually serve many different purposes for the company, not just be used by the marketing area. It is not just about the CDP, it is about choosing the right solution for the needs of the whole company.

Databricks is an ideal choice for this job because it is the most advanced and best-supported data lakehouse implementation available on the market today.

Customer Consent and Data Privacy

Databricks supports filtering of rows and masking of columns based on attributes of the user performing the query and on conditions in the data itself. This lets us easily implement a compliance solution where marketing teams can only access specific fields of a customer’s data if the customer has given consent to the processing of that information for marketing purposes; otherwise the columns are masked.
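
A minimal sketch of this pattern, assuming hypothetical catalog, schema, table and group names: a column mask function that only reveals the email address when the customer has consented to marketing or the caller belongs to a privileged group.

```python
# Column mask sketch: reveal email only when the customer consented to marketing
# or the querying user belongs to a privileged group. All names are hypothetical.
spark.sql("""
    CREATE OR REPLACE FUNCTION cdp.governance.mask_email(email STRING, marketing_consent BOOLEAN)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('compliance_admins') OR marketing_consent THEN email
        ELSE 'REDACTED'
    END
""")

spark.sql("""
    ALTER TABLE cdp.silver.customers
    ALTER COLUMN email SET MASK cdp.governance.mask_email USING COLUMNS (marketing_consent)
""")
```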

This allows the company to enforce compliance with data protection laws instead of depending on the goodwill of implementers to carry out these checks, which is also very error prone.

Powered by its streaming capabilities, Databricks allows customers to update their consent choices in real time with immediate effect.

Without the features that Databricks offers, developers would need to create different authorized views for each marketing team or each user by joining the customer data with the consents table, and these views would have to be queried every time. This is much harder and more complex to implement consistently and to maintain.

In such a setup, the consents table would typically be updated by a batch process running at some frequency rather than in real time, so customers would have to wait before seeing the impact of changes to their consent settings.

Data Governance and Access Control

Unity Catalog is the Databricks component that we use to implement Data Governance best practices on the CDP. 

Unity Catalog provides a catalog of all the data present in the CDP, which allows teams to quickly explore the available information and avoids them bypassing the CDP to look for information somewhere else. Additionally, it provides access control, where marketing teams request access from the owners of each piece of customer information. Unity Catalog also provides data lineage capabilities, which are important for understanding how PII and other customer data flow through the CDP.
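
As a sketch (catalog, schema, table and group names are hypothetical), granting a marketing group read access to a governed table looks like this:

```python
# Access control sketch: the owner of the customer data grants read access to a
# marketing group. Catalog, schema, table and group names are hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG cdp TO `marketing_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA cdp.gold TO `marketing_analysts`")
spark.sql("GRANT SELECT ON TABLE cdp.gold.customer_360 TO `marketing_analysts`")
```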

Databricks Lakeflow Declarative Pipelines provide CI/CD and version control, plus built-in monitoring and observability features for these pipelines, including data lineage, update history and data quality reporting, all of which are part of the requirements set by a global data governance framework.

Without Databricks, all these features would have to be implemented with a separate, usually proprietary, component that would not be natively integrated into the data warehouse the way Unity Catalog is.

Integrations and Connectors

In a final stage, Databricks Lakeflow Declarative Pipelines can be combined with Spark Structured Streaming to process microbatches of communications and send them as HTTP requests to delivery platforms like Firebase FCM or any other. This process scales to any volume of communications: as long as the delivery platform can handle the delivery throughput, the CDP can handle it too.
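
A minimal sketch of this last stage, assuming a hypothetical source table, endpoint URL and payload shape:

```python
# Sketch: push each microbatch of pending communications to a delivery platform
# over HTTP. The source table, endpoint URL and payload shape are hypothetical.
import requests
from pyspark.sql import DataFrame

def send_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Simple sequential send; fine for modest batch sizes in a sketch.
    for row in batch_df.toLocalIterator():
        requests.post(
            "https://delivery.example.com/api/v1/messages",   # hypothetical endpoint
            json={"user_id": row["customer_id"], "message": row["message"]},
            timeout=10,
        )

(
    spark.readStream.table("cdp.gold.pending_communications")
    .writeStream
    .foreachBatch(send_batch)
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/comms/")
    .start()
)
```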

On the other hand, Zerobus, part of Databricks Lakeflow Connect, supports direct ingestion of feedback data from delivery platforms in the form of streams from multiple sources or OLTP databases. This is very useful to collect feedback on delivery status, for example from Firebase FCM, which posts its delivery status reports to BigQuery, or to collect events from actions that customers take in applications.

Without Databricks, all these features would have to be implemented using a patchwork of tools and frameworks that are harder to develop and maintain.

Self-Service Dashboards and Reports

Databricks has a natively integrated dashboarding tool. It is very easy for non-developers and business analysts to use, with a low-code interface very similar to Power BI and a similar quality of dashboards.

The advantage is that the data is already available in the same platform, so no migration or transfer of data into another environment is needed, and no additional licences are required.

Without Databricks, dashboarding usually happens in a different platform than the Data Warehouse, therefore increasing the complexity and surface of the solution, and incurring additional costs.

Integration with ML Models

Last but not least, Databricks gives data scientists a lot of tools to implement the whole ML model lifecycle within Databricks itself.

Databricks clusters specifically tailored to ML workloads can be provisioned. These come already set up with the most commonly used ML frameworks and tools such as PyTorch, TensorFlow, XGBoost, Spark MLlib and scikit-learn, among others. All the coding can be done in notebooks and later moved into scheduled pipelines within Databricks. Without this feature, ML clusters must be provisioned by DevOps for the development team each time a new cluster is needed, and each cluster must also be maintained, increasing costs, complexity and bureaucratic obstacles.

The training of models can be performed on these clusters in a scalable and distributed fashion, supporting arbitrarily large training datasets. Distributed training uses standard libraries such as TorchDistributor, DeepSpeed and Spark MLlib.

The serving of models can also be performed in real time on Databricks and is horizontally scalable. Inference can use clusters provisioned with GPU instances for high-performance serving. With this setup, each model becomes an HTTP API that you can integrate into any web or client application. Without this feature embedded in Databricks, the development team would need DevOps assistance each time a model has to be served for the rest of the company, increasing costs, complexity and bureaucratic obstacles.
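
For illustration, calling such a serving endpoint is a plain HTTP request (the workspace URL, endpoint name, token variable and feature names below are hypothetical):

```python
# Sketch: call a Databricks Model Serving endpoint as a plain HTTP API.
# Workspace URL, endpoint name, token variable and feature names are hypothetical.
import os
import requests

response = requests.post(
    "https://my-workspace.cloud.databricks.com/serving-endpoints/churn-propensity/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"dataframe_records": [{"customer_id": 123, "days_since_last_order": 12}]},
    timeout=30,
)
print(response.json())
```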

Databricks also features tracking of experiments and of the model lifecycle using MLflow, a standard tool used by data scientists. This experiment and model lifecycle tracking is tightly integrated with Unity Catalog to apply model governance practices.
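
A minimal sketch of this workflow, with a hypothetical model name and a trivial toy dataset:

```python
# Sketch: track a training run with MLflow and register the model in Unity Catalog.
# The model name, metric and toy training data are hypothetical.
import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.set_registry_uri("databricks-uc")   # use Unity Catalog as the model registry

X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="cdp.models.churn_propensity",  # catalog.schema.model
    )
```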

Databricks also supports running SQL queries that call text-generation functions backed by foundation models such as LLMs. These models are frequently used in the context of a CDP for the automatic generation of communication content. Without this Databricks feature, LLM inference would usually be performed outside SQL and later integrated with the main pipeline, a process that is more complicated to scale correctly and requires more implementation effort; with Databricks, the effort comes down to a single function call in a SQL query.
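
A sketch of this pattern using the ai_query SQL function (the endpoint name, source table, column and prompt are illustrative assumptions):

```python
# Sketch: generate communication copy with a foundation model directly from SQL
# using ai_query(). Endpoint name, table, column and prompt are hypothetical.
generated = spark.sql("""
    SELECT
        customer_id,
        ai_query(
            'databricks-meta-llama-3-3-70b-instruct',
            CONCAT('Write a one-sentence push notification for a customer interested in ', favorite_category)
        ) AS push_copy
    FROM cdp.gold.customer_360
    LIMIT 10
""")
generated.show(truncate=False)
```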

If your organization is exploring this shift, we are happy to share what we’ve learned building these systems across industries. Muttdata was named LATAM AI Partner of the Year by Databricks, and our certified experts have the expertise to help scale your marketing team’s data. Let’s set up a call and have a conversation!
