Data Architect: The Strategist
The role that defines where data lives, how it flows, who owns it, and how it stays trustworthy — at the scale of an entire organization.
What a Data Architect Actually Does
If the Data Engineer builds roads, the Data Architect designs the city. A Data Architect is not writing daily pipeline code — they are defining the standards, patterns, and governance policies that make hundreds of pipelines coherent over a 3–5 year horizon.
The role requires deep technical knowledge (storage formats, cloud services, modeling paradigms) combined with the ability to influence stakeholders, lead governance committees, and communicate architectural tradeoffs to executives. It is simultaneously one of the most technical and the most political roles in a data organization.
Core Concepts
1. The Medallion Architecture (Bronze / Silver / Gold)
The Medallion Architecture is the most widely adopted pattern for organizing data in a Lakehouse. It structures data into three progressively refined layers, each serving a distinct purpose:
| Layer | Also Called | Contents | Who Writes | Who Reads |
|---|---|---|---|---|
| Bronze | Raw / Landing | Exact copy of source data, immutable. No transformations. Stores original JSON, CSV, or CDC events. | Ingestion pipelines | Processing jobs, auditors |
| Silver | Conformed / Cleansed | Validated, deduplicated, standardized schema. Business rules applied. Referential integrity enforced. | dbt models, Spark jobs | Data analysts, downstream pipelines |
| Gold | Business / Curated | Aggregated KPIs, semantic models, denormalized fact tables, feature store for ML. | dbt models, semantic layer | BI tools, executives, ML models, APIs |
The key point is that Bronze is never modified: it is your audit trail and the source for any reprocessing. Silver is where data quality and record-level business rules are enforced. Gold is where business-facing aggregations and metric definitions live. Each layer has its own access controls, retention policies, and SLA expectations.
Medallion Architecture — Full Flow
flowchart TD
subgraph EXT["External Source Systems"]
CRM["CRM Systems\nSalesforce · HubSpot"]
ERP["ERP / SAP\nTransactional DBs"]
EVT["Event Streams\nKafka · Kinesis"]
API["APIs & Webhooks\nThird-Party Data"]
end
subgraph BRONZE["Bronze Layer — Raw Zone"]
B_RAW[("Raw Files (JSON · CSV · Parquet)\nExact Copy · Immutable")]
B_CDC["CDC Events\nDebezium · Change Streams"]
B_META["Schema Registry\nMetadata & Lineage Tags"]
end
subgraph SILVER["Silver Layer — Conformed Zone"]
S_VAL["Validated & Deduplicated\nData Quality Checks Pass"]
S_STD["Standardized Schema\nConsistent Types & Formats"]
S_BIZ["Business Rules Applied\nReferential Integrity Enforced"]
end
subgraph GOLD["Gold Layer — Business Zone"]
G_KPI["KPI Aggregations\nFact & Dimension Tables"]
G_SEM["Semantic Layer\ndbt Metrics & Exposures"]
G_FEAT["Feature Store\nML-Ready Datasets"]
end
subgraph CONSUMERS["Data Consumers"]
BI["BI & Tableau\nDashboards"]
DS["Data Science\n& ML Training"]
EXEC["Executive Reports\nC-Suite Dashboards"]
PROD["Data Products\n& APIs"]
end
EXT --> BRONZE
B_RAW --> S_VAL
B_CDC --> S_VAL
S_VAL --> S_STD --> S_BIZ
S_BIZ --> G_KPI
S_BIZ --> G_SEM
S_BIZ --> G_FEAT
G_KPI --> BI
G_KPI --> EXEC
G_KPI --> PROD
G_SEM --> BI
G_SEM --> DS
G_FEAT --> DS
The Medallion Architecture: data is ingested raw (Bronze), cleaned and conformed (Silver), then curated for business consumption (Gold).
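The three layers can be illustrated with a minimal in-memory sketch. The sample events, field names, and the `to_silver` / `to_gold` helpers are hypothetical; a real platform would do the same steps with Spark or dbt against object storage rather than Python lists.

```python
import json
from collections import defaultdict

# Bronze: raw payloads stored exactly as received (hypothetical sample events).
bronze = [
    '{"order_id": "A1", "amount": "19.90", "country": "de"}',
    '{"order_id": "A1", "amount": "19.90", "country": "de"}',  # duplicate delivery
    '{"order_id": "A2", "amount": "5.00", "country": "US"}',
    '{"order_id": "A3", "amount": "oops", "country": "US"}',   # fails validation
]


def to_silver(raw_rows):
    """Validate, deduplicate, and standardize types without touching Bronze."""
    seen, silver_rows = set(), []
    for raw in raw_rows:
        row = json.loads(raw)
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would route this row to a quarantine table
        if row["order_id"] in seen:
            continue  # drop duplicate deliveries of the same order
        seen.add(row["order_id"])
        silver_rows.append(
            {"order_id": row["order_id"], "amount": amount, "country": row["country"].upper()}
        )
    return silver_rows


def to_gold(silver_rows):
    """Aggregate a business-facing KPI: revenue per country."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'DE': 19.9, 'US': 5.0}
```

Note that `bronze` still holds all four raw events after the run, which is what makes it possible to rebuild Silver and Gold if a rule changes.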
2. Data Warehouse vs. Data Lake vs. Data Lakehouse
These three terms describe different approaches to storing and serving analytical data. Understanding their tradeoffs is fundamental to any architectural decision:
- Data Warehouse: Highly structured, schema-on-write, excellent query performance, high cost per byte stored. Best for well-understood, stable analytical queries. Examples: Snowflake, BigQuery, Redshift.
- Data Lake: Stores structured, semi-structured, and unstructured data as files, with schema-on-read, cheap storage, and poor governance by default. Best for exploratory work, ML, and storing raw events. Common problems: ungoverned "data swamps" and no ACID guarantees on plain files.
- Data Lakehouse: Open table formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID transactions, schema enforcement, and time travel to cheap object storage. Combines the cost benefits of a lake with the governance of a warehouse. It has become a widely adopted pattern for new platforms built from 2021 onwards.
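To make "ACID transactions, schema enforcement, and time travel" concrete, here is a toy sketch of the idea behind a table format's snapshot log. `ToyTable` is a hypothetical in-memory class, not the API of Delta Lake, Iceberg, or Hudi; real formats keep the log as metadata files next to the data files.

```python
class ToyTable:
    """In-memory stand-in for an open table format's snapshot log."""

    def __init__(self, schema):
        self.schema = schema   # column name -> expected Python type
        self.snapshots = [[]]  # snapshot 0 is the empty table

    def commit(self, rows):
        """Append rows atomically: either every row passes the schema or nothing is written."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match the schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"{col} must be {typ.__name__}")
        # A new snapshot is created; older snapshots stay readable.
        self.snapshots.append(self.snapshots[-1] + list(rows))
        return len(self.snapshots) - 1  # id of the new snapshot

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or an older one for time travel."""
        return self.snapshots[-1 if snapshot_id is None else snapshot_id]


table = ToyTable({"id": int, "status": str})
v1 = table.commit([{"id": 1, "status": "new"}])
v2 = table.commit([{"id": 2, "status": "paid"}])

try:
    # The second row violates the schema, so the whole batch is rejected.
    table.commit([{"id": 3, "status": "ok"}, {"id": "bad", "status": "x"}])
except ValueError:
    pass

print(len(table.read()), len(table.read(v1)))
```

The failed commit leaves no partial rows behind, and reading `v1` still returns the table as it looked after the first commit.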
3. Data Mesh — From Centralized to Domain-Oriented
Data Mesh (introduced by Zhamak Dehghani, 2019) is an organizational and architectural paradigm, not a technology. It has four core principles:
- Domain Ownership: Each business domain (Orders, Customers, Products) owns the data it produces, rather than handing raw data to a central data team.
- Data as a Product: Each domain team treats its datasets as products — with SLAs, documentation, discoverability, and consumer-facing contracts.
- Self-Serve Data Platform: A central platform team provides the tooling, infrastructure, and standards that allow domain teams to publish and consume data products independently.
- Federated Computational Governance: Global policies (security, privacy, interoperability) are enforced through automated tooling, not manual gatekeeping.
Data Mesh solves the scaling problem of central data teams: as the number of domains grows, a single team becomes a bottleneck. In a mesh, domain teams move at their own speed while adhering to platform-wide standards.
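One way to picture "Data as a Product" together with "Federated Computational Governance" is a product descriptor that every domain publishes, plus an automated check that runs on all of them. The `DataProduct` fields and the rules in `policy_violations` below are illustrative assumptions, not part of any Data Mesh standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes alongside its dataset."""
    name: str
    domain: str
    owner: str
    description: str = ""
    freshness_sla_hours: Optional[int] = None
    pii_columns: List[str] = field(default_factory=list)
    masked_columns: List[str] = field(default_factory=list)


def policy_violations(product):
    """Global rules applied automatically to every domain's product."""
    violations = []
    if not product.owner:
        violations.append("missing owner")
    if not product.description:
        violations.append("missing description")
    if product.freshness_sla_hours is None:
        violations.append("missing freshness SLA")
    unmasked = sorted(set(product.pii_columns) - set(product.masked_columns))
    if unmasked:
        violations.append("unmasked PII: " + ", ".join(unmasked))
    return violations


orders = DataProduct(
    name="orders_daily", domain="Orders", owner="orders-team",
    description="One row per order per day", freshness_sla_hours=24,
)
customers = DataProduct(
    name="customer_profile", domain="Customers", owner="crm-team",
    description="Current customer attributes", pii_columns=["email"],
)

print(policy_violations(orders))
print(policy_violations(customers))
```

In this sketch the platform team owns `policy_violations`, while each domain team owns its own descriptor, which mirrors the split of responsibilities described above.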
4. Data Governance — DAMA-DMBOK & DCAM
Governance is the set of policies, processes, and organizational roles that ensure data is accurate, available, consistent, and secure across the enterprise. The two primary frameworks are:
- DAMA-DMBOK (Data Management Body of Knowledge): The most widely used framework, covering 11 knowledge areas: Data Governance, Data Architecture, Data Modeling and Design, Data Storage and Operations, Data Security, Data Integration and Interoperability, Document and Content Management, Reference and Master Data, Data Warehousing and Business Intelligence, Metadata, and Data Quality.
- DCAM (Data Management Capability Assessment Model): An EDM Council standard focused on assessing and maturing an organization's data management capabilities. More prescriptive on the journey from ad-hoc to optimized maturity levels.
In practice, a Data Architect implements governance through: a Data Catalog (Alation, Collibra, DataHub), Data Steward assignments, Data Classification policies (PII, sensitive, public), access control frameworks, and automated lineage tracking.
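As a rough sketch of how classification policies and access control fit together, the snippet below masks columns based on a classification tag and a role's clearance. The column tags, role names, and the `apply_policy` helper are assumptions for illustration; in practice the tags would come from the data catalog and enforcement would happen in the query engine.

```python
# Hypothetical classification tags, as they might be recorded in a data catalog.
CLASSIFICATION = {
    "customer_id": "internal",
    "email": "pii",
    "country": "public",
    "salary": "sensitive",
}

# Hypothetical role-to-clearance mapping (a simple RBAC model).
ROLE_CLEARANCE = {
    "analyst": {"public", "internal"},
    "steward": {"public", "internal", "pii", "sensitive"},
}


def apply_policy(row, role):
    """Mask every column the role is not cleared to see.

    Unclassified columns are treated as sensitive, and unknown roles
    only see public data (deny by default).
    """
    allowed = ROLE_CLEARANCE.get(role, {"public"})
    return {
        col: (value if CLASSIFICATION.get(col, "sensitive") in allowed else "***")
        for col, value in row.items()
    }


row = {"customer_id": 42, "email": "a@example.com", "country": "DE", "salary": 1000, "notes": "vip"}
print(apply_policy(row, "analyst"))
print(apply_policy(row, "steward"))
```

The deny-by-default choice means a newly added column stays hidden until a steward classifies it.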
Cloud Provider Equivalents
A Data Architect must map architectural requirements to cloud services. The following table shows the major service categories across the three hyperscalers:
| Capability | AWS | Azure | Google Cloud |
|---|---|---|---|
| Object Storage | S3 | ADLS Gen2 / Blob | Cloud Storage (GCS) |
| Managed Warehouse | Redshift | Azure Synapse | BigQuery |
| Spark / Lakehouse | EMR / Glue | Azure Databricks / Synapse Spark | Dataproc / Databricks on GCP |
| Stream Ingestion | Kinesis Data Streams | Event Hubs | Pub/Sub |
| ETL / Orchestration | AWS Glue / MWAA | Azure Data Factory | Cloud Dataflow / Cloud Composer |
| Metadata / Catalog | AWS Glue Catalog | Microsoft Purview | Dataplex / Data Catalog |
| ML Platform | SageMaker | Azure ML | Vertex AI |
| Secrets / IAM | IAM + Secrets Manager | Entra ID + Key Vault | IAM + Secret Manager |
The Tool Stack, Explained
Kimball / Star Schema
Dimensional modeling approach: fact tables (events/transactions) surrounded by dimension tables (context). Optimized for BI query patterns. Gold layer tables are often star schemas.
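A minimal star schema can be tried out with Python's built-in `sqlite3` module. The table names, columns, and sample rows below are hypothetical; the point is the shape: one fact table joined to small dimension tables and grouped by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'AMER');
INSERT INTO dim_date VALUES (20250101, 2025, 1), (20250201, 2025, 2);
INSERT INTO fact_sales VALUES (1, 20250101, 100.0), (1, 20250201, 50.0), (2, 20250101, 70.0);
""")

# Typical BI query: aggregate the fact table, slice by dimension attributes.
rows = conn.execute("""
SELECT c.region, d.month, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY c.region, d.month
ORDER BY c.region, d.month
""").fetchall()

for region, month, revenue in rows:
    print(region, month, revenue)
```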
Data Vault 2.0
Modeling methodology for enterprise data warehouses. Separates business keys (Hubs), relationships (Links), and descriptive attributes (Satellites). Highly auditable and append-only.
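The Hub and Satellite split can be sketched in a few lines. This is a simplified illustration under assumed conventions (SHA-256 hash keys, a `hashdiff` over sorted attributes, string load timestamps); Links are omitted, and real Data Vault implementations define these details in their own loading standards.

```python
import hashlib


def _digest(parts):
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()


def hub_hash_key(business_key):
    """Normalize the business key so 'c-1 ' and 'C-1' map to the same Hub row."""
    return _digest([str(business_key).strip().upper()])


def hashdiff(attributes):
    """Fingerprint of the descriptive attributes, used to detect changes."""
    return _digest([f"{key}={attributes[key]}" for key in sorted(attributes)])


hub_customer = {}  # hash key -> business key record (one row per customer)
sat_customer = []  # append-only history of descriptive attributes


def load_customer(customer_id, attributes, load_ts):
    hk = hub_hash_key(customer_id)
    hub_customer.setdefault(
        hk, {"customer_id": str(customer_id).strip().upper(), "load_ts": load_ts}
    )
    diff = hashdiff(attributes)
    latest = next((s for s in reversed(sat_customer) if s["hub_key"] == hk), None)
    # Only append a Satellite row when the attributes actually changed.
    if latest is None or latest["hashdiff"] != diff:
        sat_customer.append({"hub_key": hk, "hashdiff": diff, "load_ts": load_ts, **attributes})


load_customer("c-1", {"city": "Berlin"}, "2025-01-01")
load_customer("C-1 ", {"city": "Berlin"}, "2025-01-02")  # unchanged, no new row
load_customer("c-1", {"city": "Munich"}, "2025-01-03")   # change, new Satellite row

print(len(hub_customer), [s["city"] for s in sat_customer])
```

Nothing is ever updated or deleted here, which is what gives the model its audit trail.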
dbt Semantic Layer
Defines reusable business metrics in YAML, powered by MetricFlow. Any BI tool or API that queries the semantic layer gets consistent, governed metric definitions instead of divergent SQL logic in every dashboard.
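The core idea of a semantic layer, one metric definition compiled into SQL for every consumer, can be sketched as follows. This is not the dbt or MetricFlow API or its YAML format; `METRICS` and `compile_metric` are hypothetical names, and the table and column names are made up.

```python
# Hypothetical central metric registry: each metric is defined exactly once.
METRICS = {
    "revenue": {"table": "gold.fact_sales", "expression": "SUM(amount)"},
    "order_count": {"table": "gold.fact_sales", "expression": "COUNT(DISTINCT order_id)"},
}


def compile_metric(name, group_by=()):
    """Turn a metric name plus optional dimensions into a SQL string."""
    metric = METRICS[name]
    dims = ", ".join(group_by)
    select_dims = f"{dims}, " if dims else ""
    sql = f"SELECT {select_dims}{metric['expression']} AS {name} FROM {metric['table']}"
    if dims:
        sql += f" GROUP BY {dims}"
    return sql


print(compile_metric("revenue", ["region"]))
print(compile_metric("order_count"))
```

Because every dashboard asks for `revenue` by name, changing the expression in the registry changes it everywhere at once.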
Apache Iceberg
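Open table format for huge analytic datasets. Adds ACID transactions, schema evolution, hidden partitioning with partition pruning, and time travel to files in object storage such as S3 or GCS. It is increasingly treated as a default open standard and is supported by many major engines, including Spark, Trino, and Flink.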
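Pruning is easiest to see with a toy example: if the table metadata records value ranges per data file, a query can skip files that cannot contain matching rows. The file list and the `files_to_scan` helper below are hypothetical and only similar in spirit to what a table format tracks; they are not Iceberg's metadata layout or API.

```python
# Hypothetical per-file metadata: min/max of the date column in each data file.
DATA_FILES = [
    {"path": "f1.parquet", "min_date": "2025-01-01", "max_date": "2025-01-31"},
    {"path": "f2.parquet", "min_date": "2025-02-01", "max_date": "2025-02-28"},
    {"path": "f3.parquet", "min_date": "2025-03-01", "max_date": "2025-03-31"},
]


def files_to_scan(files, start, end):
    """Keep only files whose [min_date, max_date] range overlaps [start, end].

    ISO-8601 date strings compare correctly as plain strings.
    """
    return [f["path"] for f in files if f["max_date"] >= start and f["min_date"] <= end]


# A query for mid-February to early March never opens the January file.
print(files_to_scan(DATA_FILES, "2025-02-10", "2025-03-05"))
```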
DataHub / Collibra
Data catalog and governance platforms. DataHub (open source, originally developed at LinkedIn) provides lineage, discovery, and metadata management. Collibra is an enterprise-grade commercial platform with stewardship workflows, policy management, and compliance reporting.
Terraform
Infrastructure as Code for provisioning cloud resources. Data Architects define the platform topology (warehouses, lakes, networks, IAM) in version-controlled Terraform modules, ensuring environments are reproducible.
Architect vs. Engineer: The Data Architect defines what the Bronze/Silver/Gold zones look like, what the access policies are, and which open table format to use. The Data Engineer implements the pipelines that populate them. Both roles must understand the Medallion pattern deeply, but they own different decisions within it.
Key Skills for Data Architects in 2025
- Data modeling (Kimball dimensional modeling, Inmon, Data Vault 2.0)
- Lakehouse design (Delta Lake, Iceberg, Hudi)
- Data Mesh principles and implementation patterns
- Cloud architecture on AWS, Azure, and/or GCP
- DAMA-DMBOK knowledge areas (especially governance)
- Data Catalog and lineage tooling
- dbt Semantic Layer and metric definitions
- Data Contract design and enforcement
- Access control: RBAC, ABAC, column/row-level security
- Cost modeling for cloud data platforms
- Executive communication and storytelling
- Cross-functional stakeholder alignment