Part 3 — Data Architect: The Strategist

The Modern Data Ecosystem — Part 3

The role that defines where data lives, how it flows, who owns it, and how it stays trustworthy — at the scale of an entire organization.

What a Data Architect Actually Does

If the Data Engineer builds roads, the Data Architect designs the city. A Data Architect is not writing daily pipeline code — they are defining the standards, patterns, and governance policies that make hundreds of pipelines coherent over a 3–5 year horizon.

The role requires deep technical knowledge (storage formats, cloud services, modeling paradigms) combined with the ability to influence stakeholders, lead governance committees, and communicate architectural tradeoffs to executives. It is simultaneously one of the most technical and the most political roles in a data organization.

Core Concepts

1. The Medallion Architecture (Bronze / Silver / Gold)

The Medallion Architecture is the most widely adopted pattern for organizing data in a Lakehouse. It structures data into three progressively refined layers, each serving a distinct purpose:

Layer	Also Called	Contents	Who Writes	Who Reads
Bronze	Raw / Landing	Exact copy of source data, immutable. No transformations. Stores original JSON, CSV, or CDC events.	Ingestion pipelines	Processing jobs, auditors
Silver	Conformed / Cleansed	Validated, deduplicated, standardized schema. Business rules applied. Referential integrity enforced.	dbt models, Spark jobs	Data analysts, downstream pipelines
Gold	Business / Curated	Aggregated KPIs, semantic models, denormalized fact tables, feature store for ML.	dbt models, semantic layer	BI tools, executives, ML models, APIs

The key insight is that Bronze is never modified — it is your audit trail. Silver is where quality is enforced. Gold is where business logic lives. Each layer has its own access controls, retention policies, and SLA expectations.

Medallion Architecture — Full Flow

flowchart TD
    subgraph EXT["External Source Systems"]
        CRM["CRM Systems\nSalesforce · HubSpot"]
        ERP["ERP / SAP\nTransactional DBs"]
        EVT["Event Streams\nKafka · Kinesis"]
        API["APIs & Webhooks\nThird-Party Data"]
    end

    subgraph BRONZE["Bronze Layer — Raw Zone"]
        B_RAW[("Raw Parquet Files\nExact Copy · Immutable")]
        B_CDC["CDC Events\nDebezium · Change Streams"]
        B_META["Schema Registry\nMetadata & Lineage Tags"]
    end

    subgraph SILVER["Silver Layer — Conformed Zone"]
        S_VAL["Validated & Deduplicated\nData Quality Checks Pass"]
        S_STD["Standardized Schema\nConsistent Types & Formats"]
        S_BIZ["Business Rules Applied\nReferential Integrity Enforced"]
    end

    subgraph GOLD["Gold Layer — Business Zone"]
        G_KPI["KPI Aggregations\nFact & Dimension Tables"]
        G_SEM["Semantic Layer\ndbt Metrics & Exposures"]
        G_FEAT["Feature Store\nML-Ready Datasets"]
    end

    subgraph CONSUMERS["Data Consumers"]
        BI["BI & Tableau\nDashboards"]
        DS["Data Science\n& ML Training"]
        EXEC["Executive Reports\nC-Suite Dashboards"]
        PROD["Data Products\n& APIs"]
    end

    EXT --> BRONZE
    B_RAW --> S_VAL
    B_CDC --> S_VAL
    S_VAL --> S_STD --> S_BIZ
    S_BIZ --> G_KPI
    S_BIZ --> G_SEM
    S_BIZ --> G_FEAT
    G_KPI --> BI
    G_KPI --> EXEC
    G_KPI --> PROD
    G_SEM --> BI
    G_SEM --> DS
    G_FEAT --> DS

The Medallion Architecture: data is ingested raw (Bronze), cleaned and conformed (Silver), then curated for business consumption (Gold).
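
The layer responsibilities above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the order records, the field names, and the `to_silver` / `to_gold` helpers are all hypothetical, and a real platform would run equivalent logic in Spark or dbt against lakehouse tables.

```python
from collections import defaultdict

# Hypothetical raw order events as they might land in Bronze (never modified).
BRONZE = (
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},  # duplicate
    {"order_id": "2", "customer": "bob", "amount": "4.00"},
    {"order_id": "3", "customer": "alice", "amount": "oops"},     # invalid amount
)


def to_silver(bronze_rows):
    """Validate, deduplicate, and standardize types without touching Bronze."""
    seen, silver = set(), []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine the row, not drop it
        if row["order_id"] in seen:
            continue  # keep only the first occurrence of each order
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": amount,
        })
    return silver


def to_gold(silver_rows):
    """Aggregate a business KPI: revenue per customer."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["customer"]] += row["amount"]
    return dict(revenue)


silver = to_silver(BRONZE)
gold = to_gold(silver)
```

Bronze is a tuple here to mirror its immutability: the Silver and Gold steps only ever derive new data from it, so the original records stay available for audits and reprocessing.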

2. Data Warehouse vs. Data Lake vs. Data Lakehouse

These three terms describe different approaches to storing and serving analytical data. Understanding their tradeoffs is fundamental to any architectural decision:

Data Warehouse: Highly structured, schema-on-write, excellent query performance, high cost per byte stored. Best for well-understood, stable analytical queries. Examples: Snowflake, BigQuery, Redshift.
Data Lake: Stores raw files of any shape (structured, semi-structured, or unstructured) with schema-on-read and cheap storage, but poor governance by default. Best for exploratory work, ML, and storing raw events. Typical problems: ungoverned lakes degrade into "data swamps," and plain files offer no ACID guarantees.
Data Lakehouse: Open table formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID transactions, schema enforcement, and time travel to cheap object storage. Combines the cost benefits of a lake with the governance of a warehouse. This has become a common default for new platforms built from around 2021 onwards.
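
To make "ACID transactions, schema enforcement, and time travel" more concrete, here is a toy in-memory model of the idea behind open table formats: every commit produces a new immutable snapshot, and readers can ask for an older one. The `SnapshotTable` class is purely illustrative and is not the API of Delta Lake, Iceberg, or Hudi, which track snapshots through metadata files on object storage.

```python
class SnapshotTable:
    """Toy table: every commit appends an immutable snapshot (a tuple of rows)."""

    def __init__(self, schema):
        self.schema = dict(schema)   # column name -> Python type
        self.snapshots = [()]        # version 0 is the empty table

    def commit(self, new_rows):
        """All-or-nothing append: validate every row before writing any."""
        new_rows = list(new_rows)
        for row in new_rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"{col} must be {typ.__name__}")
        self.snapshots.append(self.snapshots[-1] + tuple(new_rows))
        return len(self.snapshots) - 1   # the new version number

    def read(self, version=None):
        """Read the latest snapshot, or an older one for time travel."""
        return self.snapshots[-1 if version is None else version]


events = SnapshotTable({"id": int, "kind": str})
v1 = events.commit([{"id": 1, "kind": "click"}])
v2 = events.commit([{"id": 2, "kind": "view"}])

try:
    # The second row has the wrong type, so the whole batch is rejected.
    events.commit([{"id": 3, "kind": "view"}, {"id": "4", "kind": "view"}])
    rejected = False
except TypeError:
    rejected = True
```

Because a failed commit never creates a snapshot, readers cannot observe a half-written batch, which is the property a plain directory of files in a data lake does not give you.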

3. Data Mesh — From Centralized to Domain-Oriented

Data Mesh (introduced by Zhamak Dehghani, 2019) is an organizational and architectural paradigm, not a technology. It has four core principles:

Domain Ownership: Each business domain (Orders, Customers, Products) owns the data it produces, rather than handing raw data to a central data team.
Data as a Product: Each domain team treats its datasets as products — with SLAs, documentation, discoverability, and consumer-facing contracts.
Self-Serve Data Platform: A central platform team provides the tooling, infrastructure, and standards that allow domain teams to publish and consume data products independently.
Federated Computational Governance: Global policies (security, privacy, interoperability) are enforced through automated tooling, not manual gatekeeping.

Data Mesh solves the scaling problem of central data teams: as the number of domains grows, a single team becomes a bottleneck. In a mesh, domain teams move at their own speed while adhering to platform-wide standards.
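
As a rough sketch of "data as a product" combined with federated computational governance, the snippet below models a data product descriptor and a set of global rules that run automatically against every product. The `DataProduct` fields and the specific rules are assumptions made up for illustration; Data Mesh does not prescribe a particular schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor a domain team publishes alongside its dataset."""
    name: str
    domain: str
    owner: str                  # accountable team or person
    freshness_sla_hours: int    # consumer-facing promise
    description: str = ""
    pii_columns: frozenset = field(default_factory=frozenset)
    masked_columns: frozenset = field(default_factory=frozenset)


def policy_violations(product):
    """Global rules applied to every product by tooling, not by a review board."""
    problems = []
    if not product.owner:
        problems.append("missing owner")
    if not product.description:
        problems.append("missing description")
    if product.freshness_sla_hours <= 0:
        problems.append("freshness SLA must be positive")
    unmasked = product.pii_columns - product.masked_columns
    if unmasked:
        problems.append(f"unmasked PII columns: {sorted(unmasked)}")
    return problems


orders = DataProduct(
    name="orders_daily", domain="Orders", owner="orders-team",
    freshness_sla_hours=24, description="One row per order per day.",
)
customers = DataProduct(
    name="customer_profiles", domain="Customers", owner="",
    freshness_sla_hours=6, pii_columns=frozenset({"email"}),
)

orders_problems = policy_violations(orders)
customers_problems = policy_violations(customers)
```

In a setup like this, a check such as `policy_violations` would typically run in the platform's publishing pipeline, so a domain team gets immediate feedback instead of waiting for a central sign-off.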

4. Data Governance — DAMA-DMBOK & DCAM

Governance is the set of policies, processes, and organizational roles that ensure data is accurate, available, consistent, and secure across the enterprise. The two primary frameworks are:

DAMA-DMBOK (DAMA Data Management Body of Knowledge): The most widely used framework, covering 11 knowledge areas: Data Governance, Data Architecture, Data Modeling & Design, Data Storage & Operations, Data Security, Data Integration & Interoperability, Document & Content Management, Reference & Master Data, Data Warehousing & Business Intelligence, Metadata Management, and Data Quality.
DCAM (Data Management Capability Assessment Model): An EDM Council standard focused on assessing and maturing an organization's data management capabilities. More prescriptive on the journey from ad-hoc to optimized maturity levels.

In practice, a Data Architect implements governance through: a Data Catalog (Alation, Collibra, DataHub), Data Steward assignments, Data Classification policies (PII, sensitive, public), access control frameworks, and automated lineage tracking.
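
A small sketch of how classification policies and access control can meet in code: columns carry a classification tag, roles carry a clearance, and anything a role may not see is masked. The tag names, roles, and the `apply_policy` helper are hypothetical; in practice the tags would come from the data catalog and enforcement would happen in the warehouse or query engine.

```python
# Hypothetical classification tags, as a catalog might store them per column.
CLASSIFICATION = {
    "customer_id": "internal",
    "email": "pii",
    "country": "public",
}

# Hypothetical mapping of role to the classifications it may read in clear text.
ROLE_CLEARANCE = {
    "analyst": {"public", "internal"},
    "steward": {"public", "internal", "pii"},
}


def apply_policy(row, role):
    """Return a copy of the row with columns the role may not see masked."""
    allowed = ROLE_CLEARANCE.get(role, {"public"})   # unknown roles see public only
    masked = {}
    for column, value in row.items():
        # Unclassified columns are treated as PII: deny by default.
        level = CLASSIFICATION.get(column, "pii")
        masked[column] = value if level in allowed else "***"
    return masked


record = {"customer_id": 42, "email": "a@example.com", "country": "DE", "notes": "vip"}
analyst_view = apply_policy(record, "analyst")
steward_view = apply_policy(record, "steward")
```

The deny-by-default choice for unclassified columns is a design decision worth making explicitly: it means a newly added column stays hidden until a steward classifies it.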

Cloud Provider Equivalents

A Data Architect must map architectural requirements to cloud services. The following table shows the major service categories across the three hyperscalers:

Capability	AWS	Azure	Google Cloud
Object Storage	S3	ADLS Gen2 / Blob	Cloud Storage (GCS)
Managed Warehouse	Redshift	Azure Synapse	BigQuery
Spark / Lakehouse	EMR / Glue	Azure Databricks / Synapse Spark	Dataproc / Databricks on GCP
Stream Ingestion	Kinesis Data Streams	Event Hubs	Pub/Sub
ETL Orchestration	AWS Glue / MWAA	Azure Data Factory	Cloud Dataflow / Composer
Metadata / Catalog	AWS Glue Catalog	Microsoft Purview	Dataplex / Data Catalog
ML Platform	SageMaker	Azure ML	Vertex AI
Secrets / IAM	IAM + Secrets Manager	Entra ID + Key Vault	IAM + Secret Manager

The Tool Stack, Explained

Kimball / Star Schema

Dimensional modeling approach: fact tables (events/transactions) surrounded by dimension tables (context). Optimized for BI query patterns. Gold layer tables are often star schemas.
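
A minimal star schema can be tried out with Python's built-in `sqlite3` module. The table and column names below (`fact_sales`, `dim_customer`, `dim_date`) are illustrative assumptions, and the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        amount       REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Alice', 'EMEA'), (2, 'Bob', 'APAC');
    INSERT INTO dim_date VALUES (20250101, '2025-01-01', '2025-01'),
                                (20250201, '2025-02-01', '2025-02');
    INSERT INTO fact_sales VALUES (1, 20250101, 100.0), (1, 20250201, 50.0),
                                  (2, 20250101, 70.0);
""")

# A typical BI query: slice the fact table by attributes from two dimensions.
revenue_by_region_month = conn.execute("""
    SELECT c.region, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d     ON d.date_key = f.date_key
    GROUP BY c.region, d.month
    ORDER BY c.region, d.month
""").fetchall()
```

The fact table holds only keys and measures, while the descriptive attributes used for grouping and filtering live in the dimensions, which is what keeps BI queries short and predictable.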

Data Vault 2.0

Modeling methodology for enterprise data warehouses. Separates business keys (Hubs), relationships (Links), and descriptive attributes (Satellites). Highly auditable and append-only.
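
The Hub / Link / Satellite split can be sketched with plain dictionaries. Data Vault 2.0 implementations commonly derive keys by hashing business keys; the choice of SHA-256, the normalization rules, and the `load_order` helper here are assumptions for illustration only.

```python
import hashlib


def hash_key(*business_keys):
    """Deterministic key from one or more business keys (normalized, delimited)."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


hub_customer, hub_order, link_customer_order, sat_customer = {}, {}, {}, []


def load_order(customer_no, order_no, customer_name, load_ts):
    """Insert-only load: existing hub and link rows are never updated."""
    c_hk, o_hk = hash_key(customer_no), hash_key(order_no)
    hub_customer.setdefault(c_hk, {"customer_no": customer_no, "load_ts": load_ts})
    hub_order.setdefault(o_hk, {"order_no": order_no, "load_ts": load_ts})
    link_customer_order.setdefault(
        hash_key(customer_no, order_no), {"customer_hk": c_hk, "order_hk": o_hk}
    )
    # Satellites keep history: append a new row only when attributes change.
    history = [s for s in sat_customer if s["customer_hk"] == c_hk]
    if not history or history[-1]["name"] != customer_name:
        sat_customer.append(
            {"customer_hk": c_hk, "name": customer_name, "load_ts": load_ts}
        )


load_order("C-1", "O-100", "Alice Smith", "2025-01-01")
load_order("c-1 ", "O-101", "Alice Smith", "2025-01-02")   # same customer, new order
load_order("C-1", "O-102", "Alice Jones", "2025-01-03")    # attribute change
```

After these three loads there is one customer hub row, three order hub rows, three links, and two satellite rows for the customer, so the name change is preserved as history rather than overwritten.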

dbt Semantic Layer

Define reusable business metrics once in YAML (the dbt Semantic Layer is powered by MetricFlow). Any BI tool or API that queries the semantic layer gets the same governed metric definition instead of divergent SQL logic in every dashboard.
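
The core idea of a semantic layer, one definition compiled for every consumer, can be illustrated without dbt at all. The sketch below is a hypothetical metric registry in plain Python; it is not the dbt or MetricFlow API, and the metric names and table are made up.

```python
# Hypothetical central metric definitions; names and structure are illustrative.
METRICS = {
    "revenue": {"table": "fact_sales", "expression": "SUM(amount)"},
    "order_count": {"table": "fact_sales", "expression": "COUNT(*)"},
}


def compile_metric(name, group_by=()):
    """Render the one agreed SQL definition of a metric, optionally grouped."""
    metric = METRICS[name]   # unknown metrics raise KeyError on purpose
    dims = ", ".join(group_by)
    select = f"{dims}, " if dims else ""
    sql = f"SELECT {select}{metric['expression']} AS {name} FROM {metric['table']}"
    if dims:
        sql += f" GROUP BY {dims}"
    return sql


dashboard_sql = compile_metric("revenue", group_by=("region",))
api_sql = compile_metric("revenue")
```

Both consumers get SQL generated from the same expression, so changing how revenue is calculated means editing one entry rather than hunting through every dashboard.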

Apache Iceberg

Open table format for huge analytic datasets. Adds ACID transactions, schema evolution, partition pruning, and time travel to files in object storage such as S3 or GCS. It is increasingly treated as a default open standard and is supported by engines such as Spark, Trino, and Flink.

DataHub / Collibra

Data catalog and governance platforms. DataHub (open source, originally developed at LinkedIn) provides lineage, discovery, and metadata management. Collibra is a commercial enterprise platform with stewardship workflows, policy management, and compliance reporting.

Terraform

Infrastructure as Code for provisioning cloud resources. Data Architects define the platform topology (warehouses, lakes, networks, IAM) in version-controlled Terraform modules, ensuring environments are reproducible.

Architect vs. Engineer: The Data Architect defines what the Bronze/Silver/Gold zones look like, what the access policies are, and which open table format to use. The Data Engineer implements the pipelines that populate them. Both roles must understand the Medallion pattern deeply, but they own different decisions within it.

Key Skills for Data Architects in 2025

Data modeling (Kimball dimensional modeling, Inmon, Data Vault 2.0)
Lakehouse design (Delta Lake, Iceberg, Hudi)
Data Mesh principles and implementation patterns
Cloud architecture on AWS, Azure, and/or GCP
DAMA-DMBOK knowledge areas (especially governance)
Data Catalog and lineage tooling
dbt Semantic Layer and metric definitions
Data Contract design and enforcement
Access control: RBAC, ABAC, column/row-level security
Cost modeling for cloud data platforms
Executive communication and storytelling
Cross-functional stakeholder alignment
