The Modern Data Ecosystem — Part 3

Data Architect: The Strategist

The role that defines where data lives, how it flows, who owns it, and how it stays trustworthy — at the scale of an entire organization.

What a Data Architect Actually Does

If the Data Engineer builds roads, the Data Architect designs the city. A Data Architect is not writing daily pipeline code — they are defining the standards, patterns, and governance policies that make hundreds of pipelines coherent over a 3–5 year horizon.

The role requires deep technical knowledge (storage formats, cloud services, modeling paradigms) combined with the ability to influence stakeholders, lead governance committees, and communicate architectural tradeoffs to executives. It is simultaneously one of the most technical and the most political roles in a data organization.

Core Concepts

1. The Medallion Architecture (Bronze / Silver / Gold)

The Medallion Architecture is the most widely adopted pattern for organizing data in a Lakehouse. It structures data into three progressively refined layers, each serving a distinct purpose:

| Layer | Also Called | Contents | Who Writes | Who Reads |
| --- | --- | --- | --- | --- |
| Bronze | Raw / Landing | Exact copy of source data, immutable. No transformations. Stores original JSON, CSV, or CDC events. | Ingestion pipelines | Processing jobs, auditors |
| Silver | Conformed / Cleansed | Validated, deduplicated, standardized schema. Business rules applied. Referential integrity enforced. | dbt models, Spark jobs | Data analysts, downstream pipelines |
| Gold | Business / Curated | Aggregated KPIs, semantic models, denormalized fact tables, feature store for ML. | dbt models, semantic layer | BI tools, executives, ML models, APIs |

The key insight is that Bronze is never modified: it is your audit trail and the source for any reprocessing. Silver is where data quality and conformance rules are enforced. Gold is where business-facing logic (KPIs, metrics, aggregations) lives. Each layer has its own access controls, retention policies, and SLA expectations.
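The layer responsibilities can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the order records, the `to_silver` and `to_gold` functions, and the revenue KPI are all hypothetical, and a real platform would run the same steps in Spark or dbt.

```python
from collections import defaultdict

# Bronze: raw records exactly as received (hypothetical order events).
bronze = [
    {"order_id": "1", "customer": " Alice ", "amount": "20.50"},
    {"order_id": "1", "customer": " Alice ", "amount": "20.50"},  # duplicate delivery
    {"order_id": "2", "customer": "bob", "amount": "not-a-number"},  # invalid amount
    {"order_id": "3", "customer": "Bob", "amount": "5.00"},
]


def to_silver(raw_rows):
    """Validate, deduplicate, and standardize types without touching Bronze."""
    seen, silver_rows = set(), []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine the row instead
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # drop repeated deliveries of the same order
        seen.add(order_id)
        silver_rows.append({
            "order_id": order_id,
            "customer": row["customer"].strip().title(),
            "amount": amount,
        })
    return silver_rows


def to_gold(silver_rows):
    """Aggregate a business-facing KPI: revenue per customer."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["customer"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Note that `bronze` is only ever read. If a Silver rule turns out to be wrong, the fix is to change `to_silver` and rebuild from the untouched raw records.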

Medallion Architecture — Full Flow

flowchart TD
    subgraph EXT["External Source Systems"]
        CRM["CRM Systems\nSalesforce · HubSpot"]
        ERP["ERP / SAP\nTransactional DBs"]
        EVT["Event Streams\nKafka · Kinesis"]
        API["APIs & Webhooks\nThird-Party Data"]
    end

    subgraph BRONZE["Bronze Layer — Raw Zone"]
        B_RAW[("Raw Parquet Files\nExact Copy · Immutable")]
        B_CDC["CDC Events\nDebezium · Change Streams"]
        B_META["Schema Registry\nMetadata & Lineage Tags"]
    end

    subgraph SILVER["Silver Layer — Conformed Zone"]
        S_VAL["Validated & Deduplicated\nData Quality Checks Pass"]
        S_STD["Standardized Schema\nConsistent Types & Formats"]
        S_BIZ["Business Rules Applied\nReferential Integrity Enforced"]
    end

    subgraph GOLD["Gold Layer — Business Zone"]
        G_KPI["KPI Aggregations\nFact & Dimension Tables"]
        G_SEM["Semantic Layer\ndbt Metrics & Exposures"]
        G_FEAT["Feature Store\nML-Ready Datasets"]
    end

    subgraph CONSUMERS["Data Consumers"]
        BI["BI Tools\nTableau Dashboards"]
        DS["Data Science\n& ML Training"]
        EXEC["Executive Reports\nC-Suite Dashboards"]
        PROD["Data Products\n& APIs"]
    end

    EXT --> BRONZE
    B_RAW --> S_VAL
    B_CDC --> S_VAL
    S_VAL --> S_STD --> S_BIZ
    S_BIZ --> G_KPI
    S_BIZ --> G_SEM
    S_BIZ --> G_FEAT
    G_KPI --> BI
    G_KPI --> EXEC
    G_KPI --> PROD
    G_SEM --> BI
    G_SEM --> DS
    G_FEAT --> DS
      

The Medallion Architecture: data is ingested raw (Bronze), cleaned and conformed (Silver), then curated for business consumption (Gold).

2. Data Warehouse vs. Data Lake vs. Data Lakehouse

These three terms describe different approaches to storing and serving analytical data. Understanding their tradeoffs is fundamental to any architectural decision:

  • Data Warehouse: Highly structured, schema-on-write, excellent query performance, typically a higher cost per byte stored than object storage. Best for well-understood, stable analytical queries. Examples: Snowflake, BigQuery, Redshift.
  • Data Lake: Stores unstructured, semi-structured, and structured data as files on cheap storage, with schema-on-read. Best for exploratory work, ML, and retaining raw events. Without added governance it tends to degrade into a "data swamp", and plain files offer no ACID guarantees.
  • Data Lakehouse: Open table formats (Delta Lake, Apache Iceberg, Apache Hudi) add ACID transactions, schema enforcement, and time travel to cheap object storage, combining the cost benefits of a lake with the governance of a warehouse. It has become a common default for new platforms built since around 2021.
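The difference between schema-on-write and schema-on-read is easiest to see side by side. The sketch below is an illustration in plain Python under assumed names (`SCHEMA`, `write_with_schema`, `read_with_schema`); it does not model any particular warehouse or lake engine.

```python
import json

SCHEMA = {"user_id": int, "event": str}  # hypothetical table schema


def write_with_schema(table, record):
    """Schema-on-write: reject records that do not match before storing them."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: store anything, interpret it (and skip failures) at query time."""
    for line in raw_lines:
        try:
            payload = json.loads(line)
            yield {"user_id": int(payload["user_id"]), "event": str(payload["event"])}
        except (ValueError, KeyError, TypeError):
            continue  # bad data is only discovered when someone reads it


warehouse_table = []
write_with_schema(warehouse_table, {"user_id": 1, "event": "login"})
try:
    write_with_schema(warehouse_table, {"user_id": "2", "event": "login"})
except ValueError as err:
    rejected = str(err)  # the bad record never reaches the table

lake_files = ['{"user_id": "2", "event": "login"}', "corrupt line", '{"event": "x"}']
lake_rows = list(read_with_schema(lake_files))
```

The warehouse path fails loudly at load time, while the lake path accepts every line and silently loses two of three at read time. A lakehouse table format aims to give object storage the first behavior.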

3. Data Mesh — From Centralized to Domain-Oriented

Data Mesh (introduced by Zhamak Dehghani, 2019) is an organizational and architectural paradigm, not a technology. It has four core principles:

  • Domain Ownership: Each business domain (Orders, Customers, Products) owns the data it produces, rather than handing raw data to a central data team.
  • Data as a Product: Each domain team treats its datasets as products — with SLAs, documentation, discoverability, and consumer-facing contracts.
  • Self-Serve Data Platform: A central platform team provides the tooling, infrastructure, and standards that allow domain teams to publish and consume data products independently.
  • Federated Computational Governance: Global policies (security, privacy, interoperability) are enforced through automated tooling, not manual gatekeeping.

Data Mesh solves the scaling problem of central data teams: as the number of domains grows, a single team becomes a bottleneck. In a mesh, domain teams move at their own speed while adhering to platform-wide standards.
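One way to picture "data as a product" together with federated computational governance is a descriptor that each domain team publishes and that the platform checks automatically. The `DataProduct` fields and the three policies below are assumptions chosen for illustration, not part of any Data Mesh specification.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes with each dataset."""
    name: str
    domain: str
    owner: str
    freshness_sla_hours: int
    pii_columns: list = field(default_factory=list)
    masked_columns: list = field(default_factory=list)
    description: str = ""


def governance_violations(product):
    """Global policies checked by tooling (for example in CI), not by a review board."""
    violations = []
    if not product.owner:
        violations.append("missing owner")
    if not product.description:
        violations.append("missing description")
    unmasked = set(product.pii_columns) - set(product.masked_columns)
    if unmasked:
        violations.append(f"unmasked PII: {sorted(unmasked)}")
    return violations


orders = DataProduct(
    name="orders_daily",
    domain="Orders",
    owner="orders-team",
    freshness_sla_hours=24,
    pii_columns=["email"],
    description="Daily order facts",
)
violations = governance_violations(orders)
print(violations)
```

The domain team keeps full control over what it publishes, while the central platform only owns the `governance_violations` rules, which is the division of labor the four principles describe.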

4. Data Governance — DAMA-DMBOK & DCAM

Governance is the set of policies, processes, and organizational roles that ensure data is accurate, available, consistent, and secure across the enterprise. The two primary frameworks are:

  • DAMA-DMBOK (Data Management Body of Knowledge): The most widely used framework, covering 11 knowledge areas: Data Governance, Data Architecture, Data Modeling & Design, Data Storage & Operations, Data Security, Data Integration & Interoperability, Document & Content Management, Reference & Master Data, Data Warehousing & Business Intelligence, Metadata Management, and Data Quality.
  • DCAM (Data Management Capability Assessment Model): An EDM Council standard focused on assessing and maturing an organization's data management capabilities. More prescriptive on the journey from ad-hoc to optimized maturity levels.

In practice, a Data Architect implements governance through: a Data Catalog (Alation, Collibra, DataHub), Data Steward assignments, Data Classification policies (PII, sensitive, public), access control frameworks, and automated lineage tracking.
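As a rough sketch of how classification policies and access control fit together, the example below masks columns according to a role's clearance. The column tags, role names, and the `apply_column_policy` helper are hypothetical; in practice this logic usually lives in the warehouse or catalog rather than in application code.

```python
CLASSIFICATION = {  # hypothetical column tags, as a catalog might store them
    "customer_id": "public",
    "email": "pii",
    "salary": "sensitive",
}

ROLE_CLEARANCE = {  # hypothetical roles and the classifications they may read
    "analyst": {"public"},
    "hr_partner": {"public", "sensitive"},
    "privacy_officer": {"public", "sensitive", "pii"},
}


def apply_column_policy(row, role):
    """Return a copy of the row with columns the role may not read masked."""
    allowed = ROLE_CLEARANCE.get(role, set())  # unknown roles see nothing
    return {
        # Unclassified columns are treated as sensitive until someone tags them.
        column: value if CLASSIFICATION.get(column, "sensitive") in allowed else "***"
        for column, value in row.items()
    }


record = {"customer_id": 7, "email": "a@example.com", "salary": 50000}
analyst_view = apply_column_policy(record, "analyst")
hr_view = apply_column_policy(record, "hr_partner")
```

Defaulting untagged columns and unknown roles to the most restrictive outcome is a design choice here: it means a gap in stewardship hides data rather than leaking it.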

Cloud Provider Equivalents

A Data Architect must map architectural requirements to cloud services. The following table shows the major service categories across the three hyperscalers:

| Capability | AWS | Azure | Google Cloud |
| --- | --- | --- | --- |
| Object Storage | S3 | ADLS Gen2 / Blob | Cloud Storage (GCS) |
| Managed Warehouse | Redshift | Azure Synapse | BigQuery |
| Spark / Lakehouse | EMR / Glue | Azure Databricks / Synapse Spark | Dataproc / Databricks on GCP |
| Stream Ingestion | Kinesis Data Streams | Event Hubs | Pub/Sub |
| ETL Orchestration | AWS Glue / MWAA | Azure Data Factory | Cloud Dataflow / Composer |
| Metadata / Catalog | AWS Glue Catalog | Microsoft Purview | Dataplex / Data Catalog |
| ML Platform | SageMaker | Azure ML | Vertex AI |
| Secrets / IAM | IAM + Secrets Manager | Entra ID + Key Vault | IAM + Secret Manager |

The Tool Stack, Explained

Kimball / Star Schema

Dimensional modeling approach: fact tables (events/transactions) surrounded by dimension tables (context). Optimized for BI query patterns. Gold layer tables are often star schemas.
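A tiny star schema can be tried out with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_customer`, `dim_date`) and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions hold descriptive context.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds measurements plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Alice', 'EMEA'), (2, 'Bob', 'APAC');
    INSERT INTO dim_date VALUES (20250101, 2025, 1), (20250201, 2025, 2);
    INSERT INTO fact_sales VALUES (1, 20250101, 100.0), (1, 20250201, 50.0), (2, 20250101, 30.0);
""")

# A typical BI query: join the fact to its dimensions, then group by attributes.
rows = conn.execute("""
    SELECT c.region, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY c.region, d.month
    ORDER BY c.region, d.month
""").fetchall()
print(rows)
```

Every analytical question follows the same shape (one fact, a few single-hop joins, a `GROUP BY`), which is why BI tools handle star schemas so well.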

Data Vault 2.0

Modeling methodology for enterprise data warehouses. Separates business keys (Hubs), relationships (Links), and descriptive attributes (Satellites). Highly auditable and append-only.
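The hub and satellite idea can be sketched with hashed business keys and insert-only loads. This is a simplified illustration with assumed names (`hub_customer`, `sat_customer`, `load_customer`); it omits Links and uses SHA-256, whereas real implementations pick their own hash function and record-source conventions.

```python
import hashlib
from datetime import datetime, timezone


def digest(text):
    """Hex digest used for both hash keys and change detection."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_key(business_key):
    """Deterministic key derived from a normalized business key."""
    return digest(str(business_key).strip().upper())


hub_customer = {}  # hash key -> hub row (one per business key)
sat_customer = []  # satellite rows, insert-only history of attributes


def load_customer(customer_no, attributes, source):
    """Insert the hub row once; add a satellite row only when attributes change."""
    loaded_at = datetime.now(timezone.utc)
    hk = hash_key(customer_no)
    hub_customer.setdefault(
        hk, {"customer_no": customer_no, "source": source, "loaded_at": loaded_at}
    )
    hash_diff = digest(repr(sorted(attributes.items())))
    latest = next((row for row in reversed(sat_customer) if row["hub_key"] == hk), None)
    if latest is None or latest["hash_diff"] != hash_diff:
        sat_customer.append(
            {"hub_key": hk, "hash_diff": hash_diff, "loaded_at": loaded_at, **attributes}
        )


load_customer("C-1", {"city": "Oslo"}, "crm")
load_customer("C-1", {"city": "Oslo"}, "crm")     # unchanged: no new satellite row
load_customer("c-1 ", {"city": "Bergen"}, "erp")  # same business key, new attribute value
```

Nothing is ever updated or deleted, so the satellite doubles as a full change history, which is where the auditability comes from.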

dbt Semantic Layer

Define reusable business metrics once in YAML (the dbt Semantic Layer is powered by MetricFlow). Any BI tool or API querying the semantic layer gets the same governed metric definitions instead of divergent SQL logic in every dashboard.
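The underlying idea, one central metric definition rendered into SQL on demand, can be shown without dbt at all. The sketch below is plain Python and deliberately does not use dbt or MetricFlow syntax; `METRICS`, `ALLOWED_DIMENSIONS`, and `compile_metric` are hypothetical names.

```python
METRICS = {  # hypothetical central metric definitions
    "revenue": {"table": "fact_sales", "expression": "SUM(amount)"},
    "order_count": {"table": "fact_sales", "expression": "COUNT(*)"},
}
ALLOWED_DIMENSIONS = {"fact_sales": {"region", "order_month"}}


def compile_metric(metric, group_by=()):
    """Render one SQL query for a metric and an optional list of dimensions."""
    definition = METRICS[metric]
    table = definition["table"]
    unknown = set(group_by) - ALLOWED_DIMENSIONS[table]
    if unknown:
        raise ValueError(f"unsupported dimensions: {sorted(unknown)}")
    select = ", ".join([*group_by, f"{definition['expression']} AS {metric}"])
    sql = f"SELECT {select} FROM {table}"
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


revenue_by_region = compile_metric("revenue", ["region"])
total_orders = compile_metric("order_count")
print(revenue_by_region)
```

Because every consumer goes through `compile_metric`, changing how revenue is calculated is a one-line edit to `METRICS` rather than a hunt through dashboards.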

Apache Iceberg

Open table format for huge analytic datasets. Adds ACID transactions, schema evolution, partition pruning, and time travel to files on object storage such as S3 or GCS. It is increasingly treated as the default open standard and is supported by most major query engines.
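Time travel and atomic commits both follow from one design idea: a table is a sequence of immutable snapshots, each listing the data files that belong to it. The toy class below illustrates that idea only; it is not the Iceberg API or metadata layout, and `SnapshotTable` and the file names are invented.

```python
class SnapshotTable:
    """Toy table: each commit records an immutable tuple of data files."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, add=(), remove=()):
        """Publish a new snapshot in one step; older snapshots stay readable."""
        current = self._snapshots[-1]
        to_remove = set(remove)
        missing = to_remove - set(current)
        if missing:
            raise ValueError(f"cannot remove unknown files: {sorted(missing)}")
        new = tuple(f for f in current if f not in to_remove) + tuple(add)
        self._snapshots.append(new)
        return len(self._snapshots) - 1  # id of the snapshot just created

    def files(self, snapshot_id=None):
        """Read the latest snapshot, or 'time travel' to an earlier one."""
        return self._snapshots[-1 if snapshot_id is None else snapshot_id]


table = SnapshotTable()
s1 = table.commit(add=["part-001.parquet"])
# A rewrite (for example a compaction or an update) swaps files in one commit.
s2 = table.commit(add=["part-002.parquet"], remove=["part-001.parquet"])
latest_files = table.files()
files_at_s1 = table.files(s1)
```

Readers see either the snapshot before a commit or the one after it, never a half-written state, and querying an old snapshot id is all that time travel amounts to in this model.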

DataHub / Collibra

Data catalog and governance platforms. DataHub (open source, originally built at LinkedIn) provides lineage, discovery, and metadata management. Collibra is a commercial enterprise platform with stewardship workflows, policy management, and compliance reporting.

Terraform

Infrastructure as Code for provisioning cloud resources. Data Architects define the platform topology (warehouses, lakes, networks, IAM) in version-controlled Terraform modules, ensuring environments are reproducible.

Architect vs. Engineer: The Data Architect defines what the Bronze/Silver/Gold zones look like, what the access policies are, and which open table format to use. The Data Engineer implements the pipelines that populate them. Both roles must understand the Medallion pattern deeply, but they own different decisions within it.

Key Skills for Data Architects in 2025

  • Data modeling (Kimball dimensional modeling, Inmon normalized warehousing, Data Vault 2.0)
  • Lakehouse design (Delta Lake, Iceberg, Hudi)
  • Data Mesh principles and implementation patterns
  • Cloud architecture on AWS, Azure, and/or GCP
  • DAMA-DMBOK knowledge areas (especially governance)
  • Data Catalog and lineage tooling
  • dbt Semantic Layer and metric definitions
  • Data Contract design and enforcement
  • Access control: RBAC, ABAC, column/row-level security
  • Cost modeling for cloud data platforms
  • Executive communication and storytelling
  • Cross-functional stakeholder alignment