Published in: Data Engineering

ETL vs ELT: Making the Right Choice for Your Data Stack

Author David Park

Published on: June 23, 2023

The ETL versus ELT debate has been running for over a decade, and the conventional wisdom has shifted dramatically during that time. Five years ago, the default recommendation was almost always ETL: extract data from sources, transform it in a dedicated processing layer (typically Spark or a Python application), then load the cleaned and transformed data into your data warehouse. Today, the default recommendation has flipped to ELT: extract data, load the raw data directly into the warehouse, then transform it there using SQL.

Article Overview

ETL vs ELT: Making the Right Choice for Your Data Stack

7 sections · Reading flow

01
Understanding the Core Difference

→

02
When ETL Is the Right Choice

→

03
When ELT Is the Right Choice

→

04
The Hybrid Approach: Using Both Patterns

→

05
Cost Comparison at Different Scales

→

06
dbt: The De Facto ELT Transform Layer

→

07
Migration Strategy: ETL to ELT

HARBOR SOFTWARE · Engineering Insights

But “default” does not mean “always,” and choosing the wrong pattern for your specific situation has real, compounding consequences — wasted engineering time building infrastructure you did not need, inflated costs from processing data in the wrong layer, and brittle pipelines that break when requirements change. The right choice depends on your data volumes, transformation complexity, team capabilities, compliance requirements, and cost structure. Here is a framework for making that decision clearly.

Understanding the Core Difference

The difference between ETL and ELT is where data transformation happens. This sounds like a minor architectural detail, but it determines your tooling choices, cost structure, debugging workflow, data governance model, and the skills your team needs. It is arguably the most consequential architectural decision in a data platform.

ETL (Extract, Transform, Load) — transformation happens in an application layer before data reaches the warehouse:

# ETL pattern: data is cleaned and shaped in Python before loading

def etl_pipeline():
    # Extract: pull raw data from source system
    raw_orders = extract_from_api('/api/orders', since='2023-06-01')

    # Transform: clean, validate, enrich, aggregate in application code
    cleaned = []
    for order in raw_orders:
        cleaned.append({
            'order_id': order['id'],
            'customer_id': order['customer']['id'],
            'total_usd': convert_currency(
                order['total'], order['currency'], 'USD'
            ),
            'order_date': parse_date(order['created_at']).date(),
            'item_count': len(order['line_items']),
            'category': classify_order(order['line_items']),
            'region': lookup_region(order['shipping_address']['country'])
        })

    # Load: insert only the clean, transformed data
    load_to_warehouse('fact_orders', cleaned)

ELT (Extract, Load, Transform) — raw data is loaded first, transformation happens inside the warehouse:

# ELT pattern: raw data loaded as-is, transformed later in SQL

def elt_extract_and_load():
    raw_orders = extract_from_api('/api/orders', since='2023-06-01')
    load_to_warehouse('raw.orders', raw_orders)  # Load exactly as received

-- Transformation happens in the warehouse via SQL (dbt model)
-- models/marts/fact_orders.sql

WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

cleaned AS (
    SELECT
        id AS order_id,
        JSON_EXTRACT_PATH_TEXT(customer, 'id') AS customer_id,
        CASE currency
            WHEN 'USD' THEN total
            WHEN 'EUR' THEN total * {{ var('eur_usd_rate') }}
            WHEN 'GBP' THEN total * {{ var('gbp_usd_rate') }}
            WHEN 'PKR' THEN total * {{ var('pkr_usd_rate') }}
        END AS total_usd,
        DATE(created_at) AS order_date,
        JSON_ARRAY_LENGTH(line_items) AS item_count
    FROM source
    WHERE id IS NOT NULL
      AND created_at IS NOT NULL
)

SELECT * FROM cleaned

In ETL, transformation logic lives in application code — Python, Java, Scala, or Spark jobs. In ELT, transformation logic lives in SQL inside the data warehouse. This distinction cascades into every aspect of how you build, test, deploy, and maintain your data platform.

When ETL Is the Right Choice

ETL remains the better pattern in specific scenarios where the warehouse cannot (or should not) be the transformation layer:

1. Transformations that SQL cannot express efficiently. Machine learning inference on each record, natural language processing, image analysis, calling external APIs to enrich data during transformation, complex algorithmic computations — these require application code. If your pipeline calls an ML model to classify each record, geocodes addresses via a third-party API, or runs NLP extraction on text fields, that logic does not belong in SQL and cannot be efficiently pushed to the warehouse.

2. Significant data volume reduction before loading. If you extract 100GB of raw data daily but only need 5GB after filtering and aggregation, transforming before loading saves substantial warehouse storage and query compute costs. This matters when your warehouse bills per byte stored (all of them) and per byte scanned (Snowflake, BigQuery). At $5 per TB scanned on BigQuery, scanning 100GB instead of 5GB costs 20x more for every downstream query.

3. Compliance requirements mandate pre-load masking. PII masking, encryption, tokenization, and data redaction often need to happen before data enters the warehouse for regulatory compliance (GDPR, HIPAA, PCI-DSS). If your data governance policy or legal team prohibits storing raw PII in the warehouse — even temporarily — you must transform (mask, encrypt, or remove) before loading.

# ETL example: mandatory PII masking before warehouse loading
import hashlib

def mask_pii_fields(record: dict) -> dict:
    """Remove or mask PII fields before warehouse loading."""
    return {
        **record,
        'email': hashlib.sha256(record['email'].lower().encode()).hexdigest(),
        'full_name': None,  # Completely removed
        'phone': None,      # Completely removed
        'ip_address': '.'.join(record.get('ip', '0.0.0.0').split('.')[:2]) + '.0.0',
        'address_city': record.get('address', {}).get('city'),  # Keep city only
        'address_country': record.get('address', {}).get('country'),
    }

4. Real-time or near-real-time processing requirements. If transformed data needs to be available within seconds of the source event (not minutes or hours), a stream processor (Kafka Streams, Apache Flink, or a simple Python consumer with Redis) running ETL is the right pattern. Warehouse-based ELT transformations run on batch schedules — even with micro-batching, the latency floor is typically 1-5 minutes.

5. Constrained warehouse compute budget. Some organizations have hard warehouse cost limits set by finance. Moving compute-heavy transformations (large joins, window functions over billions of rows, complex aggregations) to cheaper infrastructure (spot EC2 instances at $0.03/hour versus on-demand warehouse compute at $2-4/credit) can be significantly more cost-effective, especially for transformations that run once and whose output is small relative to input.

When ELT Is the Right Choice

ELT has become the default recommendation for good reasons, and it is the correct pattern for the majority of analytical data pipelines in 2023:

1. Modern warehouses have enormous compute capacity. Snowflake, BigQuery, Redshift, and Databricks can process terabytes of data in minutes using massively parallel query execution. The transformation compute that required a dedicated Spark cluster five years ago now runs in a SQL query that the warehouse auto-scales. For most transformation workloads, you do not need separate processing infrastructure at all.

2. SQL is dramatically more accessible than Python or Spark. Most data analysts know SQL — it is the lingua franca of data work. Far fewer have the Python proficiency to write, debug, and maintain data transformation code in application frameworks. ELT with dbt lets analysts own and maintain the transformation layer using a language they already know well, reducing the bottleneck on data engineering capacity.

3. Raw data preservation enables unlimited iteration. When you load raw data first, you can always re-transform it with different logic. If business requirements change (new product categories, updated currency rates, different aggregation windows, new compliance rules), you re-run the transformation SQL against the preserved raw data. With ETL, the raw data was discarded after transformation — changing the logic requires re-extracting from source systems, which may not retain historical data or may rate-limit re-extraction.

-- ELT advantage: create new analyses without re-extraction
-- Original model: daily order summary
SELECT DATE(created_at) AS order_date, SUM(total) AS daily_revenue
FROM raw.orders GROUP BY 1;

-- New requirement 6 months later: weekly summary by new category scheme
-- Just write a new SQL model against the same raw data
SELECT
    DATE_TRUNC('week', created_at) AS order_week,
    new_category_v2(line_items) AS category,  -- New UDF
    SUM(total) AS weekly_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM raw.orders
GROUP BY 1, 2;

4. dbt has professionalized SQL-based transformation. dbt (Data Build Tool) provides version control, automated testing, documentation generation, dependency management, and lineage tracking for SQL transformation models. It turns ad-hoc SQL scripts into a proper software engineering practice with CI/CD pipelines, pull request reviews, and automated quality checks. The dbt ecosystem — dbt Cloud, dbt packages (like dbt-utils and dbt-expectations), and the active open-source community — has made the ELT transformation layer more productive than equivalent ETL code for most use cases.

5. Clean separation between extraction and transformation. ELT pipelines have two loosely coupled stages: extract-and-load (handled by dedicated tools like Fivetran, Airbyte, Stitch, or custom scripts) and transform (handled by dbt). Each stage is independently testable, deployable, and debuggable. ETL pipelines tend to be monolithic — the transformation logic is tightly coupled to the extraction output format, making it harder to change one without breaking the other.

The Hybrid Approach: Using Both Patterns

In production, most mature data stacks use both ETL and ELT for different data flows. The pragmatic question is not “which pattern should we use” but “which pattern should we use for which data flow.”

Our typical client architecture uses both patterns, routed by data flow characteristics:

Database sources with CDC (change data capture) or standard SaaS APIs that have Fivetran/Airbyte connectors: ELT
Data requiring PII masking before warehouse loading: ETL for the masking step, then standard ELT for all downstream transformation
Transformations requiring external API calls or ML model inference: ETL
All analytical transformations (aggregations, joins, window functions, business logic): ELT via dbt
Real-time event streams requiring sub-minute latency: ETL with a stream processor
Large-volume sources where 90%+ of data can be filtered before loading: ETL for the filter step to reduce warehouse costs

Cost Comparison at Different Scales

Cost is frequently the deciding factor, and the math is less intuitive than it seems. The comparison depends heavily on data volume, transformation complexity, and how your warehouse bills:

ETL costs include: compute for the transformation layer (EC2 instances, Kubernetes pods, or a Spark cluster — typically $500-2,000/month for daily batch jobs), storage for intermediate data, and the engineering time to build, test, debug, and maintain transformation code in Python/Java/Scala. Application code is harder to debug than SQL (stack traces in distributed Spark jobs are notoriously opaque), so engineering overhead is higher.

ELT costs include: additional warehouse storage for raw data (20-40% more than ETL since you store both raw and transformed), warehouse compute for running transformation queries (variable, depends on warehouse pricing model), and dbt Cloud if used ($100-500/month). Engineering overhead is lower because SQL is simpler and dbt provides structure, testing, and documentation out of the box.

For most teams processing under 1TB of data per day, ELT is cheaper in total cost of ownership because engineering time savings outweigh the additional warehouse compute. The warehouse handles the compute elastically — you pay for what you use, no idle infrastructure. Above 1-2TB/day, the economics shift because warehouse compute scales linearly with data volume, while ETL compute can be optimized with smarter partitioning, incremental processing, and early filtering that reduces the volume before heavy transformation.

dbt: The De Facto ELT Transform Layer

If you choose ELT for your analytical transformations, dbt is the standard tool for the transform layer. A brief practical overview for those evaluating it:

dbt models are SQL SELECT statements stored as files in a version-controlled repository. dbt compiles them into CREATE TABLE or CREATE VIEW statements, resolves dependencies between models (using the {{ ref('model_name') }} syntax), and executes them in the correct order. Each model can have associated tests that run automatically:

-- models/staging/stg_orders.sql
-- Staging layer: standardize raw data types and field names

SELECT
    id AS order_id,
    LOWER(TRIM(customer_email)) AS customer_email,
    CAST(total AS DECIMAL(10, 2)) AS total_amount,
    UPPER(currency) AS currency_code,
    CAST(created_at AS TIMESTAMP) AS created_at,
    CASE
        WHEN status IN ('completed', 'shipped', 'delivered') THEN 'fulfilled'
        WHEN status IN ('pending', 'processing') THEN 'open'
        WHEN status IN ('cancelled', 'refunded') THEN 'cancelled'
        ELSE 'unknown'
    END AS order_status
FROM {{ source('raw', 'orders') }}
WHERE id IS NOT NULL AND created_at IS NOT NULL

# models/staging/schema.yml -- tests and documentation
version: 2

models:
  - name: stg_orders
    description: "Standardized orders from the raw API data"
    columns:
      - name: order_id
        description: "Unique order identifier from the source system"
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ['fulfilled', 'open', 'cancelled', 'unknown']
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

dbt’s testing framework is what makes ELT maintainable at scale. Every model has tests that run automatically in CI/CD before deployment. A failed test blocks the deployment, preventing bad transformation logic from producing incorrect data in production tables that downstream dashboards and reports consume. This testing discipline is what transforms ad-hoc SQL scripts into a reliable, maintainable data transformation platform.

Migration Strategy: ETL to ELT

If you are currently running ETL and considering migrating to ELT, here is the approach we recommend based on multiple successful migrations:

Audit your transformations. Categorize each one as “expressible in SQL” or “requires application code.” In our experience, 70-80% of transformations fall into the SQL-expressible category.
Start with new pipelines. Build all net-new data pipelines using the ELT pattern from day one. Do not try to migrate everything simultaneously.
Migrate the simplest ETL jobs first. Pick transformations that are pure data manipulation — filtering, joining, aggregating, type casting — and rewrite them as dbt models. This builds team familiarity with dbt and demonstrates the productivity gains.
Keep complex transformations as ETL indefinitely. Jobs that call APIs, run ML models, or require PII handling can remain ETL forever. There is no rule that everything must be one pattern. A clean interface between ETL (pre-load) and ELT (post-load) is perfectly maintainable.
Establish raw data landing zones. Set up dedicated schemas in your warehouse for raw data landing (e.g., raw_api, raw_database, raw_files). Configure extraction tools to load into these schemas without any pre-transformation. This is the foundation that enables future ELT models.

Conclusion

The ETL versus ELT decision is not about which pattern is inherently superior — it is about matching the pattern to the specific constraints and requirements of each data flow. ELT is the right default for analytical data pipelines because modern warehouses are absurdly powerful, SQL is broadly accessible, raw data preservation enables iteration, and dbt provides the engineering rigor that transformation code needs. ETL remains essential for use cases that require pre-load processing: PII compliance, ML inference, external API enrichment, real-time streaming, and significant volume reduction.

Most production data stacks end up hybrid. Accept that from the beginning, design clean interfaces between the patterns, and choose the right one for each data flow based on the decision criteria above. The worst outcome is not picking the suboptimal pattern for a specific pipeline — it is spending six months debating the philosophical question of ETL versus ELT instead of building data pipelines that deliver business value.

One practical observation from our work across multiple data platforms: the teams that succeed with data infrastructure are the ones that make reversible decisions quickly and iterate based on real production experience. If you choose ELT for a pipeline and discover six months later that the warehouse transformation costs are unsustainable, you can move that specific transformation to an ETL pattern without rebuilding everything. If you choose ETL and later realize the transformation could be simpler SQL in the warehouse, migration to ELT via dbt is straightforward because the raw data (which you should be storing regardless of pattern) is available for reprocessing.

The data engineering landscape will continue evolving — new warehouse capabilities, new transformation tools, new cost models. The architecture that serves you best is not the one that makes the theoretically optimal choice today, but the one that preserves flexibility to adapt as your understanding of the problem deepens and as the technology landscape shifts beneath you. Build for learning, not for permanence.

If you are starting a new data platform today with no legacy constraints, start with ELT. Set up Fivetran or Airbyte for extraction and loading, dbt for transformation, and your warehouse of choice (Snowflake for most teams, BigQuery if you are a Google Cloud shop, Redshift if you are deep in the AWS ecosystem). Load everything raw, transform in SQL, and add ETL components only for the specific data flows that genuinely require pre-load processing. This approach gets you to production fastest, gives your team the most flexibility, and keeps your architecture simple until you have enough production experience to know where complexity is actually needed.

ETL vs ELT: Making the Right Choice for Your Data Stack

Understanding the Core Difference

When ETL Is the Right Choice

When ELT Is the Right Choice

The Hybrid Approach: Using Both Patterns

Cost Comparison at Different Scales

dbt: The De Facto ELT Transform Layer

Migration Strategy: ETL to ELT

Conclusion

You may also like

Leave a comment Cancel reply