Achieving Blazing Fast ETL with Incremental Data Loading in Azure Data Factory

The Sunrise Post
4 min read · May 18, 2024


Implementing incremental data loading in Azure Data Factory pipelines is a powerful way to optimize ETL performance.

An incremental load first identifies the changed data to extract, then uses lookup activities and filtering logic to load only the updated or new records into the destination.

This saves significant time as you avoid processing unchanged historic records. It also reduces costs as less data traverses the pipeline.

If you currently suffer from slow, bloated ETL processes, configuring incremental loads in Azure Data Factory is the solution.

This lets you re-architect table loading to be faster, cheaper, and more agile. You get all the speed and efficiency gains without having to overhaul entire pipelines.

Overview of Incremental Load Patterns

Incremental loading relies on logic to partition data files or database tables so that only the latest information is extracted. Azure Data Factory has native support for various incremental loading strategies:

  • Watermark column: Append a column such as LastModifiedDate to track rows, then process only rows where the watermark value is greater than the one recorded at the last successful run.
  • Change Data Capture: Use CDC logs to stream changed rows.
  • Partitioned tables: Load the newest partition based on date, ID range, etc.

Combine these with lookup activities, filters, and alter row transformations to achieve blazing-fast incremental loading.

As one example, you could identify the maximum primary key in a destination table. Then configure the source query to only return records greater than that value.
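
A minimal T-SQL sketch of that pattern, assuming hypothetical dbo.Orders (source) and dbo.Orders_DW (destination) tables with an integer OrderID primary key; in Data Factory the two queries would run as separate activities (a Lookup, then a Copy):

    -- Run against the destination: find the highest key already loaded;
    -- ISNULL covers the very first run against an empty table
    DECLARE @MaxLoadedID INT;
    SELECT @MaxLoadedID = ISNULL(MAX(OrderID), 0) FROM dbo.Orders_DW;

    -- Source query: extract only rows newer than the last load
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM dbo.Orders
    WHERE OrderID > @MaxLoadedID;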

Pros and Cons of Incremental Loading

Compared to repeatedly processing full loads, incremental data loading provides:

Benefits

  • Faster pipeline runs and lower data processing costs
  • Only extract and load changed data
  • Achieve near real-time ETL
  • Simplify pipelines with lookups instead of full table scans

Considerations

  • Additional logic to manage partitions and checkpoints
  • Reconstruction of historic data is slower
  • Can increase schema complexity long-term

Overall, incremental loading is well suited to optimizing ETL cycles for business analytics, data warehouses, and other common workloads. The faster processing and lower costs outweigh the small extra effort of configuring partitioning.

Step-by-Step Implementation Guide

Follow these steps to implement a basic incremental load pattern in Azure Data Factory:

1. Add a Watermark Column to the Source Table

  • Identify source table to be incrementally loaded
  • Alter table to add new LastModifiedDate column
  • Default to current date/time and update on row changes

This provides a marker for changed rows.
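
A T-SQL sketch of this change, again using the hypothetical dbo.Orders table (in practice an UPDATE trigger or application logic must keep the column current on row changes):

    -- Add the watermark column, defaulting to the current UTC time;
    -- existing rows are backfilled with the default
    ALTER TABLE dbo.Orders
    ADD LastModifiedDate DATETIME2 NOT NULL
        CONSTRAINT DF_Orders_LastModifiedDate DEFAULT SYSUTCDATETIME();

    -- Application code (or a trigger) must refresh it on every change, e.g.:
    UPDATE dbo.Orders
    SET Amount = 125.00,
        LastModifiedDate = SYSUTCDATETIME()
    WHERE OrderID = 42;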

2. Configure the Pipeline Source Dataset

  • When adding a source dataset, enable partitioning
  • Schedule trigger for pipeline to run every 24 hours
  • In the source query, filter rows where:

LastModifiedDate > @{activity('LookupNewWatermarkValue').output.firstRow.NewHighWatermarkValue}

AND

LastModifiedDate <= @{pipeline().TriggerTime}

This extracts only rows modified since the watermark recorded at the last run, bounded by the current run's trigger time so rows arriving mid-run are picked up by the next execution rather than skipped.
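
Expressed as the T-SQL the source dataset would execute, with the two boundary values supplied by the pipeline (the declarations and values below are illustrative stand-ins for the injected expressions):

    -- Values the pipeline supplies at runtime
    DECLARE @LastWatermark DATETIME2 = '2024-05-17T00:00:00';  -- from the Lookup
    DECLARE @RunStartTime  DATETIME2 = SYSUTCDATETIME();       -- trigger time

    SELECT OrderID, CustomerID, OrderDate, Amount, LastModifiedDate
    FROM dbo.Orders
    WHERE LastModifiedDate > @LastWatermark
      AND LastModifiedDate <= @RunStartTime;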

3. Lookup Activity for Current High Watermark

  • Add Lookup activity before source extraction
  • Query destination table to select MAX(LastModifiedDate) AS NewHighWatermarkValue
  • Reference the output in downstream activities (or capture it in a pipeline variable)

Saves the current greatest watermark value.
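
The lookup query itself can be a one-liner against the destination, with a fallback so the very first run on an empty table pulls everything (sketch, using the hypothetical dbo.Orders_DW destination):

    -- Highest watermark already present in the destination;
    -- the fallback date forces a full initial load on an empty table
    SELECT ISNULL(MAX(LastModifiedDate), '1900-01-01') AS NewHighWatermarkValue
    FROM dbo.Orders_DW;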

4. Alter Row Transformation

  • In the mapping data flow, add a Derived Column transformation paired with an Alter Row transformation to mark rows for upsert
  • Add a new column MatchingWatermarkValue set to @{activity('LookupNewWatermarkValue').output.firstRow.NewHighWatermarkValue}

Flags new rows matching incremental batch.

5. Configure Sink Table Upsert

  • In the sink dataset, set the write behavior to Upsert
  • Define the key columns on the table's primary or natural key

On write, existing rows are matched on the key and updated, while new arrivals are inserted.
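
Under the hood the upsert behaves like a T-SQL MERGE keyed on the primary key; a sketch against the hypothetical destination, where #IncrementalBatch stands in for the staged incremental rows:

    -- Upsert semantics: update matched rows, insert new arrivals
    MERGE dbo.Orders_DW AS target
    USING #IncrementalBatch AS source
        ON target.OrderID = source.OrderID
    WHEN MATCHED THEN
        UPDATE SET target.CustomerID = source.CustomerID,
                   target.OrderDate = source.OrderDate,
                   target.Amount = source.Amount,
                   target.LastModifiedDate = source.LastModifiedDate
    WHEN NOT MATCHED THEN
        INSERT (OrderID, CustomerID, OrderDate, Amount, LastModifiedDate)
        VALUES (source.OrderID, source.CustomerID, source.OrderDate,
                source.Amount, source.LastModifiedDate);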

This provides a framework for fast, efficient incremental loading using a watermark column in Azure Data Factory.

The key steps are filtering the source rows, identifying the latest watermark, flagging changed data, and handling upserts to the destination.

Best Practices for Production Pipelines

When implementing incremental loading for critical ETL workloads, also consider:

  • Audit trails — Log high watermark values for compliance (see the sketch after this list).
  • Change data capture — Ingest from transactional systems using CDC.
  • Rebuilds — Reconstruct target table on schedule.
  • Unique keys — Define on natural keys or add GUIDs.
  • Re-processing — Handle pipeline retries and errors.
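
For the audit-trail point, one common approach is a small control table that records each run's watermark; a minimal sketch in T-SQL (table, column, and parameter names are illustrative):

    -- Control table recording the watermark used by each pipeline run
    CREATE TABLE dbo.WatermarkAudit (
        PipelineName   NVARCHAR(100) NOT NULL,
        RunId          NVARCHAR(100) NOT NULL,
        HighWatermark  DATETIME2     NOT NULL,
        LoggedAtUtc    DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Written by a stored-procedure activity at the end of a successful run;
    -- the values below stand in for pipeline-supplied parameters
    DECLARE @PipelineName NVARCHAR(100) = 'IncrementalOrdersLoad';
    DECLARE @RunId NVARCHAR(100) = 'run-001';
    DECLARE @NewHighWatermarkValue DATETIME2 = '2024-05-18T00:00:00';

    INSERT INTO dbo.WatermarkAudit (PipelineName, RunId, HighWatermark)
    VALUES (@PipelineName, @RunId, @NewHighWatermarkValue);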

Be sure to monitor performance over time and tune query filters and partitions as needed.

Frequently Asked Questions

What are the limitations of incremental loading in Data Factory?

Data Factory version 2 has strong built-in support for incremental ETL. However, the initial backfill of historic data can still require full table scans, so plan your staging strategy accordingly.

When should I add partition columns instead of watermarks?

Partition columns make sense when there is a clear separation of old vs. new data built into primary keys or timestamps. This avoids altering existing schemas. Watermark-style loading adds complexity but gives you more flexibility.

Can I automate detecting changed rows instead of CDC or watermarks?

Yes. For some data sources you can compute a hash over the complete row values (for example with SQL Server's HashBytes) and compare it against the hash stored at the last load to locate changed rows without CDC or watermarks, as sketched below. The key is identifying an efficient method to filter down to newly arrived or changed rows.
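
A T-SQL sketch of that hash-comparison idea, assuming the destination stores a RowHash column (VARBINARY) captured at load time; the column list and algorithm are illustrative:

    -- Compute a row hash over the business columns and compare it with the
    -- hash stored at last load; rows whose hashes differ have changed.
    -- (Brand-new rows with no destination match would be caught separately,
    -- e.g. with a NOT EXISTS check.)
    SELECT s.OrderID
    FROM dbo.Orders AS s
    JOIN dbo.Orders_DW AS d
        ON d.OrderID = s.OrderID
    WHERE HASHBYTES('SHA2_256',
            CONCAT_WS('|', s.CustomerID, s.OrderDate, s.Amount))
          <> d.RowHash;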

Recap and Next Steps

Incremental data loading is a crucial ETL optimization. By only processing changed data, you slash pipeline duration while avoiding unnecessary data transfer costs.

Start by identifying the ad hoc batches and oversized full-load ETL jobs that hurt performance and cost efficiency.

Implement incremental loading with watermarks or change data capture to re-architect table loading as an automated, near real-time process.
