The Azure Data Factory Data Flow Join and Lookup transformations both provide controls for aggregating data across multiple data sources.
Join Transformation
There are two AzureSQL tables from previous posts: Buildings and Maintenance_Logs. The latter has a foreign key to Buildings, Building_Id. For example:
The Join transformation gives us controls to combine these datasets using classic SQL Join options: left-outer join, left-inner join, right-outer join, full-outer, and a ‘custom’ option as well.
This is visualized below with a left-inner between Buildings and Logs: This basically gets all building and log data for every record across the two tables where a matching Building_id is found. Something like the following SQL query: SELECT * FROM Building INNER JOIN Maintenance_Logs ON Maintenance_Logs.Building_Id = Building.Building_Id.
Notice the “aggregateSum…” transformation in the screenshot above. This is included to provide a sum of the ‘Cost’ of all matching maintenance logs for each building. I have a separate post on aggregates at https://blog.westmorr.com/2024/01/09/azure-data-factory-data-flow-aggregate-and-sort/.
The results of the ADF data flow Join transformation basically match the SQL INNER JOIN statement referenced above. The only difference is order of columns in the output.
Lookup Transformation
The ADF data flow Lookup transformation behaves similarly to the Join transformation configured for a “Left Outer” join and/or a SQL LEFT OUTER JOIN query. However there are a lot of great features to utilize for adjusting the default behavior. Read the documentation here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-lookup.
In the sample database, there are no maintenance logs for building ‘B004’. So, I will use the Lookup Transformation to get the latest log entry for every building. And, if there isn’t a match, then NULL values will appear for the right source (logs).
Notice the options selected in the screenshot below:
- Match multiple rows: Unchecked
- Match on: Last Row
- Sort conditions: Sort by the Maintenance_Log_Id which is an auto increment primary key.
The results are as expected:
Conclusion
The Azure Data Factory Data Flow Join and Lookup transformations both provide controls for aggregating data across multiple data. There are multiple options for both and can be understood to further intuit which transformation is best for a given scenario.
Categories: Azure Data Factory





Leave a Reply