Microsoft Fabric: Data Flow Gen 2, First Thoughts.

 

A graphical representation of a data flow from Microsoft Fabric
Data Flow Gen 2 Diagram View


There is a lot of buzz around Fabric as it comes into General Availability and, while Microsoft maintain that they are still committed to Azure Synapse Analytics, their other big data engineering platform, the wind of change feels headed in Fabric's direction. See this excerpt from Microsoft Fabric, explained for existing Synapse users:

Microsoft has no current plans to retire Azure Synapse Analytics. Customers can continue to deploy, operate, and expand the PaaS offering of Azure Synapse Analytics. Rest assured, should these plans change, Microsoft will provide you with advanced notice and will adhere to the support commitments in our Modern Lifecycle Policy in order to ensure our customers’ needs are met


There is a clear sense from that documentation that people are expected to move from Azure Synapse to Microsoft Fabric. Indeed, there is a whole section on Microsoft's Learn site for Migration. Missing from this section is Mapping Data Flow, which seems generally absent from Fabric and is a notable omission. Mapping Data Flows in Azure Data Factory and Azure Synapse are powerful RAD components for engineering data: quick to learn and a favourite in organisations where Python experts might be thin on the ground. That said, Microsoft have produced a paper to help those developers make the transition to Dataflow Gen2 from Mapping Data Flow. This is the starting point for this discursive blog.

A list of pages from learn.microsoft.com on how to migrate components from Synapse to Fabric
learn.microsoft.com Migration White Paper

Bridging the Gap: Comparing Functions

The document compares components from Mapping Data Flow using the sections found in the Mapping Data Flow transformation overview.

  • Multiple inputs/outputs 
  • Schema modifier 
  • Formatters 
  • Row modifier 
  • Flowlets 
  • Destination

Let's take these in turn.

Multiple inputs/outputs

All of the transformations except Exists are included, and Exists is also missing from the list of unsupported functions. This is not especially problematic, as Exists simply filters the data to rows which can (or cannot) be found in the joining data stream. Leveraging joins will produce the same effect, although the transformations needed are more complex. This is true elsewhere and symptomatic of the shift from Mapping Data Flow to Dataflow Gen2.
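As a rough illustration, a merge (nested join) can stand in for Exists. A minimal sketch, using made-up Orders and Customers tables keyed on a hypothetical CustomerId column:

let
    // Made-up data: Orders is the main stream, Customers is the stream to check against
    Orders = #table(type table [OrderId = number, CustomerId = number], {{1, 10}, {2, 99}}),
    Customers = #table(type table [CustomerId = number], {{10}}),
    // Exists: keep only orders whose CustomerId appears in Customers (inner join; matches stay nested, so no row duplication)
    ExistsRows = Table.RemoveColumns(
        Table.NestedJoin(Orders, {"CustomerId"}, Customers, {"CustomerId"}, "Match", JoinKind.Inner),
        {"Match"}),
    // Not Exists: keep only orders with no matching customer (left anti join)
    NotExistsRows = Table.RemoveColumns(
        Table.NestedJoin(Orders, {"CustomerId"}, Customers, {"CustomerId"}, "Match", JoinKind.LeftAnti),
        {"Match"})
in
    ExistsRows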

New Branch
Screenshot from Microsoft's documentation comparing the New Branch Transform to the Reference Transform
New Branch and The Reference Transform as Analog

This is essentially true, but there is a missing element that is key to using references in Power Query for Dataflow Gen2. In Synapse, once a New Branch is added, any subsequent transformations are applied only to the branch they relate to. Not so for Reference: a reference is taken from the query, not from a specific transformation within the query. This means that any new transformations added to the query will affect anything referenced from it. Removing a column? Watch out, it may be needed in a referenced query. So to truly replicate what New Branch does, we must create two references which flow independently from our base query, as sketched below.
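A sketch of that two-reference pattern, assuming a hypothetical base query called BaseSales with a Cost column; each branch is a separate query in the dataflow whose first step is the reference itself:

// Query: BranchA (first reference to BaseSales)
let
    Source = BaseSales,
    // Dropping Cost here only affects BranchA
    RemovedCost = Table.RemoveColumns(Source, {"Cost"})
in
    RemovedCost

// Query: BranchB (second reference, unaffected by steps added in BranchA)
let
    Source = BaseSales,
    HighValue = Table.SelectRows(Source, each [Cost] > 100)
in
    HighValue

// Removing Cost inside BaseSales itself, by contrast, would break BranchB.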
Conditional Split

Screenshot from Microsoft's documentation comparing the Conditional Split Transform to the Reference Transform
Conditional Split and The Reference Transform as Analog

 

Our old friend rears its head again! Conditional Split is a further branching type and has the same drawbacks as New Branch. Similarly to Exists, the point of the branched streams is that they filter data into each branch where it meets one or all conditions.

  • In the case of First matching condition you may branch multiple times, safe in the knowledge that all data in that branch adheres to the rule. Any row not meeting any rule will travel the default path. 
  • In the case of All matching conditions, data will travel one path only where it conforms to every rule, and the default path if any condition is not met.

To replicate this, each new referenced query will require the data to be filtered. In the case of the default branch this may require complex advanced filter configurations, as sketched below.
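A sketch of how those filters might look, assuming a hypothetical referenced query BaseOrders being split on a Status column; each branch is its own referenced query:

// Query: SplitShipped (condition: Status = "Shipped")
let
    Source = BaseOrders,
    Shipped = Table.SelectRows(Source, each [Status] = "Shipped")
in
    Shipped

// Query: SplitDefault (default branch: rows that match none of the conditions)
let
    Source = BaseOrders,
    Default = Table.SelectRows(Source, each [Status] <> "Shipped" and [Status] <> "Cancelled")
in
    Default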

 

Schema modifier

All of the schema modifiers are present in the documentation, which is to be expected. There are a couple of pieces of functionality that are worth discussing.

Column pattern

The column pattern allows for transformations to be applied to multiple columns whenever they match the logic supplied. More details can be found in the documentation. A common pattern I employ is to handle NULL values in data. 


 

This pattern looks for all number fields and then checks their value to return zero wherever a NULL value is found. The replace values transform appears to be a reasonable analog.

Table.ReplaceValue(#"Replaced value", null, false, Replacer.ReplaceValue, {"Column10", "Column14"})

The biggest difference here is that specific columns must be included whereas the column pattern is more schema agnostic.
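That gap can be narrowed in M by deriving the column list from the table schema rather than hard-coding it. A minimal sketch, with made-up data standing in for an earlier step:

let
    // Made-up data: Amount and Quantity are numeric, Product is text
    Source = #table(
        type table [Product = text, Amount = nullable number, Quantity = nullable number],
        {{"A", 10, null}, {"B", null, 5}}),
    // Find every column whose kind is number, mirroring the column pattern's type check
    NumberColumns = Table.SelectRows(Table.Schema(Source), each [Kind] = "number")[Name],
    // Replace NULL with zero across all of those columns at once
    NoNulls = Table.ReplaceValue(Source, null, 0, Replacer.ReplaceValue, NumberColumns)
in
    NoNulls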

Select

There are a couple of things that the Select does which I don't believe are possible in Dataflow Gen2. The first is the ability to reference a column by its name, which is useful for mapping drifted columns or mapping columns which may or may not exist. The second, perhaps more useful, is to select columns by Rule Based Mapping.

Rule Based mapping with a true match condition and a concatenated name expression
Select All Example

In the above example we select every column simply by using true as a match condition. We can then prepend each column name with the value "incoming_". This is fast but also allows for schema-agnostic handling of data.
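A rough equivalent in Dataflow Gen2 is to rename every column with Table.TransformColumnNames. A minimal sketch, with made-up data standing in for an earlier step:

let
    // Made-up data standing in for an earlier step
    Source = #table(type table [Id = number, Name = text], {{1, "Widget"}}),
    // Prepend "incoming_" to every column name, whatever the schema happens to be
    Prefixed = Table.TransformColumnNames(Source, each "incoming_" & _)
in
    Prefixed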
A rule based mapping using names for matching and replace to remove text found in the name.
Select base on Name example


Another example selects columns based on their name and alters the name of every matching column. A use case would be to remove joined columns from matching logic.
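The same idea can be approximated in M by transforming only the names that match a pattern. A sketch, assuming joined columns arrive with a hypothetical "right_" prefix:

let
    // Made-up data: the right_ columns are assumed to have come from a join
    Source = #table(
        type table [OrderId = number, right_CustomerId = number, right_CustomerName = text],
        {{1, 10, "Widget Co"}}),
    // Strip the prefix only from column names that carry it; other names pass through unchanged
    Renamed = Table.TransformColumnNames(
        Source,
        each if Text.StartsWith(_, "right_") then Text.Replace(_, "right_", "") else _)
in
    Renamed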

Formatters

Flatten and Parse are both accounted for, and the JSON examples cover a common use case. Stringify is not currently present and is listed in the exceptions.
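For reference, the Power Query side of that JSON use case looks roughly like the following sketch, which parses and flattens a made-up JSON text column:

let
    // Made-up data: a single text column holding JSON
    Source = #table(type table [Payload = text], {{"{""id"":1,""tags"":[""a"",""b""]}"}}),
    // Parse: turn the JSON text into record values
    Parsed = Table.TransformColumns(Source, {{"Payload", Json.Document}}),
    // Flatten: expand the record, then the nested list, into ordinary columns and rows
    Expanded = Table.ExpandRecordColumn(Parsed, "Payload", {"id", "tags"}),
    Flattened = Table.ExpandListColumn(Expanded, "tags")
in
    Flattened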

Row Modifier

Filter and Sort are addressed; however, Alter Row is not included, nor is Assert.

Alter Row

Alter Row is a transform that applies insert, delete, update, and upsert policies to rows. These are held as metadata and define what happens to the row when it is written to the sink. Given the constraints of destinations, it makes no sense to have an analog, as the only policy available is insert. More on this later.

Assert

The Assert transform allows you to build a set of logical checks on the data. These rules can be used to either:
  • Fail the data flow
  • Apply an assert rule to rows
Rules can be applied to specific values, so they can catch incomplete data. Rules can check for uniqueness to ensure there is no duplication. Rules can check for the existence of values in other streams to ensure referential integrity. The outcome of rules can be added as additional columns in the data, either for any error or for specific errors based on a customisable Assert ID (think error code).

To replicate this would require significant effort and may not even be possible.
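For simple single-query rules a partial approximation is possible. A sketch, assuming hypothetical Id and Amount columns, covering both the fail-the-dataflow and flag-the-row behaviours:

let
    // Made-up data standing in for an earlier step
    Source = #table(type table [Id = number, Amount = nullable number], {{1, 10}, {2, null}}),
    // "Fail the data flow" style: raise an error if the uniqueness rule is broken
    DuplicateCount = Table.RowCount(Source) - Table.RowCount(Table.Distinct(Source, {"Id"})),
    Checked = if DuplicateCount > 0 then error "Assert failed: duplicate Id values" else Source,
    // "Apply an assert rule" style: record the outcome as an extra column, akin to an Assert ID
    Flagged = Table.AddColumn(Checked, "assert_amount_not_null", each [Amount] <> null, type logical)
in
    Flagged

Cross-stream referential checks, however, would need merges against each reference stream, which is where the effort mounts.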

Flowlet

A flowlet is a block of predeveloped transformations which can be applied in a Mapping Data flow. The documentation suggests the use of custom functions and supplies this link to how custom functions work.

Documentation for Flowlet and Custom Function as its replacement

There is a step in the documentation that does not function in Fabric.
Image from documentation to select file type

This step allows the application of a parameter to determine which source to use in the function. Without this it is not possible to replicate further steps against a later-specified source. Further to this, flowlets are stored in Synapse as part of a library of code, so they can be called from multiple different Mapping Data Flows. There appears to be no similar capability for custom functions.
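For completeness, a custom function in Dataflow Gen2 is simply a query that evaluates to a function. A minimal sketch of shared cleanup steps (hypothetical column names) that other queries in the same dataflow could invoke:

// Query: fnCleanup (evaluates to a function value)
(inputTable as table) as table =>
let
    // Shared steps a flowlet might have carried: drop incomplete rows, standardise a column name
    RemovedNulls = Table.SelectRows(inputTable, each [Id] <> null),
    Renamed = Table.RenameColumns(RemovedNulls, {{"Id", "RecordId"}})
in
    Renamed

// Invoked from another query in the same dataflow, e.g. Source = fnCleanup(RawSales)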

Data Flow Use Cases

Mapping Data Flow can be used in multiple settings, and in the Medallion pattern it can be used at each stage.
Medallion Lake House using Synapse
While this example suggests using Spark pools, it would be similarly possible to use Mapping Data Flow. Ingesting data is perhaps simpler in Copy Data pipelines, especially for on-premises data, but transformation is definitely simplified by using Mapping Data Flow.

Data Flow Gen 2

Dataflow Gen2 Use Cases - Ingest and Transform Identified by Microsoft



Microsoft suggests using pipelines and dataflows for ingestion, and notebooks and dataflows for transformation. Whether this holds true depends on your expectations of your transformation stages.

Ingestion

I am assuming that Ingestion will produce raw data for the medallion pattern. In this case I would expect data to be either:
  1. Data that grows over time containing all the records ever processed
  2. Data that contains only a set of records that are to be further processed then archived or disposed of.
In both of these cases I think Dataflows perform well. 
Destination Methods
Destinations either Append or Replace the data, meeting Case 1 (append) or Case 2 (replace). There is no sense of updating the destination data, which perhaps explains the absence of the Alter Row transformation.

At this time, Dataflow Gen2 is the only way to get data from on-premises data sources such as SQL. Given this, today ingestion using Dataflows is a must.

Transform

If we wish to enrich or curate our data then we would be looking for data to be stored in a way that allows us to update existing data. Consider a type 2 Slowly Changing Dimension: in the SCD2 pattern we would want to update existing rows to set them as not active or to apply an end date. To achieve this currently, we would need to ingest all new data, merge it with all existing data, handle changes, and then replace the whole table, as sketched below. Given that data in Fabric's lake is Delta by default, this makes little sense. Right now, I would not use Dataflow Gen2 for transformations that require updates.
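To illustrate that workaround, a heavily simplified sketch of the replace-the-whole-table approach, assuming hypothetical Existing and Incoming data keyed on CustomerId with an IsActive flag:

let
    // Hypothetical stand-ins for the existing dimension and the newly ingested rows
    Existing = #table(type table [CustomerId = number, City = text, IsActive = logical],
        {{1, "Leeds", true}, {2, "York", true}}),
    Incoming = #table(type table [CustomerId = number, City = text], {{1, "Bradford"}}),
    IncomingKeys = List.Buffer(Incoming[CustomerId]),
    // Expire current rows whose key arrives again in the incoming batch
    Expired = Table.TransformColumns(
        Table.SelectRows(Existing, each List.Contains(IncomingKeys, [CustomerId])),
        {{"IsActive", each false, type logical}}),
    Unchanged = Table.SelectRows(Existing, each not List.Contains(IncomingKeys, [CustomerId])),
    // New versions arrive flagged as active
    NewRows = Table.AddColumn(Incoming, "IsActive", each true, type logical),
    // The rebuilt table would then be written with a Replace destination
    Result = Table.Combine({Unchanged, Expired, NewRows})
in
    Result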

Conclusion.

Dataflows Gen2 provide a comfortable experience for those who have experience with Power BI, and they do come with a lot of inbuilt data transformations. If I compare them to Mapping Data Flow, I do not find them to be as flexible nor as powerful. A big drawback is debugging limitations, which make it easier to find a specific error in Mapping Data Flow than in Dataflow Gen2. Microsoft's decision guide makes a lot of claims for Dataflows and rightly highlights the link between them and Power Query users.

Mary is an experienced Power Query user, and the data volume is in the low to medium range to achieve desired performance. Dataflows provide no-code or low-code interfaces for ingesting data from hundreds of data sources. 


In my investigations I don't find the choices they propose to be borne out. There is hope, however, that to meet these claims we will see a lot of new features coming to Dataflows soon.
