Microsoft Fabric: Data Flow Gen 2, First Thoughts.
There is a lot of buzz around Fabric as it comes into General Availability, and while Microsoft maintain that they are still committed to Azure Synapse Analytics, their other big data engineering platform, the wind of change feels headed in Fabric's direction. See this from Microsoft Fabric, explained for existing Synapse users:
Microsoft has no current plans to retire Azure Synapse Analytics. Customers can continue to deploy, operate, and expand the PaaS offering of Azure Synapse Analytics. Rest assured, should these plans change, Microsoft will provide you with advanced notice and will adhere to the support commitments in our Modern Lifecycle Policy in order to ensure our customers' needs are met.
There is a clear sense from that documentation that people are expected to move from Azure Synapse to Microsoft Fabric. Indeed, there is a whole section on Microsoft's Learn site for migration. Missing from this section is Mapping Data Flow. Mapping Data Flows seem generally absent from Fabric, and it is a notable omission. Mapping Data Flow in Azure Data Factory and Azure Synapse is a powerful RAD component for engineering data: quick to learn, and a favourite in organisations where Python experts might be thin on the ground. That said, Microsoft have produced a paper to help those developers make the transition from Mapping Data Flow to Dataflow Gen2. This is the starting point for this discursive blog.
Migration White Paper (learn.microsoft.com)
Bridging the Gap - comparing functions.
The document looks at the components of Mapping Data Flow using the sections found in the Mapping Data Flow transformation overview:
- Multiple inputs/outputs
- Schema modifier
- Formatters
- Row modifier
- Flowlets
- Destination
Let's take these in turn.
Multiple inputs/outputs
All of the transformations except Exists are included; Exists is also missing from the list of unsupported functions. This is not especially problematic, as Exists simply filters the data to those rows which can, or cannot, be found in the joining data stream. Leveraging joins will achieve the same effect, although the transformations needed are more complex. This is true elsewhere and is symptomatic of the shift from Mapping Data Flow to Dataflow Gen2.
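As a rough illustration, here is how Exists might be reproduced with joins in Power Query M. The query and column names (Orders, ValidCustomers, CustomerID) are my own placeholders, not anything from the migration paper; a nested inner join behaves as a semi join, and a left anti join gives the Not Exists behaviour.

// Exists: keep only Orders rows whose CustomerID appears in ValidCustomers.
// A nested inner join filters out non-matching rows without duplicating columns.
let
    Source = Orders,
    Joined = Table.NestedJoin(Source, {"CustomerID"}, ValidCustomers, {"CustomerID"}, "Matches", JoinKind.Inner),
    ExistsRows = Table.RemoveColumns(Joined, {"Matches"})
in
    ExistsRows

For the negated case, swapping JoinKind.Inner for JoinKind.LeftAnti keeps only the rows with no match.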
New Branch.
New Branch and The Reference Transform as Analog
This is essentially true, but there is a missing element that is key to using Reference in Power Query for Dataflow Gen2. In Synapse, once a New Branch is added, any subsequent transformations apply only to the branch they relate to. Not so for Reference. A reference is taken from the query, not from a specific transformation in the query. This means that any new transformations added to the query will affect anything referenced from it. Removing a column? Watch out, it may be needed in a referenced query. So to truly replicate what is being done with New Branch we must create two references which flow independently from our base query.
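In M terms a reference is simply a query whose source is another query by name. A minimal sketch, with BaseQuery, the Status column, and the filters all invented for illustration:

// Branch A: references BaseQuery and filters independently.
let
    Source = BaseQuery,
    OpenRows = Table.SelectRows(Source, each [Status] = "Open")
in
    OpenRows

// Branch B: a second, independent reference to the same BaseQuery.
// let
//     Source = BaseQuery,
//     ClosedRows = Table.SelectRows(Source, each [Status] = "Closed")
// in
//     ClosedRows

Any step later added to BaseQuery itself flows into both branches, which is exactly the behaviour to watch for.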
Conditional Split
Conditional Split and The Reference Transform as Analog
Our old friend rears its head again! Conditional Split is a further branching type and has the same drawbacks as New Branch. Similarly to Exists, the point of the branched streams is that they filter data into the branch where it meets one or all conditions.
- In the case of First matching condition, you may branch multiple times safe in the knowledge that all data in that branch adheres to the rule. Any row not meeting any rule will travel the default path.
- In the case of All matching conditions, data will travel every path where it conforms to the rule, and the default path catches rows where no condition is met.
To replicate this, each new referenced query will require the data to be filtered. In the case of the default branch this may require complex advanced filter configurations.
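A sketch of that filtering in M, with every name and condition invented for illustration. Note how first-match semantics force each later branch to negate the earlier conditions, and the default branch to negate all of them:

// Branch 1: rows meeting the first condition.
let
    Source = BaseQuery,
    Branch1 = Table.SelectRows(Source, each [Amount] > 100)
in
    Branch1

// Branch 2 (first-match semantics): meets condition 2 but not condition 1.
// Table.SelectRows(BaseQuery, each [Region] = "EU" and not ([Amount] > 100))

// Default branch: negates every condition.
// Table.SelectRows(BaseQuery, each not ([Amount] > 100) and not ([Region] = "EU"))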
Schema modifier
All of the schema modifiers are present in the documentation, which is to be expected. There are a couple of pieces of functionality that are worth discussing.
Column pattern
The column pattern allows for transformations to be applied to multiple columns whenever they match the logic supplied. More details can be found in the documentation. A common pattern I employ is to handle NULL values in data.
This pattern looks for all number fields and then checks their value to return zero wherever a NULL value is found. The Replace values transform appears to be a reasonable analog.
Table.ReplaceValue(#"Replaced
value", null, false,
Replacer.ReplaceValue, {"Column10", "Column14"})
The biggest difference here is that specific columns must be included, whereas the column pattern is more schema agnostic.
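That gap can be narrowed in M by discovering the numeric columns at runtime rather than hard-coding them. A sketch, assuming the previous step is named #"Replaced value" and column types have been set in an earlier step:

let
    Source = #"Replaced value",
    // Find every column typed as a (nullable) number, mimicking the column pattern's type check.
    NumberColumns = Table.ColumnsOfType(Source, {type nullable number}),
    // Replace NULL with zero in just those columns.
    ReplacedNulls = Table.ReplaceValue(Source, null, 0, Replacer.ReplaceValue, NumberColumns)
in
    ReplacedNulls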
Select
There are a couple of things that the Select does which I don't believe are possible in Dataflow Gen2. The first is the ability to reference a column by its name. This is useful for mapping drifted columns, or mapping columns which may or may not exist. The second, perhaps more useful, is selecting columns by rule-based mapping.
Select All Example
Select based on Name example
Another example selects columns based on their name and alters the name for all matching columns. A use case would be to remove joined columns from matching logic.
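A partial stand-in in M is a rule-based rename with Table.TransformColumnNames. The "Right." prefix below is hypothetical, standing in for whatever naming rule identifies the joined columns:

let
    Source = #"Merged queries",
    // Rename every column matching the rule; all others pass through unchanged.
    Renamed = Table.TransformColumnNames(Source,
        each if Text.StartsWith(_, "Right.") then Text.AfterDelimiter(_, "Right.") else _)
in
    Renamed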
Formatters
Flatten and Parse are both accounted for, and the JSON examples cover a common use case. Stringify is not currently present and is listed in the exceptions.
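For the Parse case, the rough M equivalent is parsing a JSON text column and expanding the result. The column and field names below are invented for illustration:

let
    Source = #"Previous step",
    // Parse the JSON text in each row into a record.
    Parsed = Table.TransformColumns(Source, {{"Payload", Json.Document}}),
    // Expand the fields of interest into their own columns.
    Expanded = Table.ExpandRecordColumn(Parsed, "Payload", {"id", "name"})
in
    Expanded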
Row Modifier
Alter Row
Assert
Assert can be used to:
- fail the Data Flow
- apply an Assert rule to rows
Both can be roughly approximated in M, as sketched below.
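This is my own approximation, not anything from the migration paper; the step and column names are assumed. An added flag column plays the role of the row-level Assert rule, and an explicit error fails the query when any row violates it:

let
    Source = #"Previous step",
    // Row-level rule: flag each row with the assert outcome.
    Flagged = Table.AddColumn(Source, "AssertPassed", each [Amount] >= 0, type logical),
    BadRows = Table.SelectRows(Flagged, each not [AssertPassed]),
    // Fail behaviour: raise an error if any row violates the rule.
    Checked = if Table.IsEmpty(BadRows)
        then Flagged
        else error Error.Record("AssertFailed", "Rows with negative Amount found", null)
in
    Checked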
Flowlet
A flowlet is a block of pre-developed transformations which can be applied in a Mapping Data Flow. The documentation suggests the use of custom functions and supplies this link to how custom functions work.
There is a step in the documentation that does not function in Fabric. This step allows the application of a parameter to determine which source to use in the function. Without it, it is not possible to replicate further steps against a later-specified source. Further to this, flowlets are stored in Synapse as part of a library of code, so they can be called from multiple different Mapping Data Flows. There appears to be no similar capability for custom functions.
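For reference, a custom function in a Dataflow is just a query whose value is a function. A minimal sketch (CleanNumbers is my own name for it) that takes the source table as a parameter, which is the closest M gets to the flowlet's parameterised source:

// Query "CleanNumbers": a reusable transformation block within this Dataflow.
(inputTable as table) as table =>
let
    // Replace NULL with zero in every numeric column of whatever table is passed in.
    NumberColumns = Table.ColumnsOfType(inputTable, {type nullable number}),
    Cleaned = Table.ReplaceValue(inputTable, null, 0, Replacer.ReplaceValue, NumberColumns)
in
    Cleaned

// Invoked from any other query in the same Dataflow:
// CleanNumbers(#"Previous step")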
Data Flow Use Cases
While this example suggests using a Spark pool, it would be similarly possible to use Mapping Data Flow. Ingesting data is perhaps simpler with Copy Data pipelines, especially for on-premises data, but transformation is definitely simplified by using Mapping Data Flow.
Data Flow Gen 2
Microsoft suggests using pipelines and dataflows for ingestion, and notebooks and dataflows for transformation. Whether this holds true depends on your expectations of your transformation stages.
Ingestion
Ingested data tends to fall into one of two patterns:
- Data that grows over time, containing all the records ever processed
- Data that contains only a set of records that are to be further processed, then archived or disposed of
Transform
Conclusion.
Microsoft's own guidance paints this picture: "Mary is an experienced Power Query user, and the data volume is in the low to medium range to achieve desired performance. Dataflows provide no-code or low-code interfaces for ingesting data from hundreds of data sources."
In my investigations I don't find the choices they propose to be borne out. There is hope, however, that to meet these claims we will see a lot of new features in Dataflows coming soon.