Microsoft Fabric – My Notebook

This post is reference documentation for Microsoft Fabric.

About the Environment

Fabric is a major expansion of the Power BI and analytics platform; per Microsoft, it is "the most significant data platform innovation since SQL Server." It is a complete platform for all analytic workflows. Plan to touch OneLake, Direct Lake, Lakehouses, Delta tables, and more. See below for platform features and descriptions:

Fabric really changes the Power BI experience. A Fabric workspace and a Power BI workspace have effectively become the same thing. A Power BI developer used to rely on others to build data factories, write notebooks, run ETL, and so on to populate a data warehouse. In Fabric, with Semantic Link, SQL and semantic model endpoints, and the primary components shown above all joined together, the Power BI developer is empowered to own the end-to-end data story and/or partner with others in the same workspace to craft a solution.

Feature Review and Walkthrough

Step 1: Create a new Workspace

Step 2: Create a Lakehouse

Step 3: Add Data Set to File Storage

Drill into the newly created Lakehouse root, open the "Files" context menu, and upload the data.

Return to the workspace home and add a new Dataflow Gen2 to launch the Power Query Editor.

First, create a table through Get Data > Blank Query > Advanced Editor. After pasting in the M code, enter the parameters and invoke:

Next, rename the query to "Dates".

Now use Get Data > Lakehouse to read in the uploaded Excel file by navigating to the Lakehouse Files node in the subsequent explorer.

Filter the resulting query to rows where "Kind" = "Table". Then click the [Table] entry inside the Orders_Table row:

Rename the Query to “Orders”.

Copy-and-paste this query, and then change the new query by opening the “Filtered rows” step and drilling into “Products_Table”. Rename this new query to “Products”.

After this, select the Orders query and then “Merge Queries”. Use the first option, as a new query is not wanted (only the merged query).

Set the join between Orders and Products on the ProductId key. Use a Left Outer join (I do have a missing product, which is ignored).

Expand the Products table and choose "Product Name", "Category", and "Price".

Remember to set the output destination for all queries (including Dates) via Add Data Destination or the controls within the step editor:

Publish the changes and wait for the Dataflow to complete. Once published, the new Delta tables will be present. Next, update the model and create line-item sales totals. To do this, enter the SQL analytics endpoint node of the Lakehouse:

The experience is similar to the Model and Table views in Power BI Desktop.

Set up the relationship between Dates and Products.

Click on the Orders table in the Explorer. This is just for context, to hold the measure "m_LineTotal", which calculates the product price times the order quantity:

m_LineTotal = SUMX(Orders, Orders[Quantity] * Orders[Price])

Select the new measure, then go to Measure tools and update the format to Currency.

Now, move on to data visualization. Return to the workspace home and click into the semantic model for the Lakehouse, which opens the Power BI visual editing options. Choose to create a custom, blank report:

Add a card for total sales, and then a category table to use as a slicer. Add a clustered bar chart for total sales by product and a pie chart for total sales by category.

Moving on to the Data Warehouse. Unlike the Lakehouse, a Warehouse provides full SQL DML.

Return to the Workspace root and select New > Warehouse

From within the Warehouse, choose "Get Data" > Dataflow. This will be a simple get operation that pulls from the Lakehouse Delta tables created above.

After returning to the Power Query Editor, ensure that the data destinations are set. Then publish the changes.

The Warehouse is now populated:

A major difference between the Warehouse and Lakehouse architectures is writeability. The Warehouse SQL endpoint is writeable, but the Lakehouse SQL endpoint is not. Recent additions show some enhancements within the "New SQL Query" option for the Lakehouse, but read/write support remains disparate between the two.

Connect to the Warehouse. The default is the SQL endpoint, which opens the Power Query Editor. Model the tables, etc., as has already been done for the Lakehouse.

The Power Query Editor (i.e., the SQL endpoint) has a "Create Report" button within the canvas that is not there for the Lakehouse.

PySpark and Jupyter Notebooks

Go to the Warehouse root and choose new Notebook. Connect the notebook to a Lakehouse:

Load data using Spark by clicking one of the tables and using its context menu. Note that PySpark is the default; other languages are available:

Similar to the measure in the previous examples, I've added a 'LineTotal' column:

Trim the dataset, again ignoring the null values:

Using display(df) instead of df.show() gives the rich table output and export features:

After all of the transformations, save using df.write.mode('append').format('delta').saveAsTable('Invoicing')
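Putting the above steps together, a minimal sketch of the notebook cells (column and table names such as Orders, Price, Quantity, and ProductId are assumptions based on the earlier Lakehouse tables):

from pyspark.sql import functions as F

# Load the Delta table from the attached Lakehouse
df = spark.read.table("Orders")

# Add the LineTotal column, mirroring the earlier DAX measure (Price * Quantity)
df = df.withColumn("LineTotal", F.col("Price") * F.col("Quantity"))

# Trim the dataset by dropping rows with a null ProductId
df = df.dropna(subset=["ProductId"])

# display() gives the rich, exportable table output in Fabric notebooks
display(df)

# Persist the result back to the Lakehouse as a Delta table
df.write.mode("append").format("delta").saveAsTable("Invoicing")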

This pushes the Invoicing table out to the Lakehouse, visible directly within the notebook's Lakehouse explorer and within the Lakehouse itself. This is my Lakehouse:

Continuing the investigation: note that files can be opened with Spark or Pandas, whereas Delta tables were Spark only:

XLSX can't be read by Spark without external libraries, so switch to Pandas:

Here’s a great article on PySpark vs. Pandas and when to use which: https://medium.com/geekculture/pandas-vs-pyspark-fe110c266e5c.

Individual tables can be read by referencing their worksheet. This assumes each worksheet contains only that table:

To write the sample back, convert it to a Spark DataFrame and then write, as sketched below.
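A minimal sketch of this round trip, assuming a workbook at a hypothetical path under the default Lakehouse Files area with an "Orders" worksheet (openpyxl is assumed to be available for pandas to read XLSX):

import pandas as pd

# Hypothetical path to the uploaded workbook under the attached Lakehouse
xlsx_path = "/lakehouse/default/Files/SampleData.xlsx"

# Read a single worksheet into a pandas DataFrame
orders_pd = pd.read_excel(xlsx_path, sheet_name="Orders")

# Convert back to Spark and write the sample out as a Delta table
orders_df = spark.createDataFrame(orders_pd)
orders_df.write.mode("overwrite").format("delta").saveAsTable("Orders_FromExcel")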

Parameterization and Notebooks

Passing parameters into pipelines using Notebooks is straightforward.

Create a new Notebook from Data Engineering > Synapse Notebook.

Add the Lakehouse, then load data using PySpark:
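For example, the generated cell when dragging a table in looks roughly like this (the Lakehouse and table names here are placeholders):

# Load a sample of the Orders table from the attached Lakehouse
df = spark.sql("SELECT * FROM MyLakehouse.Orders LIMIT 1000")
display(df)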

After loading, move to the top cell and convert it to a parameters cell:

After doing so, the cell type changes:

A quick test of the placeholder for the (soon-to-be) passed-in parameter:

Another quick test shows that a table can be created using this variable as its name too; a sketch of both tests follows.
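A minimal sketch, assuming df is the DataFrame loaded above; the TABLENAME value shown is just a local test value that the pipeline will override later:

# Parameters cell: this value is a placeholder for local testing and is
# overridden by the pipeline at run time
TABLENAME = "TestTable"

# Quick tests: the variable resolves, and it can name a new Delta table
print(f"Target table: {TABLENAME}")
df.write.mode("overwrite").format("delta").saveAsTable(TABLENAME)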

Now, set TABLENAME to an empty string and add a new pipeline by going back to the Lakehouse root and choosing Get Data > Data Pipeline:

We are not copying data, so close the subsequent dialog and choose "Add pipeline activity":

Set a variable to the table name, and then invoke the Notebook:

And, done:

Power BI and Source Control

As of today, GitHub still isn't supported, so this uses Azure DevOps Git.

Added a new Project in Azure DevOps. Created a local folder and cloned the empty repo using VS Code and git. E.g., git clone <copied URL from Azure DevOps>

Created a new workspace. In the workspace settings, find the Git integration configuration:

From here, configure the repo details and choose "Connect and Sync".

Go on to create a Power BI report. Instead of saving as PBIX, save as PBIP.

Used the following data source: https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv

Enable preview feature (as of 2/6/2024):

This is the Power BI Desktop projects feature, which is required to enable source control: https://learn.microsoft.com/en-us/power-bi/developer/projects/projects-overview

Push the PBIP and all other saved content to the Azure DevOps repo.

Now, going back to Fabric, we see:

After synchronizing and committing from Source control, we see the table and dataset:

If a change is then made to the report in the Power BI service, this too is picked up and can be resynchronized via the Power BI web interface. Go back to Visual Studio Code and execute git pull to bring these changes down locally.

OneLake File Explorer

Much like there is an Azure Storage Explorer, there is a OneLake File Explorer.

Shortcuts

https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcuts

Shortcuts can connect to several data sources: Amazon S3, Azure Data Lake Storage, or another location on OneLake. I chose OneLake and picked the Lakehouse in the other workspace:

This is important because it essentially lets us go across databases. Recall, for example, that there are no Linked Servers or the like in Azure SQL Database (outside of Managed Instance).

Note that I included a Data Warehouse table inside the Files shortcut, and it exposed the table as its underlying Parquet files!

Dataflow Gen2 Tips

Remove Spaces from Column/Header Names

Navigate to Workspace

Create new Data Flow Gen 2

Use Get Data > Lakehouse and select a file with spaces in the header names. By default, the first row won't be promoted to headers.

Convert it with "Use first row as headers".

Use "Add Data Destination" and choose Lakehouse (or whatever is appropriate) as the container.

Drill into the correct Workspace > Lakehouse. Choose Next, and the column warning will come up:

Use the "Fix It" button in the "Some column names…" notification bar (at the far right; not shown in the image above).

It will automatically replace space characters with underscores. (The same cleanup can also be done in a notebook, as sketched below.)
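For reference, a minimal PySpark sketch of the same rename, assuming df is a DataFrame whose column names contain spaces (this is an alternative to the Fix It button, not what the Dataflow generates):

# Replace spaces in every column name with underscores
renamed = df.toDF(*[c.replace(" ", "_") for c in df.columns])
renamed.printSchema()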

Export and Re-Use

Data Flows can be exported:

After doing so, the browser will download the PQT file.

Now navigate to another workspace, for example, and create New > Dataflow Gen2.

There is an option to import from a Power Query template:

Data Wrangler with Notebooks

Create a new notebook from either Data Engineering or Data Science:

Attach the notebook to a lakehouse, and ingest data. For tables, we have Spark, and for files we have Spark and Pandas:

Load a Delta table using Spark. Then switch to the Data tab and find the Data Wrangler tool.

This launches the data frame selector:

Note that, previously, Data Wrangler could only load Pandas data frames.

Note also the message regarding Pandas at the top of Data Wrangler, as well as the highlighted options available for data wrangling.

As an example, replace Los Angeles with Truckee. The results and the final output code can be previewed:

Click Apply, and then click "Add Code to Notebook". Another preview/confirmation is shown with the PySpark code:

After clicking add, the code appears in the notebook and can be executed.
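The generated code looked roughly like the sketch below; the exact output differs, and the "City" column name is an assumption for this example:

from pyspark.sql import functions as F

def clean_data(df):
    # Replace "Los Angeles" with "Truckee" in the assumed City column
    return df.withColumn(
        "City",
        F.when(F.col("City") == "Los Angeles", "Truckee").otherwise(F.col("City")),
    )

df_clean = clean_data(df)
display(df_clean)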

Note that Data Wrangler also has options to export to CSV, among other features.

Dependencies and Python Packages

Dependencies can be installed at the notebook level. First, install the package:

pip install yfinance

After installing the package, it can be used as expected. E.g.,
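A quick, hedged example of what that usage might look like (the ticker symbol and period are arbitrary choices, not from the original walkthrough):

import yfinance as yf

# Pull roughly a month of daily prices for an example ticker
msft = yf.Ticker("MSFT")
history = msft.history(period="1mo")
print(history.head())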

However, this makes the package available only within this notebook. To install the package for the entire workspace, for example:

From: https://learn.microsoft.com/en-us/fabric/data-engineering/library-management

Now, one must:

  1. Create a new Environment
  2. Add the libraries to that environment
  3. Attach the environment as the default to the workspace

Navigate to the MS Fabric home, open "Data Engineering", and then create the environment:

After creating the environment, add the package as shown above. Then publish to make the environment available.

Now, this Environment can be selected from within a notebook:

One should be able to add the environment to the workspace too, via Workspace Settings. However, I’m not seeing what is shown in the documentation.

See https://learn.microsoft.com/en-us/fabric/data-engineering/create-and-use-environment#attach-an-environment

This tabbed interface is not available for me in "Spark Settings", and I can't find anywhere else to set the environment as the default. The screenshot below is from the link above; these tabs are not present for me, and I'm definitely the admin.

Managing Permissions

First, drill into the workspace. Permissions can be managed at this top level:

Now, this user can, for example, read all of the Delta tables in my Lakehouse. So what if I wanted to prevent access? This can be done by opening the SQL endpoint and running, for example:

DENY SELECT ON Orders TO [foo@dot.com];

Interestingly, the effect takes place immediately in SSMS, for example (connecting with the Lakehouse SQL endpoint URL):

But it takes quite some time to propagate to Power BI. After something like 10 minutes, I could still connect. The documentation explains these updates can take a couple of hours.

https://learn.microsoft.com/en-us/fabric/data-warehouse/share-warehouse-manage-permissions

I think this is what’s happening as shown below:

This seems like a pretty critical issue. I'm going to test later and see if I can still view the table. I also need to test whether I can DENY SELECT ahead of time and only afterwards give the user viewer permissions; will this prevent them from ever having access? Note: 30 minutes or so later, I can still reconnect to that table in Power BI.

Another aside: I have to add an additional filter as this "viewer" user to get the Lakehouse to show up. This was not the case as the admin user.

If I don't add that endpoint filter, the Lakehouse will not show up.

Paginated Reports

Open the semantic model for the Lakehouse. There is a paginated report feature directly within the canvas:

Use the Build pane to update the visual layout and make changes such as updating column aggregations:

Using regular Print or Export, the report can be printed to a multipage PDF. This used to be complex, requiring third-party visuals in Power BI.

Semantic Link

The definition: "Semantic link is a feature that allows you to establish a connection between semantic models and Synapse Data Science in Microsoft Fabric. Use of semantic link is only supported in Microsoft Fabric." This is from https://learn.microsoft.com/en-us/fabric/data-science/semantic-link-overview.

See the paginated reports section above, where I opened the semantic model of the Lakehouse instead of the SQL or live data connections. That same semantic model is what opens the Power BI report designer feature in the web. What's effectively happening with Semantic Link is that Fabric exposes these model semantics for consumption from Python.

First,

pip install semantic-link

Now the sempy library is available.

https://learn.microsoft.com/en-us/fabric/data-science/semantic-link-overview

https://learn.microsoft.com/en-us/fabric/data-science/semantic-link-semantic-propagation

The second doc gives a good summary: "The SemPy Python library is part of the semantic link feature and serves pandas users. SemPy supports the operations that pandas allows you to perform on your data. Furthermore, SemPy allows you to propagate semantic data from semantic models on which you're operating."

This is a Power BI semantic model being read in the Jupyter notebook.

This connects the data analyst and the data scientist on the same dataset, which previously was not seamlessly possible. The Power BI dataset is now functioning as a pandas table!

You can also get measures using, for example, fabric.list_measures, and then even evaluate them using fabric.evaluate_measure().
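A minimal sketch of the sempy calls above; the dataset, table, and group-by names are placeholders for whatever is published in the workspace:

import sempy.fabric as fabric

# Read a table from a published semantic model into a pandas-style FabricDataFrame
orders = fabric.read_table("Sales Model", "Orders")

# List the measures defined in the model
measures = fabric.list_measures("Sales Model")

# Evaluate a measure, grouped by a model column
totals = fabric.evaluate_measure(
    "Sales Model",
    measure="m_LineTotal",
    groupby_columns=["Products[Category]"],
)
print(totals.head())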

One note: the semantic model doesn't get created this way if the data source lives within the Lakehouse as Delta tables, for example; I had to publish from Power BI, starting from a CSV.

Conclusion

I will keep adding to this as my ramp-up on Fabric continues.



