DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 9

29. Delta Lake Introduction

Hi and welcome back. Now in this chapter I just want to go through Delta Lake when it comes to Azure Databricks, and we'll see a couple of examples of Delta Lake in Databricks itself. So with the help of Delta Lake, you get some more features when it comes to tables that are stored in Azure Databricks. One of the features that you get is ACID transactions. So here you can always ensure that you get consistent data. So when someone is, let's say, performing an update on records that are stored in a table, you can always be assured that readers never see inconsistent data. So here you now have the feature of having transactions on the data in your underlying tables.

Apart from this, you also have the ability to handle all the metadata for your data itself. In addition to this, a table can be used both for your batch jobs and also for your streaming jobs. You also have schema enforcement to ensure that no bad records are inserted into your table. You also have this concept of time travel. So when it comes to your data, you have data versioning that helps you in terms of performing rollbacks. We'll actually see an example of this in a later chapter. And then finally, you can also perform upserts and deletes on your data. So here I just wanted to give a quick introduction to Delta Lake. In the subsequent chapters, we'll see some examples of implementing Delta Lake.
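Just to make the upsert and delete point a little more concrete before we move on, here is a minimal sketch of what those operations could look like on a Delta table in a Databricks notebook. The table name metrics, the staging view and the column names are purely illustrative assumptions, not something created in these labs.

# Illustrative only: the table, view and column names below are placeholders.
updates_df = spark.createDataFrame(
    [(1, "cpu_percent", 75.0), (2, "dtu_used", 12.5)],
    ["MetricId", "MetricName", "Average"])
updates_df.createOrReplaceTempView("metric_updates")

# Upsert: update matching rows and insert the rest (MERGE is a Delta Lake feature).
spark.sql("""
    MERGE INTO metrics AS target
    USING metric_updates AS source
    ON target.MetricId = source.MetricId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Deletes on a Delta table work through plain SQL as well.
spark.sql("DELETE FROM metrics WHERE MetricName = 'old_metric'")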

30. Lab – Creating a Delta Table

Now in this chapter, I'll show you how to create a Delta Lake table. So here I have the code in place. So here I want to first of all take information from a JSON-based file that I have in my Azure Data Lake Gen2 storage account. So this is my Data Lake Gen2 storage account. You've seen this earlier. I have my raw directory and I have a JSON-based file. So again, this JSON-based file has metrics that are coming in from our database via diagnostic settings. Again, I will ensure to keep this JSON file as a resource attached to this chapter. You can upload it onto the raw directory. And if you've been following along, we already have the Databricks scoped secret in place to access our Data Lake Gen2 storage account. Now, next, we want to create a table. So now we are trying to create a table in Azure Databricks. Here we are using the saveAsTable option to give the name of the table.

Now here, in terms of the format, we are saying go ahead and make this a Delta table. So let me take this and let me execute this in a cell in a notebook that is attached to my cluster. So this is complete. We now have a table in place. So let me now execute another cell. So now we can issue SQL commands against this table. So remember, this is the same diagnostic-based information we had seen earlier on. So these are all metrics that were stored in that JSON-based file.
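As a rough sketch of what such a notebook cell could look like, assuming the JSON file has been uploaded to the raw directory and access to the storage account is already configured via the scoped secret. The storage account, container, file and table names here are placeholders, not the exact ones from the lab.

# Read the diagnostic metrics from the JSON file in the Data Lake Gen2 account.
# Replace the path with your own storage account, container and file name.
df = spark.read.json(
    "abfss://data@datalakegen2storage.dfs.core.windows.net/raw/metrics.json")

# Save the DataFrame as a table, explicitly asking for the Delta format.
df.write.format("delta").mode("overwrite").saveAsTable("metrics")

# The table can now be queried with normal SQL.
display(spark.sql("SELECT * FROM metrics"))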

Now here, if you want to actually partition your table, again, this is something that you can also do. So let's say that you have queries that use a WHERE clause to filter your data, and that filter is based on the metric name. So if you want faster performance for those queries, you can actually partition your data on the metric name. So let's say that your query is looking at information where the metric name is equal to CPU_percent. Then, because the data has now been partitioned, Azure Databricks only has to go into the partition where the metric name is equal to CPU_percent. So in this way you can actually make your queries much more efficient. And creating a table with partitions is very easy. Use the partitionBy clause, and there you tell it which column you want to create the partition on.
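A minimal sketch of the partitioned version, again with placeholder names and assuming the metric name lands in a column called metricName, could look like this:

# Write the same DataFrame as a Delta table partitioned on the metric name.
# Queries that filter on metricName then only need to scan one partition.
(df.write.format("delta")
   .partitionBy("metricName")
   .mode("overwrite")
   .saveAsTable("partitionmetrics"))

# Count the rows for each metric, grouping on the partition column.
display(spark.sql("""
    SELECT metricName, COUNT(*) AS metric_count
    FROM partitionmetrics
    GROUP BY metricName
"""))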

Here I am giving another table name of partitionmetrics. So let me take this. Here I'll go on to a new cell. Let me run this. And while this is running, you can already prepare the next command. So here I am selecting the metric name and the count from partitionmetrics and grouping by the metric name. So once the first cell is done, let me run this. Let me ensure that it is a SQL-based command. So we have all of this information in place. So in this chapter, so far, we have only been looking at how to create a Delta table. In the next couple of chapters, we'll see some advantages of using the Delta table. Now, before I actually wrap this chapter up, I just want to give you a note. So if I go on to the Compute section, we have our clusters in place. Now, if I create a cluster here, you can see that, based on our runtime, the Databricks runtime 8.x automatically uses Delta Lake as the default table format. So when creating a table, you don't necessarily have to tell it to be a Delta table. This is the default table format that will be used. But from the perspective of the exam, these commands are important, so you should know them. That's why I ensured to let you know what the commands are to actually create a Delta table.

31. Lab – Streaming data into the table
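
The idea in this chapter, as the title and the following chapter suggest, is to continuously append new records to a Delta table as files arrive, using Spark Structured Streaming. Purely as an illustration, and with the schema, paths and table name below being placeholder assumptions rather than the exact ones from the lab, such a streaming write could look roughly like this:

from pyspark.sql.types import StructType, StringType, DoubleType

# A streaming read needs an explicit schema; this one is only illustrative.
schema = (StructType()
          .add("metricName", StringType())
          .add("average", DoubleType())
          .add("time", StringType()))

# Continuously pick up new JSON files that land in the raw directory.
stream_df = (spark.readStream
             .schema(schema)
             .json("abfss://data@datalakegen2storage.dfs.core.windows.net/raw/"))

# Append the incoming rows to a Delta table; the checkpoint tracks progress.
(stream_df.writeStream
          .format("delta")
          .outputMode("append")
          .option("checkpointLocation", "/tmp/checkpoints/newmetrics")
          .toTable("newmetrics"))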

32. Lab – Time Travel

So in the prior chapter we had seen how we could stream data onto a Delta Lake table. Now, in this chapter I just want to give a quick overview of the time travel feature that is available for your Delta Lake tables. So here let me issue the SQL statement DESCRIBE HISTORY against the new metrics table. So here, whenever any change is made to the data in the table, because it is a Delta Lake table, a new version of the table is created. Here you can see, for each version, what operation was performed.
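As a small sketch of the statements used in this chapter, and assuming the streamed table from the previous chapter is called newmetrics (a placeholder name), the history statement and the version-based queries look roughly like this:

# List every version of the Delta table along with the operation that produced it.
display(spark.sql("DESCRIBE HISTORY newmetrics"))

# Query the table as it looked at a particular version number.
display(spark.sql("SELECT * FROM newmetrics VERSION AS OF 1"))
display(spark.sql("SELECT * FROM newmetrics VERSION AS OF 2"))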

So anytime there is a streaming update on the table, here you can see the operation, and here you can see the operation parameters. And if you want to select data from the table as of a particular version, that is something you can do as well. So for example, here if I select star from the metrics table as of, let's say, version one, I can see that I have no results, because there was probably no data in that table at that point. Let me go on to version two. And now you can see you have some data in place. So you can actually look at your data at different versions, at different points in time. So this is the concept of time travel that is also available for your Delta Lake tables.

33. Quick note on deciding between Azure Synapse and Azure Databricks

So in this chapter, I just want to again go through some quick points when it comes to comparing the Spark pool in Azure Synapse with the Spark engine that is available in Azure Databricks. So with Azure Synapse, you do have the advantage of having everything in one place. So you can host your data warehouse by creating a dedicated SQL pool. You can also create external tables that point to, let's say, data in an Azure Storage account. You can also bring your storage accounts much closer in Azure Synapse by linking Azure Storage accounts, in this case Azure Data Lake Storage Gen2 accounts. Then you also have the Integrate section, wherein you can develop pipelines. So you can make use of these pipelines to copy data from a source to a destination.

So you have everything in one place in Azure Synapse. Whereas in Azure Databricks, we have seen that we have a lot of functionality available, and this is based on the underlying Spark engine. Also, when it comes to Azure Databricks, it's not only for data engineering; it can also be used for data science and machine learning. So a lot of the frameworks that are available for machine learning are also part of the Databricks service. So this is one complete solution if you are looking at data engineering, data science, and machine learning. And as I mentioned before, because the people who made Spark have also made Azure Databricks, whatever changes they make to the Spark engine will always be available in Azure Databricks. So in this chapter, I again wanted to go through a few points on both of these services to help you decide which service best suits your needs.

34. What resources are we taking forward

So, again, a quick note on what I'm taking forward. So I need to discuss some monitoring aspects when it comes to Azure Databricks, and we'll be covering this in the monitoring section. So at this point in time, you can actually go ahead and delete your cluster if it's no longer required. And then, when we go on to the monitoring section, you can go ahead and recreate the cluster. But we will visit the monitoring part when it comes to the Azure Databricks service.
