Our Chief Technology Officer (CTO) Mr. Anil Saraswathy hosted a deep dive webinar on the topic ‘Mastering Modern Data Architecture on the Cloud’. From this webinar, you will gain a comprehensive understanding of the latest cloud-based data architectures, including data lakes, data warehouses, ETL data pipelines with AI/ML-based data enrichment, and more. You will also learn about the benefits and challenges of each approach, and how to select the right one for your organization’s needs.
So thanks for joining this webinar on modern data architectures for the cloud. I am Anil Saraswathy, Chief Technology Officer at InApp Information Technologies. We are a 500-strong consulting services company with offices in India, the US, and Japan. So the agenda for today is as follows.
We kick off proceedings with the motivation for a modern data architecture. Then we introduce the Data Lake and Lake House concepts, and compare that with data warehouse concepts before getting into the data governance and further details on the data architecture. We then go through the AI/ML pipelines, finally touching upon the various use cases of modern data architecture.
I’ll be referencing both AWS and Azure services while going into the implementation details, while also mentioning open-source alternatives should you need an on-premises type of solution.
So without further ado, let’s get started.
So what is the definition of a modern data architecture?
One definition is that it is a cloud-first data management strategy designed to deliver value through visual insights across the landscape of available data while ensuring governance, security, and performance.
If you look closely at this sentence, a few key phrases come to the surface. Visual insights – These days, analytic insights on data are typically generated through AI models and visualized through some kind of dashboard.
Now, when you consider the landscape of available data, you are basically talking about your own data coming out of your enterprise CRM, accounting, sales and marketing, emails, log files, social media, and so on and so forth. And data governance is about who can securely create and curate this data and share it among organization units within your own enterprise or among customers and suppliers so as to enable them to generate visual insights.
So data is everywhere. It comes from new data sources, grows exponentially, and gets increasingly diverse. It is analyzed by different applications, but that analysis is limited by scale and cost as data volumes grow. Enterprises are struggling with numerous data sources and limited processing power to transform and visualize their data.
And the trend with data democratization is pushing enterprises to be able to get the relevant data products into the hands of people within your enterprise as well as outside. And within the same enterprise, you might have several data silos. You might have a new requirement for which you build another application that creates another data silo.
There is so much duplication of code and data among these applications that typically don’t talk to each other. There will be teams independently maintaining these monolithic applications where you probably might be using the same transactional database for reporting purposes as well, resulting in poor performance.
And this is where the relevance of a cloud-based platform that allows you to build several data products at scale from a variety of your data sources comes into the picture.
Now what are the types of data that enterprises typically deal with?
These days, for deriving new insights from your data, enterprises look not just at the transactional data but at a lot of unstructured data as well. For example, comments from e-commerce sites about your product, support requests and responses coming by e-mail, log files from your applications, customer service chats, surveys, and so on and so forth.
Traditionally data warehouses are used to store data in the form of facts, measures, and dimensions, FMD for short, derived from your transactional databases. Sometimes new tables over and above what you have in your transaction databases need to be synthesized.
For example, if an AI/ML model is used to predict the customer churn rate and the churn rate is tracked over time, the churn rate can be considered a fact.
If a model is used to segment your customers based on their behavior and preferences, the derived segment, like let’s say students, can be considered a dimension so that you could filter your facts by that segment. If a model is used to predict the lifetime value of a customer, the predicted lifetime value can be considered a measure.
So suffice it to say that facts, measures, and dimensions are your bread and butter if you are a data warehouse architect or if you’re considering modern data architecture.
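As a rough illustration of facts, measures, and dimensions, here is a minimal Python sketch. The table contents, the segment names, and the helper total_by_segment are all made up for illustration. A real warehouse would hold these as tables and do the grouping in SQL.

```python
from collections import defaultdict

# Dimension table: one row per customer, with a model-derived segment.
# All names and values here are hypothetical.
customer_dim = {
    1: {"name": "Asha", "segment": "student"},
    2: {"name": "Ben", "segment": "professional"},
    3: {"name": "Chen", "segment": "student"},
}

# Fact table: one row per sale, keyed to the dimension by customer_id.
# "amount" is the measure we aggregate.
sales_fact = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 150.0},
    {"customer_id": 3, "amount": 35.0},
    {"customer_id": 1, "amount": 10.0},
]

def total_by_segment(facts, dim):
    """Sum the 'amount' measure grouped by the 'segment' dimension."""
    totals = defaultdict(float)
    for row in facts:
        segment = dim[row["customer_id"]]["segment"]
        totals[segment] += row["amount"]
    return dict(totals)

totals = total_by_segment(sales_fact, customer_dim)
print(totals)
```

Filtering the facts by a segment such as students is then just a lookup on the result, which is the role a dimension plays in a dashboard filter.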
Traditionally, data warehouses used to be built through ETL which is short for Extract, Transform, Load. With modern data architectures, we are seeing more of an ELT approach which is Extract, Load, and then Transform and that’s where we see the concept of data lakes.
You load raw data from different sources into your data lake and then do a series of transformations to curate the data. You would probably build a data warehouse with facts, measures, and dimensions, or sometimes query directly from your raw data. And you know the use cases may be different for different customers. I recall one scenario with a particular customer where a data warehouse, AWS Redshift in that case, was the data source for a data lake.
So what’s a Data lake?
A data lake is a large and centralized repository that stores all structured, semi-structured, and unstructured data at scale. Data lakes provide a platform for data scientists, analysts, and other users to access and analyze data using a variety of tools and frameworks.
These include AWS EMR, AWS Redshift, Athena, and Azure Synapse, which by the way combines several tools like Azure Data Factory and Azure SQL Data Warehouse. These were all different products before, and now they have kind of bundled everything into Synapse. They also include several machine learning libraries like PyTorch, TensorFlow, and Keras, orchestrated by frameworks such as MLflow, SageMaker, or Azure ML.
An important aspect of data lakes is this concept of schema on write versus schema on read. Schema on write refers to the traditional approach of defining the schema or structure of the data before writing it to the database or data warehouse. This means that data must conform to the predefined schema before it can be loaded into the database.
Although schema on write is typically used in relational databases where the schema changes must be carefully managed to maintain data integrity, some data lakes like Delta Lake and Snowflake which are not relational do offer this as the schema enforcement feature.
In contrast, schema on read is an approach where data is stored in its raw form or unstructured form and the schema is applied at the time the data is read or queried.
This means that data can be loaded into the database without the need for any predefined schema. Instead, the schema is defined at the time of analysis or query, based on the requirements of the user. The main advantage of schema on read is its flexibility. It allows for the storage of various types of data including unstructured data such as text, images, and videos. It also allows for easier integration of data from various sources as there is no need to predefine the schema.
You could of course transform it and store it in a database that enforces schema. You must know that schema on read requires a bit more processing power and time to query the data, as the schema must be applied at the time of analysis. Of course, schema on read typically supports options to tolerate columns that are missing from the data or whose schema has changed.
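To make the contrast between schema on write and schema on read concrete, here is a minimal sketch in plain Python. The ORDER_SCHEMA dictionary and the two helper functions are hypothetical names invented for this example. They do not correspond to the API of Delta Lake, Snowflake, or any other product.

```python
import json

# Hypothetical schema: column name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}

def write_with_schema(table, record, schema=ORDER_SCHEMA):
    """Schema on write: reject the record unless it matches the schema."""
    if set(record) != set(schema):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in schema.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)

def read_with_schema(raw_lines, schema=ORDER_SCHEMA):
    """Schema on read: parse raw JSON lines and project them onto the
    schema at query time. Missing columns become None and extra
    columns are ignored."""
    rows = []
    for line in raw_lines:
        raw = json.loads(line)
        row = {}
        for column, cast in schema.items():
            value = raw.get(column)
            row[column] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# Raw data lands as-is, whatever its shape.
raw_lines = [
    '{"order_id": 1, "amount": "9.50", "coupon": "SPRING"}',
    '{"order_id": "2"}',
]
rows = read_with_schema(raw_lines)

# The same first record is rejected by the schema-on-write table.
strict_table = []
try:
    write_with_schema(strict_table, json.loads(raw_lines[0]))
    rejected = False
except ValueError:
    rejected = True
```

The casting work inside read_with_schema runs on every query, which is the extra processing cost of schema on read mentioned above.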
So let’s now consider a few options for the data lake storage. We all know that S3 is highly scalable and can store virtually unlimited amounts of data. This makes it an ideal storage solution for a data lake where large amounts of data are typically stored. It is also a cost-effective storage solution with pay-as-you-go pricing that allows you to only pay for what you use.
It is designed for high durability with multiple copies of data stored across multiple availability zones. S3 provides several security features such as encryption, access control, and auditing that make it a secure storage solution for sensitive data, which is particularly important for a data lake. Plus, we need to consider that S3 integrates seamlessly with other services not just from AWS but from other providers like Azure and GCP as well.
Typically we use S3 as a data lake storage solution to store Parquet files. Parquet is well known for its columnar storage, compression, schema evolution, interoperability, and wide range of data type support. When working with large data sets, parallel processing systems like Apache Spark can make use of their parallelization and write a data frame into different partitions, where each partition gets saved as a Parquet file in a possibly hierarchical partition folder on S3.
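The hierarchical partition folders mentioned above can be sketched in plain Python. This example only computes the folder each record would land in. It does not write Parquet, and the bucket name, field names, and helper functions are placeholders. In Spark itself, this kind of layout is typically produced by partitioning the write on the chosen columns.

```python
from collections import defaultdict
from datetime import date

def partition_path(table_root, record):
    """Build a Hive-style partition folder (year=/month=/day=) for a record."""
    d = record["event_date"]
    return f"{table_root}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

def group_by_partition(table_root, records):
    """Group records by the partition folder they would be written to."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_path(table_root, record)].append(record)
    return dict(partitions)

records = [
    {"event_date": date(2023, 5, 1), "value": 1},
    {"event_date": date(2023, 5, 1), "value": 2},
    {"event_date": date(2023, 5, 2), "value": 3},
]
# "s3://example-bucket/events" is a placeholder, not a real location.
partitions = group_by_partition("s3://example-bucket/events", records)
for path, rows in sorted(partitions.items()):
    print(path, len(rows))
```

A query that filters on a single day only needs to open the files under that day's folder, which is where the performance benefit of partitioning comes from.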
We now look at ADLS Gen. 2 which is short for Azure Data Lake Storage Gen. 2. It’s a cloud-based storage service provided by Microsoft Azure. It is a scalable and cost-effective Data Lake storage solution that provides the ability to store and manage large amounts of data in a central location, much like S3.
ADLS Gen. 2 also provides a hierarchical file system on top of Azure Blob Storage, which allows for improved performance and better data management capabilities. ADLS Gen. 2 can store and manage petabytes of data, making it a scalable solution for large-scale data processing workloads. And just like S3, ADLS Gen. 2 also offers a tiered pay-as-you-go pricing model.
It has Azure Active Directory integration, role-based access control, and data encryption that help protect data at rest and in transit, much like how IAM protects S3. It is compatible with a wide range of tools including Azure Databricks and Azure Synapse Analytics, both of which are powerful tools in this space.
Now I want to introduce Delta Lake also, which is an open-source storage layer that brings reliability, performance, and scalability to data lakes. It was developed and open-sourced by Databricks, the company behind the popular Apache Spark, and it is now a Linux Foundation project.
Delta Lake is built on top of storage platforms like S3, ADLS Gen. 2, and HDFS. It stores data in Delta Lake tables, which are a type of table optimized for data lake workloads. The storage file format is Parquet, and hence the tables can be stored in S3 or ADLS Gen. 2.
Delta Lake supports ACID transactions, which guarantee that data is always in a consistent state even in the face of concurrent writes. Delta Lake can enforce schema constraints at write time, ensuring that data is always in the correct format should you require it.
It also supports schema on read, which allows data of any schema to be loaded into a delta table. One unique feature of Delta Lake is time travel, which is the ability to access and query previous versions of your data, allowing users to easily roll back changes or analyze data changes over time.
As you can see in the diagram, tables are organized in the form of partition folders and files within the folder. Each folder is a partition that could be hierarchical in nature.
The diagram shows a daily partition format. The advantage of partitioning is that read queries need to go through only the Parquet files in the relevant partition, thus improving query performance. A special folder called _delta_log hosts the transaction logs, which are JSON files containing commands like add file and remove file, corresponding to adding and removing records in the table.
Take this case of adding records to a table. In the first transaction, two Parquet files containing the actual table records were added, and in a subsequent commit the two old files were removed and a third one was added as part of, say, compaction. You might be wondering how this scales, with all these tiny files being created every time you write a few records.
And the answer is Apache Spark with its parallelization of computing.
So a Delta table uses Apache Spark on top of S3 to give you the scalability. Delta Lake also automatically creates a checkpoint file that represents the state of the table after every 10 or so transactions. And that’s why, during querying, the Delta Lake API is able to read the last checkpoint, get a snapshot of the table, read only those Parquet files that are required, and return the results.
The Delta Lake API internally runs Spark jobs that run in parallel working with partitions of data stored in Delta Lake storage, which could be S3, ADLS Gen. 2, or HDFS. And that explains how it can handle huge data workloads.
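The way a transaction log of add and remove actions determines the current table, and how time travel falls out of it, can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the commits list and the live_files helper are hypothetical, the log entries carry far fewer fields than real Delta Lake entries, and checkpoints are left out.

```python
import json

# A simplified, hypothetical transaction log: one list of JSON actions per
# commit. Real Delta Lake log entries carry more fields than this.
commits = [
    ['{"add": {"path": "part-0001.parquet"}}',
     '{"add": {"path": "part-0002.parquet"}}'],
    ['{"remove": {"path": "part-0001.parquet"}}',
     '{"remove": {"path": "part-0002.parquet"}}',
     '{"add": {"path": "part-0003.parquet"}}'],
]

def live_files(commits, as_of_version=None):
    """Replay add/remove actions up to a version (inclusive) and return
    the set of data files that make up the table at that version."""
    if as_of_version is None:
        as_of_version = len(commits) - 1
    files = set()
    for actions in commits[: as_of_version + 1]:
        for line in actions:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

latest = live_files(commits)                      # state after compaction
version_0 = live_files(commits, as_of_version=0)  # "time travel" to commit 0
```

A checkpoint, in this picture, would simply be a saved copy of the file set at some version so that a reader does not have to replay the log from the very first commit.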
Now we come to data governance. Data governance refers to the set of processes, policies, standards, and guidelines that organizations put in place to ensure that their data is managed effectively, efficiently, and securely throughout its life cycle.
Data governance focuses on ensuring that data is accurate, consistent, and complete and that it meets the organization’s quality standards. It also addresses issues related to data security and privacy, such as data classification, access control, encryption, and data anonymization.
You want to ensure that sensitive data is protected from unauthorized access and that data privacy regulations are complied with. It also covers the entire data lifecycle, from data creation and capture to archival and deletion. This involves establishing policies and procedures for data retention, data backup and recovery, and data disposal.
So AWS has introduced a new service called Lake Formation, which is a fully managed service that allows users to build, secure, and manage data lakes on AWS. With Lake Formation’s tag-based approach, you can create tags that define data access policies for resources such as databases, tables, and columns. You can then assign these tags to specific AWS Identity and Access Management users or roles, allowing you to control access to data based on tags rather than resource-specific permissions.
For example, you could create a tag named, let’s say, sensitive, and assign it to a specific database or table containing sensitive data. You could then create an IAM policy that grants read-only access to that database or table or column with the sensitive tag to a specific group of users or roles.
And Lake Formation’s tag-based approach also supports hierarchical tags, allowing you to create tags with parent-child relationships. With Lake Formation’s named resource-based approach, you can create named resources such as databases, tables, and columns and then define data access policies for these resources. You can grant permissions to specific IAM users or roles for each named resource, allowing you to control access to data at a very granular level.
For example, you could create a named resource for a specific table and grant read-only access to a specific group of IAM users or roles. Lake Formation’s named resource-based approach also supports resource hierarchies, allowing you to create resources with parent-child relationships.
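The idea behind tag-based access control can be shown with a small Python sketch. To be clear, this is not the Lake Formation API. The resource names, tag keys, role names, and the can_read and readable_resources helpers are all hypothetical, and the matching rule (every key in a grant must equal the resource's tag value) is an assumption made for the example.

```python
# Hypothetical tags attached to catalog resources (database.table.column).
resource_tags = {
    "sales.orders.amount": {"classification": "public"},
    "sales.customers.email": {"classification": "sensitive"},
    "sales.customers.city": {"classification": "public"},
}

# Hypothetical grants: a principal may read resources whose tags match.
grants = {
    "analyst_role": [{"classification": "public"}],
    "compliance_role": [{"classification": "public"},
                        {"classification": "sensitive"}],
}

def can_read(principal, resource):
    """Return True if any grant for the principal matches all of its
    key/value pairs against the resource's tags."""
    tags = resource_tags.get(resource, {})
    for required in grants.get(principal, []):
        if all(tags.get(key) == value for key, value in required.items()):
            return True
    return False

def readable_resources(principal):
    """List every tagged resource the principal is allowed to read."""
    return sorted(r for r in resource_tags if can_read(principal, r))
```

For instance, can_read("analyst_role", "sales.customers.email") returns False here, while the compliance role can read it. Adding a new sensitive column only requires tagging it, with no change to the grants, which is the appeal of the tag-based approach over named-resource permissions.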
Azure Data Share is an equivalent service provided by Microsoft Azure that enables organizations to share data with internal and external partners securely and efficiently. The shared data can be in any format, including files, folders, and databases, and can be accessed by recipients through their own Azure storage accounts. Azure Data Share provides granular access control over the shared data, allowing data owners to control access, set expiration dates, and revoke access to data that they own at any time.
It also provides features such as activity logs, notifications, and monitoring to help data owners keep track of data-sharing activities. What we are essentially saying is that enterprises should formulate a strategy for coming up with one or more data products, meaning each data product will be a collection of databases, tables, and columns. It could be a single database, a group of tables, or a group of columns. Then come up with clear ownership on who has the rights to create and update a particular data product and who can read the data.
So with that background, we come to the modern data architecture. We would start with a couple of managed data architecture solutions on Azure followed by AWS.
So in this diagram, in step one, data is uploaded from the different data sources to the data landing zone.
In Step 2, the arrival of the data files could trigger Azure Data Factory, which is now part of Azure Synapse, to process the data and store it in the Data Lake in the Core Data zone. Alternatively, you could have an ADF pipeline itself extract data from a connected data source. In step three, Azure Data Lake stores the raw data that is obtained from the different sources.
In Step 4, the arrival of data in the Data Lake could trigger another Azure Synapse pipeline. Azure Synapse is a single service that allows you to ingest, transform, manage, and serve data for your business intelligence and machine learning needs. It is actually a package of several tools.
For example, it incorporates Azure Data Factory for the ETL pipelines and Apache Spark, if you’re using the Spark pool. There is also the dedicated SQL pool and the serverless SQL pool, as shown in the diagram. The dedicated SQL pool gives you a highly partitioned and scalable RDBMS that can be used as the data warehouse, and the serverless pool is a scalable engine to query your ADLS storage directly.
In this example, it runs a Spark job or notebook. Azure Synapse pipelines convert the data from the Bronze zone to the Silver zone and then to the Gold zone.
In Step 5, the diagram shows the Spark job or notebook that runs the data processing job. Note that the data curation or a machine learning training job could also run in Spark. In step 6, a serverless SQL pool creates external tables that use the data stored in the Delta Lake. The serverless SQL pool provides a powerful and efficient SQL query engine and can support traditional SQL user accounts or Azure Active Directory user accounts.
In step 7, Power BI connects to the serverless SQL pool to visualize the data. It creates reports or dashboards using the data in the Data Lake House. In step 8, data analysts or scientists can log in to Azure Synapse Studio to further enhance the data, analyze it to gain business insight, or train a machine learning model.
In step 9, business applications connect to a serverless SQL pool and use the data to support other business operation requirements. In summary, the slide shows you how data can be ingested and transformed using powerful parallelized engines and then published for access through a distributed storage.
I think I missed one slide on three different architectures, yeah, I think this is one I missed. So let me just give you this because this is an important slide. So this diagram actually shows three data architectures. The 1st is the traditional data warehouse approach where structured data from your CRM, warehouse management system, and so on would be transformed via ETL and loaded into a data warehouse with possibly facts, dimensions, and measures which are then queried and dashboards are built.
So that’s a pretty common standard type of architecture. The second block features a data lake and a data warehouse with a data virtualization approach included. Here, the first thing to note is that you are not just ingesting structured data, but unstructured and semi-structured data as well.
They’re ingested into the data lake in the form of Parquet files, Parquet being a columnar storage format with compression. It needs to be noted that Parquet supports schema evolution, which allows newer files for a table to add or remove columns without having to rewrite the files that are already there. This is particularly useful in environments where data schemas may evolve over time.
You see two subsequent processes, one that builds a data warehouse via ETL just like before. The 2nd is the building of a logical data warehouse, also called data virtualization. Many products, such as Azure Synapse from Microsoft or Redshift from Amazon, or Athena again from Amazon, allow you to create what are called external tables without storage, but instead link them to these parquet files in the data lake.
You can also see that the ingested data is being used for machine learning pipelines in order to enrich the tables either in the data lake or in the data warehouse. So this is what I was mentioning. Interestingly, we recently encountered a use case where a data warehouse like Redshift was one of the data sources for a data lake architecture, which is a bit of a deviation from the above architecture. In this particular diagram, the data warehouse is not shown as a data source.
So the third block in the diagram above features a data lake house with a medallion approach, something that was put forth by Databricks. Here, the main difference from the second approach is the introduction of the Bronze, Silver, and Gold data layers, plus the introduction of data governance.
The bronze is for raw data, Silver for the refined data, and Gold for the final trusted and curated data.
For example, the bronze data could be your point-of-sale (POS) data, transactional data from suppliers, or social media mentions of your products, and so on. Silver data could be your FMD, that is facts, measures, and dimensions, like sales data filtered by product category or region, supplier performance metrics, and so on.
And the gold could be your customer segmentation based on purchasing behavior, demand forecasting for specific products or regions, and recommendations for related products. So those could be your gold layer. Now each of these layers could also have distinct policies as to who is allowed to access, who is allowed to create, curate, and read, which databases, which tables in that database, which columns in those tables, and so on and so forth, who can modify them and so on as part of data governance.
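The flow from bronze to silver to gold can be illustrated with a minimal Python sketch. The sample rows and the to_silver and to_gold helpers are invented for this example. In practice each layer would be a set of tables in the lake and the transformations would run as Spark jobs or pipelines.

```python
from collections import defaultdict

# Bronze: raw point-of-sale rows exactly as they arrived (values are strings,
# one row is malformed). All data here is made up.
bronze = [
    {"store": "north", "category": "toys", "amount": "12.50"},
    {"store": "north", "category": "toys", "amount": "7.50"},
    {"store": "south", "category": "books", "amount": "n/a"},
    {"store": "south", "category": "books", "amount": "30.00"},
]

def to_silver(rows):
    """Refine: cast amounts to float and drop rows that fail to parse."""
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        silver.append({**row, "amount": amount})
    return silver

def to_gold(rows):
    """Curate: total sales per category, ready for dashboards or APIs."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["category"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the bronze rows untouched means the silver and gold layers can always be rebuilt if a cleaning rule or an aggregation changes later.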
Ideally, it’s the gold layer that is exposed to APIs and applications for consumption. So that was a pretty important slide that I missed. Sorry for that. So now we continue with our data architecture. I think we covered this one.
So we are now into a real-time data architecture and a different architecture in Azure where in step one Debezium connectors can connect to different data sources and tap into changes as they happen using the Change Data Capture mechanism. In Step 2, the connectors extract change data and send the captured events to Azure Event Hubs. Event Hubs is a Microsoft streaming solution much like Amazon’s Kinesis and can receive large amounts of data from multiple sources.
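Change Data Capture boils down to replaying a stream of create, update, and delete events against a copy of the table. Here is a minimal sketch in Python. The event shape (op, key, after) is an assumption made for this example and is much simpler than what a real connector emits, and apply_changes is a hypothetical helper.

```python
# Simplified change events. The "op" codes follow the create/update/delete
# idea used by CDC tools, but this event shape is an assumption for the sketch.
events = [
    {"op": "c", "key": 1, "after": {"id": 1, "status": "new"}},
    {"op": "c", "key": 2, "after": {"id": 2, "status": "new"}},
    {"op": "u", "key": 1, "after": {"id": 1, "status": "shipped"}},
    {"op": "d", "key": 2, "after": None},
]

def apply_changes(table, events):
    """Apply change events in order to an in-memory copy of the table."""
    for event in events:
        if event["op"] in ("c", "u"):
            table[event["key"]] = event["after"]
        elif event["op"] == "d":
            table.pop(event["key"], None)
    return table

replica = apply_changes({}, events)
```

Because the events are applied in order, the replica ends up matching the source table without ever running a full extract against it.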
In step 3, Event Hubs can directly stream the data to Azure Synapse Analytics Spark pools or can send the data to an Azure Data Lake Storage landing zone in raw format. In step 4, other batch data sources can use Azure Synapse pipelines to copy the data to Data Lake Storage and make it available for processing.
In Step 5, Azure Synapse Spark pools use the fully supported Apache Spark Structured Streaming APIs to process the data. Structured Streaming is good for real-time processing. Those are APIs that support real-time use cases like aggregations, say if you want to build a table that keeps track of how many logins per minute are happening.
Or in the case of an IoT application, maybe device alarms per minute. So you want to display a dashboard that shows you the up-to-date, real-time device alarms per minute. If there are too many alarms, maybe notify the administrator, and so on and so forth.
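The per-minute aggregation described above amounts to counting events in one-minute tumbling windows. The following plain-Python sketch shows that logic on a small batch of events. It is not Spark Structured Streaming code, and the event fields, the threshold, and the helper names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

def alarms_per_minute(events):
    """Count events in one-minute tumbling windows keyed by window start."""
    counts = Counter()
    for event in events:
        window_start = event["ts"].replace(second=0, microsecond=0)
        counts[window_start] += 1
    return dict(counts)

def windows_over_threshold(counts, threshold):
    """Return the window starts whose count exceeds the threshold."""
    return sorted(w for w, n in counts.items() if n > threshold)

# Hypothetical device alarm events.
events = [
    {"device": "d1", "ts": datetime(2023, 5, 1, 10, 0, 5)},
    {"device": "d2", "ts": datetime(2023, 5, 1, 10, 0, 40)},
    {"device": "d1", "ts": datetime(2023, 5, 1, 10, 0, 59)},
    {"device": "d3", "ts": datetime(2023, 5, 1, 10, 1, 10)},
]
counts = alarms_per_minute(events)
noisy = windows_over_threshold(counts, threshold=2)
```

A streaming engine does the same grouping continuously as events arrive, and the windows flagged in noisy are the ones where you might notify the administrator.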
The data processing step incorporates data quality checks and high-level business rule validations. In step 6, Data Lake Storage stores the validated data in the open Delta Lake format.
Delta Lake provides ACID semantics and transactions, scalable metadata handling, and unified streaming and batch data processing for data stored in ADLS Gen. 2.
Like I said before, Delta Lake works on top of storage mechanisms like S3, ADLS, and HDFS.
In step 7, data from the Data Lake Storage validated zone is transformed and enriched with more rules into its final processed state and loaded into a dedicated SQL pool for running large-scale analytical queries.
So the dedicated SQL pool from Synapse is one service that allows you to run very large-scale analytical queries very quickly. The only downside of a dedicated SQL pool is that you need to provision it in advance, whereas the serverless SQL pool is pay-as-you-go and can autoscale.
The dedicated SQL pool is equivalent to the traditional SQL Data Warehouse, giving you automatic partitioning, columnstore indexes, dedicated computing resources, and many other goodies.
And note that the serverless SQL pool is a pay-as-you-go service that is more suited for ad hoc queries. In step 8, Power BI uses the data exposed through the dedicated SQL pool to build enterprise-grade dashboards and reports.
In step 9 you can also use the captured raw data in the Data Lake store landing zone or the validated data in the delta format for further ad hoc and exploratory analysis through Azure Synapse SQL Serverless pools and do machine learning through Azure ML.
In step 10, for some low latency requirements, you might need data to be denormalized for single-digit server latencies. These are all special use cases, but it’s just that the diagram shows you that these are also available.
So this usage scenario is primarily for API responses, quick API responses. The scenario queries documents in a NoSQL data store such as Azure Cosmos DB for single-digit millisecond responses.
And in step 11, you can augment the solution by indexing the data that the APIs need to access with Azure Cognitive Search, which allows you to create search indexes over data stored in a variety of databases like Cosmos DB, Azure Data Lake, Azure SQL Database and so on and so forth.
In summary, this covers an architecture with all the various options using Azure services thrown in.
OK, so now we go to AWS. One thing to realize is that irrespective of the provider, whether it’s AWS or Azure or GCP, the solution architecture generally involves a data lake, an ETL tool, an MLOps pipeline (we will come to MLOps shortly), a query processing engine, a data warehouse, which is optional if you can afford to query the data lake itself, and Identity and Access Management to control the access.
So that’s pretty much it, you know, irrespective of the provider. So in this case, in step one, data stored in the Amazon S3 data lake is crawled using an AWS Glue crawler, which is a managed service that can discover and catalog your metadata.
It’s part of the AWS Glue service, which is the ETL service within AWS, much like Azure Data Factory. So AWS Glue is the de facto standard for ETL in AWS. In Step 2, the crawler infers the metadata of data on Amazon S3 and stores it in the form of a database and tables in the AWS Glue Data Catalog.
In step 3, you register the Amazon S3 bucket as a data lake location with Lake Formation. There exists a data catalog in Lake Formation as well, which is kept in sync with the one in AWS Glue. So data catalogs are always useful. You can look up where your data is, what the partitions are, what the storage format is, and so on and so forth. In Step 4, you use Lake Formation to grant permissions at the database, table, and column levels to defined AWS IAM roles.
In step five, you create external schemas within Amazon Redshift, a data warehouse in AWS, to manage access for marketing and finance teams. It must be noted that the external schemas in Redshift don’t have native storage, as they simply link to the S3 storage.
In step 6, you provide access to the marketing and finance groups to their respective external schemas and associate the appropriate IAM roles to be assumed. The admin role and admin group are limited to administration work.
In step 7, Marketing and Finance users now can assume their respective IAM roles and query data using the SQL Query editor to their respective external schemas inside Amazon Redshift. Of course, the data could also be accessed using AWS Athena which is a pay-as-you-go autoscale type of service.
OK, AI/ML. So I’m sorry, I think I skipped one. Yeah, it’s again a different look at modern data architecture in AWS.
Here you notice that there are three AWS accounts namely a data producer account, a central Federated governance account, and a data consumer account. There could also be more than one data consumer account if you can make your suppliers and customers also part of the ecosystem.
So in step one, data source locations hosted by the producer are created within the data producer’s AWS Glue Data Catalog and registered with Lake Formation. In Step 2, when a data set is presented as a product, producers create Lake Formation Data Catalog entities, which are databases, tables, columns, and attributes, within the central governance account. So this makes it easy to find and discover catalogs across consumers.
However, this doesn’t grant any permission rights to catalogs or data to all accounts or consumers, and all grants are to be authorized by the producer. In Step 3, the central Lake Formation Data Catalog shares the Data Catalog resources back to the producer account with required permissions via Lake Formation resource links to metadata tables and databases. In Step 4, Lake Formation permissions are granted in the central account to the producer role persona, such as the Data Engineer role, to manage the schema changes and perform data transformations like ALTER, DELETE, and UPDATE on the central data catalog.
In Step 5, producers accept the resource shared from the central governance account so that they can make changes to the schema at a later time. In step 6, data changes made within the producer account are automatically propagated into the central governance copy of the catalog.
In Step 7, based on a consumer access request and the need to make data visible in the consumer’s AWS Glue Data Catalog, because that’s where consumers look up data products, the central account owner grants Lake Formation permissions to a consumer account based on either direct entity sharing or tag-based access controls. We covered tag-based access control in Lake Formation.
It can be used to administer access via controls like data classification and cost centers. So you can have a tag for cost centers, a tag for data classes, or a tag for environments like dev, prod, and so on.
In step 8, Lake Formation in the consumer account can define access permissions on these data sets for local users to consume. Users in the consumer account, like data analysts and data scientists, can query this data using their chosen tools such as AWS Athena or Redshift.
So the key takeaway here is how Lake Formation is able to regulate access to various data products generated by data producers in your enterprise and make them available for data analysts and scientists to build products and services.
So let’s take a closer look at how AI/ML revolutionizes the modern data pipeline.
We talked about generating new insights from data through AI/ML. Therefore, enterprises must have a strategic plan for the different AI/ML models that they need as part of the data architecture. However, building AI/ML applications is not as simple as a traditional web or mobile app. As you can see, there are all these faces before the model can be used for prediction. There’s the data ingestion, then the preparation, then feature engineering. If you’re not using a deep learning model, model training, model evaluation and tuning, deployment, inference, and then monitoring the model and retraining. You know if there’s a drift.
So there are several challenges one faces while building machine learning models. You will need some sort of machine learning operations pipeline that enables retraining and fine-tuning. Your pipeline must be scalable, similar to the scale that we have been discussing all through.
There has to be governance, like who has access to what, and you have a myriad of tools like TensorFlow, PyTorch, scikit-learn, and Keras for the machine learning part and a set of frameworks such as SageMaker, Azure ML, and the open-source MLflow for packaging and deployment. MLflow is an open-source platform for the complete machine learning lifecycle, which includes experiment tracking, reproducible runs, and model packaging and deployment.
MLflow Tracking allows you to track experiments and the parameters, code, and data associated with each experiment. With MLflow Tracking, you can log metrics and artifacts associated with each run, making it easy to compare and reproduce experiments. An MLflow Project is a standard format for packaging data science code in a reusable and reproducible way. An MLflow Project is essentially a directory containing code, a conda environment file, and any other resources that the code needs to run.
With MLflow Projects, you can track the version of your code, dependencies, and input data, and easily reproduce your experiments. An MLflow Model is a standard format for packaging machine learning models, including the code, dependencies, and artifacts needed to serve the model. It supports PyTorch, TensorFlow, scikit-learn, and ONNX formats, and MLflow Models can be deployed to a variety of target environments, for example as a REST API, as a Docker image on target systems like SageMaker and Azure ML, and even on edge devices like Raspberry Pi.
MLflow Registry is a central repository for managing models, including versioning and access control. With MLflow Registry, you can store and manage multiple versions of your models, track their lineage, and ensure that your models are deployed consistently across different environments.
And in MLflow, a model is a versioned artifact that includes a trained machine learning model and any additional artifacts needed for deployment, such as configuration files or other dependencies. MLflow models can be saved in different formats, such as Python functions, Docker containers, or other formats like ONNX, TensorFlow, and PyTorch, that can be used for inference in a variety of environments, including cloud services or edge devices.
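The registry idea, versioned models that move between stages, can be pictured with a small in-memory sketch. This is a toy written only for illustration and is not the MLflow API; the class names, the stage names, and the artifact paths are assumptions.

```python
# Toy in-memory model registry; illustrates versioning and promotion only.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str      # where the packaged model would live
    stage: str = "None"    # hypothetical stages: "Staging", "Production", "Archived"


@dataclass
class ToyRegistry:
    models: Dict[str, List[ModelVersion]] = field(default_factory=dict)

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        """Store a new version; version numbers start at 1 and increase."""
        versions = self.models.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, artifact_uri)
        versions.append(entry)
        return entry

    def promote(self, name: str, version: int, stage: str) -> ModelVersion:
        """Move one version into a stage, archiving whatever held it before."""
        for entry in self.models[name]:
            if entry.stage == stage:
                entry.stage = "Archived"
        target = self.models[name][version - 1]
        target.stage = stage
        return target

    def current(self, name: str, stage: str) -> Optional[ModelVersion]:
        """Return the version currently in the given stage, if any."""
        for entry in self.models[name]:
            if entry.stage == stage:
                return entry
        return None


registry = ToyRegistry()
registry.register("churn", "models/churn/1")
registry.register("churn", "models/churn/2")
registry.promote("churn", 1, "Production")
registry.promote("churn", 2, "Production")  # version 1 gets archived
print(registry.current("churn", "Production").version)
```

Because deployments ask the registry for "whatever is in Production" rather than for a file path, every environment picks up the same approved version.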
So training a machine learning model from scratch can be a time-consuming process and very resource intensive as well. So that’s why prebuilt AI services from both Microsoft and Amazon allow you to skip the training process and start using the AI right away. These services are built on top of cutting-edge machine learning technologies and are designed to provide high performance and accurate results.
And they’re designed to scale seamlessly to handle large amounts of data and high volumes of requests. Building your own models from scratch may require significant engineering effort to achieve similar levels of scalability. That said, there are people who believe that building your own ML model saves you cost in the long run, as there are infrastructures like Google Colab, Azure Notebooks, and Amazon SageMaker notebooks that charge you for the resources used for training.
So you can use those resources for training on GPUs and such. You could also build your own local Kubernetes cluster where you can set up a training pipeline using open-source technologies like MLflow, Apache Spark, and PyTorch, provided you have the hardware to go with it.
So this shows the MLOps pipeline on Azure. In step one, you bring together all your structured, unstructured, and semi-structured data, such as logs, files, and media, into Azure Data Lake Storage Gen2. In Step 2, you use Apache Spark in Azure Synapse Analytics to clean, transform, and analyze datasets.
In Step 3, build and train machine learning models in Azure ML.
Of course, you will need to create a training script for your machine learning model using the language and framework of your choice, such as Python and TensorFlow, create a compute resource to execute the script (comparable to an EC2 instance or Fargate on AWS), and then link the Azure ML workspace into the Synapse workspace before submitting the job to Azure ML. And once the training job is complete and you’re happy with the results, you can register the model.
In Step 4, you control access and authentication for data and the ML workspace with Azure Active Directory and Azure Key Vault. In step five, you deploy the machine learning model to a container using Azure Kubernetes Service. Of course, you will need to build a Docker image first using the Azure ML SDK with what is called an inference configuration, which includes the model file and the dependencies, and use the Azure Container Registry, or ACR, to store the Docker image. In step 6, using log metrics and monitoring from Azure Monitor, you can evaluate model performance, and in step seven you can retrain models as necessary in Azure Machine Learning.
In step eight, you visualize the data outputs with Power BI. Just like the prebuilt AI services from Microsoft, native AI services from AWS also allow you to skip the training process in machine learning and start using AI right away.
Of course, Amazon provides SageMaker as the Azure ML equivalent to train and build your own AI/ML model, which we’ll see in the next slide. It supports a bunch of machine learning frameworks like PyTorch, TensorFlow, and Keras, and infrastructure like Deep Learning AMIs, GPUs, CPUs, and FPGAs.
So in this particular MLOps pipeline on AWS, in step one, you create the new Amazon SageMaker Studio project based on the custom template. In Step 2, you create a SageMaker pipeline to perform data preprocessing, generate baseline statistics, and train a model. In step three, you register the model in the SageMaker Model Registry. In Step 4, the data scientist verifies the model metrics and performance and approves the model.
In Step 5, the model is deployed to a real-time endpoint in staging and, after approval, to a production endpoint. In step 6, Amazon SageMaker Model Monitor is configured on the production endpoint to detect any concept drift of the data with respect to the training baseline.
In step 7, Model Monitor is scheduled to run every hour and publishes metrics to CloudWatch.
And in step 8, a CloudWatch alarm is raised when the metrics cross a model-specific threshold, let’s say accuracy falling below 75%.
This results in an EventBridge rule starting the model build pipeline. And in step 9, the model build pipeline can also be started again for retraining by an EventBridge rule that runs on a schedule.
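The alarm-then-retrain logic in steps 7 to 9 can be sketched in plain Python. This is only an illustration of the decision, not CloudWatch or EventBridge code: the function names, the `periods` parameter, and the sample readings are assumptions, and only the 75% threshold comes from the example above.

```python
# Sketch of "alarm when accuracy drops below a threshold, then start a rebuild".
from typing import Callable, Sequence

ACCURACY_THRESHOLD = 0.75  # the 75% example from the talk


def alarm_state(hourly_accuracy: Sequence[float],
                threshold: float = ACCURACY_THRESHOLD,
                periods: int = 1) -> str:
    """Return "ALARM" when the last `periods` readings are all below threshold."""
    recent = hourly_accuracy[-periods:]
    if len(recent) == periods and all(value < threshold for value in recent):
        return "ALARM"
    return "OK"


def on_metrics(hourly_accuracy: Sequence[float],
               start_build: Callable[[], None]) -> bool:
    """Call the (hypothetical) build-pipeline starter when the alarm fires."""
    if alarm_state(hourly_accuracy) == "ALARM":
        start_build()
        return True
    return False


triggered = []
fired = on_metrics([0.82, 0.79, 0.71], lambda: triggered.append("model-build"))
print(fired, triggered)
```

Requiring more than one breaching period (for example `periods=3`) is a common way to avoid retraining on a single noisy hour.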
So now we come to the last section of this webinar which is the use cases.
Let’s look at some use cases of this architecture in different industries. In the interest of time, I might just focus on a subset of those listed in the slides. So if you consider a medallion architecture inside your data lake, you know we talked about bronze, silver, and gold. The bronze data might include the point of sale data, the transaction data from suppliers, and social media mentions of your food products.
And the silver data might include facts, measures, and dimensions like sales data by product category or region, supplier performance metrics, and maybe customer demographic data and so on.
And gold data is typically derived from silver data through advanced analytics and machine learning algorithms and might include things like customer segmentation based on purchasing behavior, demand forecasting, or recommendations for new products based on customer preferences.
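As a rough illustration of the bronze, silver, and gold layers, here is a small Python sketch. The records, field names, and cleaning rules are made up for the example, and the gold step uses a simple "favourite category by spend" as a stand-in for the ML-derived products mentioned above.

```python
# Toy medallion flow: raw (bronze) -> cleaned (silver) -> derived (gold).
from collections import defaultdict

# Bronze: point-of-sale rows exactly as they arrived (hypothetical data).
bronze = [
    {"txn_id": "t1", "customer": "alice", "category": "Snacks", "amount": "4.50"},
    {"txn_id": "t1", "customer": "alice", "category": "Snacks", "amount": "4.50"},  # duplicate
    {"txn_id": "t2", "customer": "alice", "category": "drinks", "amount": "2.00"},
    {"txn_id": "t3", "customer": "bob", "category": "drinks", "amount": "3.00"},
    {"txn_id": "t4", "customer": "bob", "category": "snacks", "amount": "n/a"},     # bad value
    {"txn_id": "t5", "customer": "alice", "category": "snacks", "amount": "3.00"},
]


def to_silver(rows):
    """Deduplicate by transaction id, normalise categories, parse amounts."""
    seen, silver = set(), []
    for row in rows:
        if row["txn_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        seen.add(row["txn_id"])
        silver.append({
            "txn_id": row["txn_id"],
            "customer": row["customer"],
            "category": row["category"].lower(),
            "amount": amount,
        })
    return silver


def to_gold(silver):
    """Derive each customer's favourite category by total spend."""
    spend = defaultdict(lambda: defaultdict(float))
    for row in silver:
        spend[row["customer"]][row["category"]] += row["amount"]
    return {customer: max(cats, key=cats.get) for customer, cats in spend.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(silver), gold)
```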
For example, if the customer likes a particular product, we can recommend new products based on an AI/ML model. Such a data architecture could expose API endpoints, such as one that could be used to offer these suggestions to a new customer once they select a particular dish.
We are currently executing a project that is capable of extracting hundreds of custom attributes from a patient EMR document through an AI/ML model and storing them as columns in a table in Delta Lake for subsequent querying.
While we could have opted for AWS or Azure as our managed services provider, due to the customer’s specific requirement, we decided to build an on-premise solution based on Delta Lake, Apache Spark, and MLflow. Actually, all three are open-source products, so we built a Kubernetes cluster that runs all of these, as opposed to something on the cloud.
The advantage was obviously the savings in cost since the company had already invested in a data center. But the disadvantage was the long training times due to the kind of hardware that was made available to us which is not GPU based. Something you want to keep in mind. We have plans to collaborate with the customer further in the areas of predictive analytics and fraud detection as well.
Education – A few years ago we implemented an AI/ML model that would predict student outcomes based on the major and minor selection during counseling. Of course, the scalability and accuracy of models were a major concern back then. With the kind of tools that are available now and the availability of cloud-based infrastructures, that’s no longer a concern these days. And I believe that predictive analytics would help with the early identification of students at risk and proper resource allocation to help such students, which would be a great boon to the education sector.
Applications in manufacturing – Predictive maintenance is probably the number one use case in this sector. Companies can save tons of money if they’re able to predict in advance the probability of failure a few days or weeks or months from now. Sensor data from machines could be ingested and processed to optimize production processes.
You could have a dashboard that shows the defects per minute or per hour in real time. By analyzing data on inventory, demand, and shipping times, companies can optimize the supply chain, reduce costs, and improve efficiency. By tracking materials and products through the supply chain using RFID and other technologies, companies can gain greater visibility into their operations, improving efficiency and reducing waste.
Transportation – Google provides real-time traffic data for many cities around the world. This data is publicly available and can be used to analyze traffic patterns in specific areas. You could bring that into your data lake. For example, by analyzing data on such traffic patterns, weather patterns, and road conditions, companies can optimize routes, reducing travel time and fuel costs.
Vehicles fitted with devices can send time series data that can be ingested through a message broker and processed in real time to predict when maintenance is necessary, reducing overall downtime.
In equipment maintenance, these are pretty standard use cases, especially predictive maintenance which is currently being deployed by many companies in the maintenance industry. By analyzing data on parts usage and equipment failures, companies can optimize their parts inventory, reducing inventory costs and improving maintenance efficiency.
By analyzing data on equipment failures, the root causes of failures can be determined, enabling companies to address underlying issues and prevent future failures. Opportunities to improve equipment efficiency and reduce energy consumption can be identified by analyzing data on equipment performance.
So in conclusion, modern data architectures for the cloud have brought significant advancements to data management and analysis. The use of data lakes provides a scalable and cost-effective approach to storing and processing large amounts of data. Extract, Load, and Transform (ELT) techniques have replaced the traditional Extract, Transform, Load (ETL) approach, enabling faster data ingestion and integration with cloud-based data warehouses.
AI/ML technologies have made it possible to extract meaningful insights from complex data sets and visualization tools make it easier for decision-makers to understand and act on the results. However, with the increased use of data comes the need for strong data governance practices. Organizations must prioritize data privacy, security, and ethical use, especially when handling sensitive information.
It is essential to implement strict policies and processes to ensure compliance with regulatory requirements and ethical standards.
In summary, modern data architectures offer significant advantages for businesses looking to leverage data to gain insights and make informed decisions. By incorporating data governance principles into their data management practices, organizations can fully realize the benefits of modern data architectures while protecting their data and preserving the trust of their customers and stakeholders.
With that, we come to the end of this webinar. There’s a lot of opportunity in the area of building cost-effective data-driven solutions that involve ingesting, transforming, and visualizing data at scale and generating data products that can be consumed by employees and partners with proper governance.
I hope you gained enough insight into the topic. Let me see if I have any questions out there in the chat.
OK, I’ll answer a couple of questions. How are exceptions handled in modern data architectures? In modern data architectures, exceptions are typically handled through a combination of monitoring, alerting, and automated workflows. Exception handling often starts with monitoring the data pipeline for errors or unexpected events. This can be done using monitoring tools. If you’re on AWS, for instance, Amazon CloudWatch is the perfect place, where you could trigger a workflow if a certain metric crosses a threshold or if there’s an exception. So that is the nature of exception handling for such pipelines.
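A minimal sketch of that pattern in Python: each pipeline step runs inside a wrapper that catches the exception and hands it to an alert hook, which is where a real system would publish a metric or start a workflow. The step names, the `alert` callback, and the sample payload are all hypothetical.

```python
# Sketch: run pipeline steps, and route any exception to an alert hook.
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], Dict]]


def run_pipeline(steps: List[Step], payload: Dict,
                 alert: Callable[[str, Exception], None]) -> Tuple[Dict, bool]:
    """Run steps in order; on the first failure, alert and stop.

    Returns the last good payload and a success flag."""
    for name, step in steps:
        try:
            payload = step(payload)
        except Exception as exc:  # broad on purpose: any failure should alert
            alert(name, exc)
            return payload, False
    return payload, True


def ingest(data: Dict) -> Dict:
    return {**data, "rows": [1, 2, 3]}


def transform(data: Dict) -> Dict:
    # Hypothetical failure: a required column is missing.
    raise KeyError("missing column 'price'")


alerts: List[str] = []
result, ok = run_pipeline(
    [("ingest", ingest), ("transform", transform)],
    {"source": "pos"},
    alert=lambda step, exc: alerts.append(f"{step}: {type(exc).__name__}"),
)
print(ok, alerts)
```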
I’ll probably answer one more question. OK, so this is an interesting question: how do I visualize the throughput per hour or per day for each of the cutting machines on my shop floor?
OK, so I think I mentioned structured streaming. I presume there would be some sensor per machine that would send data to the data lake via Amazon Kinesis or something similar. We could run a PySpark-based process that listens to the Kinesis stream and does aggregations, probably using the window function, where you specify the duration, like one hour if you want to do the aggregation per hour. It will package up all the messages in that hour, and the result can be stored in a table in Delta Lake. Then you can do some further aggregations as a post-process and visualize the result in a dashboard. So I would use that as the mechanism.
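The windowing step can be shown without a Spark cluster. The sketch below is plain Python, not PySpark or Kinesis code; the machine names and timestamps are made up, and it assumes each event is simply one finished piece reported by a machine. It buckets events into fixed one-hour windows per machine, which is the kind of grouping a one-hour time window performs.

```python
# Tumbling-window throughput per machine, in plain Python for illustration.
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 3600  # one-hour windows


def window_start(ts: datetime, size: int = WINDOW_SECONDS) -> datetime:
    """Round a timestamp down to the start of its window (UTC)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % size, tz=timezone.utc)


def throughput(events: Iterable[Tuple[str, datetime]],
               size: int = WINDOW_SECONDS) -> Dict[Tuple[str, datetime], int]:
    """Count events per (machine, window start)."""
    counts: Counter = Counter()
    for machine, ts in events:
        counts[(machine, window_start(ts, size))] += 1
    return dict(counts)


def at(hour: int, minute: int) -> datetime:
    # Hypothetical production day, timestamps in UTC.
    return datetime(2023, 5, 1, hour, minute, tzinfo=timezone.utc)


events = [
    ("cutter-1", at(9, 5)),
    ("cutter-1", at(9, 40)),
    ("cutter-2", at(9, 59)),
    ("cutter-1", at(10, 10)),
]

per_hour = throughput(events)
for (machine, start), count in sorted(per_hour.items()):
    print(machine, start.isoformat(), count)
```

Changing `size` to 86400 gives per-day throughput from the same events, so the dashboard can offer both views.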
And there’s one last question on Azure. I’ll take that also, which is how you decide whether to choose a dedicated SQL pool versus a serverless SQL pool.
Yeah, if your workload is, let’s say, unpredictable and it fluctuates frequently, I would say a serverless SQL pool may be a better choice because you only pay for the resources you use.
And the serverless pool can automatically scale up or down as needed. With dedicated pools, on the other hand, although they provide better performance and faster query processing times, you need to dedicate resources. So if you know your workload clearly and can specify the configuration of the resources that are required, then that could be a better choice.
However, serverless SQL pools can also provide good performance if you use them appropriately and optimize your queries. So you may want to just check it out and see which suits you better. And dedicated SQL pools, like I said, require you to provision and pay for the resources in advance, whereas serverless SQL pools only charge you for the resources you actually use.
So I think that sums up our webinar. Please send any further questions to my e-mail. It’s email@example.com and thanks everyone for joining and thanks for your time.