SQL Server 2022: Data Virtualization with External Sources
Note from the Data Whisperer
By Tom Nonmacher
SQL Server 2022 gives businesses new tools to integrate and analyze data from a variety of sources. One feature that has particularly piqued the interest of data professionals is data virtualization with external sources. Data virtualization is a technique that lets you query data from multiple disparate sources in place, without first copying it, as if it were a single unified data source. This capability is a game-changer, especially in scenarios where data is scattered across various databases, both on-premises and in the cloud.
SQL Server 2022's implementation of data virtualization is built on PolyBase, which was first introduced in SQL Server 2016. PolyBase lets T-SQL queries access data stored in external sources such as Azure SQL Database, Azure Blob Storage, and even NoSQL databases such as MongoDB. What sets SQL Server 2022 apart is its expanded support for file-based sources: it can read Parquet files and Delta Lake tables, the table format commonly produced by Databricks and other big data platforms, from Azure storage or S3-compatible object storage.
Let's take an example where we need to query data stored as a Delta Lake table in Azure Data Lake Storage. First, we need to create an external data source that points to the storage account. Here's how it could be done:
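-- SQL Server 2022 does not take a TYPE argument for this kind of source; the LOCATION prefix selects the connector
-- (adls:// for Azure Data Lake Storage Gen2). <container_name> is a placeholder for your own container.
CREATE EXTERNAL DATA SOURCE DeltaLakeStorage WITH
(LOCATION = 'adls://<container_name>@datalakestorageaccount.dfs.core.windows.net',
CREDENTIAL = SqlStorageCredential);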
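The data source above references SqlStorageCredential without showing where it comes from. As a minimal sketch, assuming PolyBase is installed on the instance and that the storage account is accessed with a shared access signature, the prerequisites could look like the following. The password and SAS token are placeholders, and these statements would need to run before the data source is created.

```sql
-- Enable PolyBase on the instance (the PolyBase feature must already be installed).
EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;

-- A database master key protects the credential secret.
-- Skip this statement if the database already has one.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong_password>';

-- SqlStorageCredential is the name the external data source refers to.
-- <sas_token> is a placeholder; supply the token without a leading '?'.
CREATE DATABASE SCOPED CREDENTIAL SqlStorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas_token>';
```

The SAS token only needs read and list permissions for querying; exporting data would additionally require write permission.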
After creating the data source, we can then create an external table that maps to the data in our Delta Lake storage. This allows us to query this data as if it were a local table in our SQL Server database.
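-- Delta tables need an external file format; DeltaTableFormat is an illustrative name.
CREATE EXTERNAL FILE FORMAT DeltaTableFormat WITH (FORMAT_TYPE = DELTA);
-- LOCATION points at the Delta table's root folder (the one containing _delta_log), not at a single file.
CREATE EXTERNAL TABLE DeltaLakeTable
(CustomerID int,
Name varchar(50),
PurchaseAmount float)
WITH
(LOCATION = '/deltalakefolder/deltalakefile',
FILE_FORMAT = DeltaTableFormat,
DATA_SOURCE = DeltaLakeStorage);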
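Once the external table exists, querying it looks like querying any other table. The sketch below assumes the DeltaLakeTable and DeltaLakeStorage objects defined above; the filter value is arbitrary and only for illustration. The second statement shows an ad hoc alternative with OPENROWSET, which reads the same Delta folder without defining an external table first.

```sql
-- Query the external table as if it were local.
SELECT TOP (10) CustomerID, Name, PurchaseAmount
FROM DeltaLakeTable
WHERE PurchaseAmount > 100          -- illustrative threshold
ORDER BY PurchaseAmount DESC;

-- Ad hoc alternative: read the Delta folder directly through the data source.
SELECT TOP (10) *
FROM OPENROWSET(
    BULK '/deltalakefolder/deltalakefile',
    FORMAT = 'DELTA',
    DATA_SOURCE = 'DeltaLakeStorage'
) AS delta_rows;
```

The external table is the better fit when the same data is queried repeatedly or joined with local tables, while OPENROWSET is convenient for one-off exploration.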
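The beauty of data virtualization in SQL Server 2022 is that it doesn't stop at querying data; you can also write query results out to external storage. This is done with CREATE EXTERNAL TABLE AS SELECT (CETAS), which exports the result of a SELECT as Parquet or delimited text files. External tables are not fully writable like local tables, however: UPDATE and DELETE are not supported against them, and Delta Lake tables are read-only.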
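For writing data out to external storage, SQL Server 2022 offers CREATE EXTERNAL TABLE AS SELECT (CETAS). The following is a minimal sketch, assuming the DeltaLakeStorage data source from earlier and a credential that grants write access; ParquetFormat, HighValueCustomers, the export folder, and the threshold are hypothetical names and values chosen for illustration.

```sql
-- CETAS requires the export option to be enabled on the instance.
EXEC sp_configure @configname = 'allow polybase export', @configvalue = 1;
RECONFIGURE;

-- File format for the exported files (illustrative name).
CREATE EXTERNAL FILE FORMAT ParquetFormat WITH (FORMAT_TYPE = PARQUET);

-- Export the result of a query as Parquet files under a new folder.
CREATE EXTERNAL TABLE HighValueCustomers
WITH (
    LOCATION = '/exports/high_value_customers/',   -- hypothetical target folder
    DATA_SOURCE = DeltaLakeStorage,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT CustomerID, Name, PurchaseAmount
FROM DeltaLakeTable
WHERE PurchaseAmount > 1000;                        -- illustrative threshold
```

The statement both writes the files and registers HighValueCustomers as an external table, so the exported data can be queried immediately afterwards.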
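Data virtualization in SQL Server 2022 is also relevant to Microsoft Fabric, Microsoft's analytics platform, which should not be confused with Azure Service Fabric, the distributed systems platform for large-scale, high-availability services. Fabric stores its tables in the open Delta format, the same format that SQL Server 2022 external tables can read. Whether querying such data through PolyBase performs well against large data sets or in high-traffic environments depends on the workload, so it is worth measuring before relying on it.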
Another intriguing development is the pairing of OpenAI with SQL Server 2022. SQL Server 2022 has no built-in OpenAI feature, so in practice this means calling an OpenAI or Azure OpenAI endpoint from application code that reads from and writes to the database. For models that run inside the database, SQL Server Machine Learning Services lets data analysts and scientists train and score models in Python or R through the sp_execute_external_script procedure, keeping the workflow within the familiar T-SQL environment.
In conclusion, data virtualization in SQL Server 2022 is a significant step forward for working with data spread across many systems. It provides a unified, flexible framework for querying data from various sources with T-SQL and for exporting results back to external storage. Combined with the surrounding ecosystem, including Azure SQL, Delta Lake tables produced by platforms such as Databricks, Microsoft Fabric, and AI services such as OpenAI, SQL Server 2022 is a powerful platform for data professionals.