Bulk Loading Parquet Files into SQL Server with PolyBase

By Tom Nonmacher

Welcome to SQLSupport.org! Today we'll explore how to use PolyBase to bulk load Parquet files into SQL Server. Parquet is a compressed, columnar storage format that has become popular for analytics workloads because queries can read only the columns they need. PolyBase is a SQL Server feature, introduced in SQL Server 2016, that lets you query external data through T-SQL and combine it with local tables, which makes it a natural fit for bringing Big Data sources into SQL Server.

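Before we can begin loading Parquet files, PolyBase itself must be installed and enabled on the instance. No linked server is needed. PolyBase is an optional feature selected during SQL Server setup, and it is available only in SQL Server 2016 and later, not in SQL Server 2012 or 2014. To reach Azure Blob Storage from SQL Server 2016 through 2019, the 'hadoop connectivity' option must also be set, and the SQL Server service restarted afterwards.

-- Returns 1 when the PolyBase feature is installed on this instance.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- Connectivity level 7 includes Azure Blob Storage.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;

-- SQL Server 2019 only: switch the PolyBase feature on.
EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;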

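Next, we'll need to create an external data source that points to the storage container holding our Parquet file. In this case, we're assuming that the file is stored in Azure Blob Storage, in a container named datalake under the storage account myaccount. The credential MyAzureAccount must already exist as a database scoped credential.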

CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH
( TYPE = HADOOP, LOCATION = 'wasbs://datalake@myaccount.blob.core.windows.net',
CREDENTIAL = MyAzureAccount);
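
The data source above references a credential named MyAzureAccount, which has to be created first. A minimal sketch of that prerequisite follows, assuming authentication with a storage account access key. The password and the key are placeholders to replace with your own values, and the IDENTITY string is arbitrary, since it is not used when authenticating with an account key.

```sql
-- A database master key protects the credential's secret.
-- The password is a placeholder; skip this statement if a master key already exists.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';

-- SECRET holds the storage account access key (placeholder shown).
-- IDENTITY can be any string when an account key is used.
CREATE DATABASE SCOPED CREDENTIAL MyAzureAccount
WITH IDENTITY = 'user',
     SECRET = '<storage account access key>';
```

Run these statements in the same database that will hold the external data source, before the CREATE EXTERNAL DATA SOURCE statement.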

Now we can create an external file format for Parquet files. FORMAT_TYPE is set to PARQUET, which tells PolyBase how to read the files.

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH
( FORMAT_TYPE = PARQUET);

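Once the external file format is in place, we can create an external table that reads data from our Parquet file. The column names and data types of the external table must match the schema of the Parquet file. With REJECT_TYPE = VALUE and REJECT_VALUE = 0, a query against the table fails as soon as a single row cannot be converted to the declared types.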

CREATE EXTERNAL TABLE ParquetTable
( id INT, name VARCHAR(50), age INT )
WITH
( LOCATION = '/data.parquet',
DATA_SOURCE = AzureDataLakeStore,
FILE_FORMAT = ParquetFormat,
REJECT_TYPE = VALUE, REJECT_VALUE = 0 );
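
Before loading anything, it can help to confirm that the external table resolves and returns rows. The queries below are a quick sanity check under the names used in this article, not a required step.

```sql
-- Confirm the external table is registered and see where it points.
SELECT name, location
FROM sys.external_tables;

-- Read a few rows straight from the Parquet file.
SELECT TOP (10) id, name, age
FROM ParquetTable;

-- Count the rows available for loading.
SELECT COUNT(*) AS row_count
FROM ParquetTable;
```

If the column definitions do not match the file, these queries are usually where the mismatch first shows up as an error.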

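Finally, we're ready to import the data from the Parquet file into SQL Server with an INSERT INTO ... SELECT statement. The statement below assumes that dbo.MyTable already exists with the same three columns as the external table. Listing the columns explicitly, rather than using SELECT *, keeps the load from breaking if the column order of either table changes.

INSERT INTO dbo.MyTable (id, name, age)
SELECT id, name, age
FROM ParquetTable;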
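
The load assumes that the target table dbo.MyTable already exists. The article does not show its definition, so the sketch below is an assumption: a plain table that mirrors the three columns of the external table, followed by a row count comparison to run after the load.

```sql
-- Hypothetical target table mirroring the external table's columns.
-- Run this before the INSERT.
CREATE TABLE dbo.MyTable
(
    id   INT         NULL,
    name VARCHAR(50) NULL,
    age  INT         NULL
);

-- After the INSERT: compare row counts between source and target.
SELECT
    (SELECT COUNT(*) FROM ParquetTable) AS source_rows,
    (SELECT COUNT(*) FROM dbo.MyTable)  AS loaded_rows;
```

The two counts should match when the table was empty before the load. If the target table does not need to exist beforehand, SELECT ... INTO is an alternative that creates it as part of the load.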

In conclusion, PolyBase is a powerful tool that allows SQL Server to query and load data stored in external formats such as Parquet. Keep in mind that PolyBase is only available in SQL Server 2016 and later, so none of these examples will run on SQL Server 2012 or 2014. The TYPE = HADOOP data source and wasbs:// location used here apply to SQL Server 2016 through 2019; SQL Server 2022 no longer supports Hadoop data sources and uses a different connector syntax for Azure Blob Storage, so some modifications will be required there. Stay tuned to SQLSupport.org for more insights and practical tips on SQL Server!



