CLUSTERED COLUMNSTORE INDEXES (CCI) provide extreme data compression. In Azure SQL Database and SQL Server vNext you can create CCI on tables with NVARCHAR(MAX) columns. Since JSON is stored as NVARCHAR type, now you can store huge volumes of JSON data in tables with CCI. In this post, I will show you how you can get 25x compression on a table that contains JSON/NVARCHAR(MAX) column using CCI.
In this experiment, I will use publicly available ContosoDW database where we have FactOnlineSales table with 12 million records. This fact table is related to a number of dimension tables such as DimStore, DimCustomer, DimProduct, etc., as it is shown on the following figure:
Imagine that all these related tables are stored in NoSQL-style as a single JSON document. In that case we would have a single table that will have some Data NVARCHAR(MAX) column where we would store data from all related tables as JSON text – something like:
DROP TABLE IF EXISTS SalesRecords
GO
CREATE TABLE SalesRecords(
OrderNumber nvarchar(20) not null,
ProductKey int not null,
StoreKey int not null,
Date datetime not null,
CustomerKey int not null,
Data nvarchar(max) not null,
INDEX cci CLLUSTERED COLUMNSTORE
)
In this example, I will keep int key columns as separate columns and store all other columns from FactOnlineSales table and all columns from the related tables in a single Data column. The query that will de-normalize all related dimensions into a single column is shown below:
INSERT INTO SalesRecords(OrderNumber, StoreKey, ProductKey, Date, CustomerKey, Data)
SELECT FactOnlineSales.SalesOrderNumber, FactOnlineSales.StoreKey, FactOnlineSales.ProductKey, FactOnlineSales.DateKey, FactOnlineSales.CustomerKey,
FactOnlineSales.PromotionKey,
FactOnlineSales.SalesOrderLineNumber, FactOnlineSales.SalesQuantity, FactOnlineSales.SalesAmount, FactOnlineSales.CurrencyKey,
FactOnlineSales.ReturnQuantity, FactOnlineSales.ReturnAmount, FactOnlineSales.DiscountQuantity, FactOnlineSales.DiscountAmount, FactOnlineSales.TotalCost,
FactOnlineSales.UnitCost, FactOnlineSales.UnitPrice,
DimProduct.ProductName AS [Product.Name], DimProduct.ProductDescription AS [Product.Description], DimProduct.Manufacturer AS [Product.Manufacturer],
DimProduct.BrandName AS [Product.Brand], DimProduct.ClassName AS [Product.Class], DimProduct.StyleName AS [Product.Style],
DimProduct.ColorName AS [Product.Color], DimProduct.Size AS [Product.Size], DimProduct.SizeRange AS [Product.SizeRange],
DimProduct.Weight AS [Product.Weight], DimProduct.UnitCost AS [Product.UnitCost], DimProduct.UnitPrice AS [Product.UnitPrice],
DimProduct.ImageURL AS [Product.ImageURL], DimProduct.ProductURL AS [Product.URL],
DimProductSubcategory.ProductSubcategoryLabel AS [Product.SubcategoryLabel], DimProductSubcategory.ProductSubcategoryName AS [Product.SubcategoryName],
DimProductSubcategory.ProductSubcategoryDescription AS [Product.SubcategoryDescription], DimProductCategory.ProductCategoryLabel AS [Product.CategoryLabel],
DimProductCategory.ProductCategoryName AS [Product.CategoryName], DimProductCategory.ProductCategoryDescription AS [Product.CategoryDescription],
DimCustomer.CustomerLabel AS [Customer.Label], DimCustomer.Title AS [Customer.Title], DimCustomer.FirstName AS [Customer.FirstName],
DimCustomer.MiddleName AS [Customer.MiddleName], DimCustomer.LastName AS [Customer.LastName], DimCustomer.NameStyle AS [Customer.NameStyle],
DimCustomer.BirthDate AS [Customer.BirthDate], DimCustomer.MaritalStatus AS [Customer.MaritalStatus], DimCustomer.Suffix AS [Customer.Suffix],
DimCustomer.Gender AS [Customer.Gender], DimCustomer.EmailAddress AS [Customer.EmailAddress], DimCustomer.YearlyIncome AS [Customer.YearlyIncome],
DimCustomer.TotalChildren AS [Customer.TotalChildren], DimCustomer.NumberChildrenAtHome AS [Customer.NumberChildrenAtHome],
DimCustomer.Education AS [Customer.Education], DimCustomer.Occupation AS [Customer.Occupation],
DimCustomer.HouseOwnerFlag AS [Customer.HouseOwnerFlag], DimCustomer.AddressLine1 AS [Customer.AddressLine1],
DimCustomer.NumberCarsOwned AS [Customer.NumberCarsOwned], DimCustomer.AddressLine2 AS [Customer.AddressLine2],
DimCustomer.Phone AS [Customer.Phone], DimCustomer.CompanyName AS [Customer.CompanyName], DimGeography_1.CityName AS [Customer.CityName],
DimGeography_1.StateProvinceName AS [Customer.StateProvinceName], DimGeography_1.RegionCountryName AS [Customer.RegionCountryName],
DimGeography_1.ContinentName AS [Customer.ContinentName], DimGeography_1.GeographyType AS [Customer.GeographyType],
JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimGeography_1.Geometry.STX,',', DimGeography_1.Geometry.STY,']}}')) AS [Customer.Geometry],
DimCurrency.CurrencyName AS [Currency.Name], DimCurrency.CurrencyDescription AS [Currency.Description],
DimCurrency.CurrencyLabel AS [Currency.Label], DimPromotion.PromotionLabel AS [Promotion.Label], DimPromotion.PromotionName AS [Promotion.Name],
DimPromotion.PromotionDescription AS [Promotion.Description], DimPromotion.DiscountPercent AS [Promotion.DiscountPercent],
DimPromotion.PromotionType AS [Promotion.Type], DimPromotion.PromotionCategory AS [Promotion.Category], DimPromotion.StartDate AS [Promotion.StartDate],
DimPromotion.EndDate AS [Promotion.EndDate], DimPromotion.MinQuantity AS [Promotion.MinQuantity], DimPromotion.MaxQuantity AS [Promotion.MaxQuantity],
DimStore.StoreName AS [Store.Name], DimStore.StoreDescription AS [Store.Description], DimStore.StoreManager AS [Store.Manager],
DimStore.StoreType AS [Store.Type], DimStore.Status AS [Store.Status], DimStore.OpenDate AS [Store.OpenDate], DimStore.CloseDate AS [Store.CloseDate],
DimStore.ZipCode AS [Store.ZipCode], DimStore.ZipCodeExtension AS [Store.ZipCodeExtension], DimStore.StorePhone AS [Store.Phone],
DimStore.StoreFax AS [Store.Fax], DimStore.AddressLine1 AS [Store.AddressLine1], DimStore.AddressLine2 AS [Store.AddressLine2],
JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimStore.Geometry.STX,',', DimStore.Geometry.STY,']}}')) AS [Store.Geometry],
JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimStore.GeoLocation.STX,',', DimStore.GeoLocation.STY,']}}')) AS [Store.GeoLocation],
JSON_QUERY(dbo.geometry2json(DimStore.Geometry)) AS [Store.Geometry], DimGeography.CityName AS [Store.CityName],
DimGeography.StateProvinceName AS [Store.StateProvinceName], DimGeography.RegionCountryName AS [Store.RegionCountryName],
DimGeography.ContinentName AS [Store.ContinentName], DimGeography.GeographyType AS [Store.GeographyType],
JSON_QUERY(CONCAT('{"type": "Feature","geometry": {"type": "Point","coordinates": [',DimGeography.Geometry.STX,',', DimGeography.Geometry.STY,']}}')) AS [Store.Geo.Location],
DimGeography.GeographyKey AS [Store.GeographyKey], DimEntity.EntityLabel AS [Store.Entity.Label],
DimEntity.EntityName AS [Store.Entity.Name], DimEntity.EntityDescription AS [Store.Entity.Description], DimEntity.EntityType AS [Store.Entity.Type],
DimEntity.Status AS [Store.Entity.Status], DimDate.FullDateLabel AS [Date.FullDateLabel], DimDate.DateDescription AS [Date.DateDescription],
DimDate.CalendarYear AS [Date.CalendarYear], DimDate.CalendarMonthLabel AS [Date.CalendarMonthLabel], DimDate.FiscalYear AS [Date.FiscalYear],
DimDate.FiscalMonth AS [Date.FiscalMonth], DimDate.FiscalYearLabel AS [Date.FiscalYearLabel], DimDate.CalendarYearLabel AS [Date.CalendarYearLabel],
DimDate.CalendarHalfYear AS [Date.CalendarHalfYear], DimDate.CalendarHalfYearLabel AS [Date.CalendarHalfYearLabel], DimDate.Datekey AS [Date.Datekey],
DimDate.CalendarQuarter AS [Date.CalendarQuarter], DimDate.CalendarQuarterLabel AS [Date.CalendarQuarterLabel],
DimDate.CalendarMonth AS [Date.CalendarMonth], DimDate.CalendarWeek AS [Date.CalendarWeek], DimDate.CalendarWeekLabel AS [Date.CalendarWeekLabel],
DimDate.CalendarDayOfWeekLabel AS [Date.CalendarDayOfWeekLabel], DimDate.CalendarDayOfWeek AS [Date.CalendarDayOfWeek],
DimDate.FiscalHalfYear AS [Date.FiscalHalfYear], DimDate.FiscalHalfYearLabel AS [Date.FiscalHalfYearLabel], DimDate.FiscalQuarter AS [Date.FiscalQuarter],
DimDate.FiscalQuarterLabel AS [Date.FiscalQuarterLabel], DimDate.FiscalMonthLabel AS [Date.FiscalMonthLabel]
FROM FactOnlineSales INNER JOIN
DimDate ON FactOnlineSales.DateKey = DimDate.Datekey INNER JOIN
DimStore ON FactOnlineSales.StoreKey = DimStore.StoreKey INNER JOIN
DimProduct ON FactOnlineSales.ProductKey = DimProduct.ProductKey INNER JOIN
DimPromotion ON FactOnlineSales.PromotionKey = DimPromotion.PromotionKey INNER JOIN
DimCurrency ON FactOnlineSales.CurrencyKey = DimCurrency.CurrencyKey INNER JOIN
DimCustomer ON FactOnlineSales.CustomerKey = DimCustomer.CustomerKey INNER JOIN
DimGeography ON DimStore.GeographyKey = DimGeography.GeographyKey INNER JOIN
DimProductSubcategory ON DimProduct.ProductSubcategoryKey = DimProductSubcategory.ProductSubcategoryKey INNER JOIN
DimProductCategory ON DimProductSubcategory.ProductCategoryKey = DimProductCategory.ProductCategoryKey INNER JOIN
DimGeography DimGeography_1 ON DimCustomer.GeographyKey = DimGeography_1.GeographyKey INNER JOIN
DimEntity ON DimStore.EntityKey = DimEntity.EntityKey
I’m joining all related dimension tables, format them as JSON text using FOR JSON clause and loading everything into SalesRecords table. This query will populate a table with CCI index.
I will also create a copy of this table but without CCI (plain heap table) using the following query:
select *
into SalesRecordsRS
from SalesRecords
Now, I will compare sizes of the table with CCI and the table without CCI using the following query:
exec sp_spaceused 'SalesRecordsRS'
exec sp_spaceused 'SalesRecords'
The results of these queries are shown below:
A table without CCI has 101.020.896 KB, while the table with CCI has only 4.047.128 KB in data column. With CCI we can compress 100GB table to 4GB with 24.96 compression ratio!
Compression is not important only for storage savings. The following query on a table with CCI this query is executed in 46 seconds, while on a heap table execution takes 13 min 45 seconds.
select min(datalength(data)), avg(datalength(data)), max(datalength(data))
from SalesRecords
Smaller disk io and batch execution provided by CCI enables you to run 18x faster queries.
Conclusion
CLUSTERED COLUMNSTORE INDEXES provide extreme data compression in SQL Server and Azure SQL Database. With NVARCHAR(MAX) support in CCI indexes you can use them on your JSON data stored is database and get high 25x compression. Therefore, CCI is a perfect solution if you need to store a large volume of JSON data in your SQL Database.
ContosoDW database is publicly available for download, so you can use this database and the scripts from this post to try this in your environment.