Market Basket Analysis with SQL

I recently read about ‘affinity grouping’ (populairly known as market basket analysis) in The Data Warehouse Toolkit by Ralph Kimball. This interested me into finding an algorithm in SQL to get your usual fact table data into a affinity grouping table.

Affinity grouping gives you insight into what events correlate together. In my example I will try to get insight into what products are often sold together in the AdventureWorks example database.

As a starting point I have a transaction fact table:











This fact table measures sales quantity and gross profit per sold item. The dimensions are pretty straight forward, but it’s important to emphasise that the Sales Order Info dimension containts the sales order numbers and sales order lines.

Our goal is to get the data inside the following affinity grouping model:











This model will tell us how many a product(A) is sold in combination with product(B). The model is a daily snapshot and can also be sliced  and rolled up by it’s dimension attributes.

To get my data from the transaction fact table into the affinity grouping table I thought out the following algorithm:

1. Select the order lines of a single order. Which should generate a set like this:

FK_OrderDate FK_SalesChannel FK_SpecialOffer FK_Product FK_SalesOrderInfo OrderQty GrossProfit
20050701 1 -1 270 8 3 1258.376700
20050701 1 -1 268 8 1 419.458900
20050701 1 -1 259 8 1 874.794000
20050701 1 -1 230 8 1 183.938200

2.Select a single product from the set.  In this example we take ProductID 270.

3. Correlate the product with all the other products from the order set. This should result into the following set:

FK_OrderDate FK_SalesChannel FK_SalesOrderInfo FK_SpecialOffer FK_ProductA FK_ProductB Order Qty A Order Qty B Gross   Profit A Gross   Profit B
20050701 1 8 -1 270 268 3 1 1258.376700 419.458900
20050701 1 8 -1 270 259 3 1 1258.376700 874.794000
20050701 1 8 -1 270 230 3 1 1258.376700 183.938200

Now we have a set of products(B) that are sold in combination with product A. We have registered the sales quantity and gross profit. This gives us the ability to find the combinations that are the most profitable.

4. Save this result set.

5. Repeat step 2-4 for all products in the order.

6.Repeat steps 1-5 for all other orders.

7. Sum the Order Qty A, Order Qty B, Gross Profit A, Gross Profit B of the entire set grouped by order date, sales channel, special offer, product A, product B and save it as a final result into your affinity grouping model.

I wrote step 1-6 in T-SQL like this:

--Declare variables
DECLARE @OrientationProductID AS INT

--Build a list of baskets that we will loop
INTO #SalesOrderOrientation
FROM FactSales

--Every time the first loop iterates it will take one salesorder out of the collection
SET @FK_SalesOrderInfo = (SELECT TOP(1) FK_SalesOrderInfo FROM #SalesOrderOrientation)
DELETE TOP(1) FROM #SalesOrderOrientation

While (@FK_SalesOrderInfo IS NOT NULL)

--Get the data of this basket
SELECT * into #tempBasket FROM
SELECT FK_OrderDate, FK_SalesChannel, FK_SpecialOffer, FK_Product, FK_SalesOrderInfo, OrderQty, GrossProfit FROM FactSales
WHERE FK_SalesOrderInfo = @FK_SalesOrderInfo
) AS Basket

--Get a list of products that are inside this basket.
--We use this list of products to determine what products from the basket we still have to process.
SELECT FK_Product INTO #ProductOrientation
FROM #tempBasket

--Every time the second loops interates it will take one product out of the collection. The delete happens at the end of the loop.
SET @OrientationProductID = (SELECT TOP(1) FK_Product FROM #ProductOrientation)

While @OrientationProductID IS NOT NULL

--This query joins the product that is in the @OrientationProductID variable with all other products inside the basket.
--On this way it creates a set of correlated products. This set is inserted into a table
INSERT INTO ProductCorrelations (FK_OrderDate, FK_SalesChannel, FK_SpecialOffer, PO.FK_SalesOrderInfo, FK_ProductA, OrderQtyA, GrossProfitA, CP.FK_ProductB, CP.OrderQtyB, CP.GrossProfitB)
SELECT LeadingProduct.FK_OrderDate, LeadingProduct.FK_SalesChannel, LeadingProduct.FK_SpecialOffer, CorrelatedProducts.FK_SalesOrderInfo,
LeadingProduct.FK_Product AS FK_ProductA, LeadingProduct.OrderQty AS OrderQtyA, LeadingProduct.GrossProfit AS GrossProfitA,
CorrelatedProducts.FK_Product AS FK_Product, CorrelatedProducts.OrderQty AS OrderQtyB, CorrelatedProducts.GrossProfit AS GrossProfitB
FROM #tempBasket AS LeadingProduct INNER JOIN #tempBasket AS CorrelatedProducts
ON LeadingProduct.FK_SalesOrderInfo = CorrelatedProducts.FK_SalesOrderInfo
WHERE LeadingProduct.FK_Product = @OrientationProductID AND CorrelatedProducts.FK_Product != @OrientationProductID

DELETE FROM #ProductOrientation WHERE FK_Product = @OrientationProductID
SET @OrientationProductID = (SELECT TOP(1) FK_Product FROM #ProductOrientation)

SET @FK_SalesOrderInfo = (SELECT TOP(1) FK_SalesOrderInfo FROM #SalesOrderOrientation)
DELETE TOP(1) FROM #SalesOrderOrientation
DROP TABLE #tempBasket
DROP TABLE #ProductOrientation

DROP TABLE #SalesOrderOrientation

You now have have a set of all products that are sold together. You now just have to aggregate the results before inserting them into your final MarketBasketAnalysis table.

We do this with the following T-SQL statement:

TRUNCATE TABLE MarketBasketAnalysis
INSERT INTO MarketBasketAnalysis ([FK_OrderDate], [FK_SalesChannel], [FK_SpecialOffer], [FK_ProductA], [FK_ProductB], [OrderQtyA], [OrderQtyB], [GrossProfitA], [GrossProfitB])
SELECT [FK_OrderDate]
,SUM([OrderQtyA]) AS OrderQtyA
,SUM([OrderQtyB]) AS OrderQtyB
,SUM([GrossProfitA]) AS GrossProfitA
,SUM([GrossProfitB]) AS GrossProfitB
FROM [dbo].[ProductCorrelations]

And we are done! We can now see what products combinations are selling most often together. This information can give you better judgment on what the effect of special offers will be, what products go well into packaged deals, how to arange your store layout, or how to advertise products on your webshop.

Concerns about the algorithm

The algorithm took about 12 minutes to complete on my laptop, which is acceptable. But the product catalogue and sales data set of AdventureWorks is not that big, I can imagine that this algorithm might take to long for the data of example a supermarket.  The AdventureWorks database has 500+ products and 120.000+ order lines. This resultat into 1.984.276 sales combinations. As you either increase the number of products or the number of order lines the number of sales combinations will increase significantly.

Performance gains can be reached by reducing disk I/O by keeping more data in memory, do incremental loads,  reduce the amount of products you will evaluate (take the most profitable products) or make a smarter algorithm :-).

If you have any suggestions about this feel free to post a comment!


, , , ,

  1. Link Resource # 54 : May 11 – May 17 « Dactylonomy of Web Resource

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: