I recently read about ‘affinity grouping’ (populairly known as market basket analysis) in The Data Warehouse Toolkit by Ralph Kimball. This interested me into finding an algorithm in SQL to get your usual fact table data into a affinity grouping table.

Affinity grouping gives you insight into what events correlate together. In my example I will try to get insight into what products are often sold together in the AdventureWorks example database.

As a starting point I have a transaction fact table:

This fact table measures sales quantity and gross profit per sold item. The dimensions are pretty straight forward, but it’s important to emphasise that the Sales Order Info dimension containts the sales order numbers and sales order lines.

Our goal is to get the data inside the following affinity grouping model:

This model will tell us how many a product(A) is sold in combination with product(B). The model is a daily snapshot and can also be sliced and rolled up by it’s dimension attributes.

To get my data from the transaction fact table into the affinity grouping table I thought out the following algorithm:

1. Select the order lines of a single order. Which should generate a set like this:

FK_OrderDate |
FK_SalesChannel |
FK_SpecialOffer |
FK_Product |
FK_SalesOrderInfo |
OrderQty |
GrossProfit |

20050701 | 1 | -1 | 270 | 8 | 3 | 1258.376700 |

20050701 | 1 | -1 | 268 | 8 | 1 | 419.458900 |

20050701 | 1 | -1 | 259 | 8 | 1 | 874.794000 |

20050701 | 1 | -1 | 230 | 8 | 1 | 183.938200 |

2.Select a single product from the set. In this example we take ProductID 270.

3. Correlate the product with all the other products from the order set. This should result into the following set:

FK_OrderDate |
FK_SalesChannel |
FK_SalesOrderInfo |
FK_SpecialOffer |
FK_ProductA |
FK_ProductB |
Order Qty A |
Order Qty B |
Gross Profit A |
Gross Profit B |

20050701 | 1 | 8 | -1 | 270 | 268 | 3 | 1 | 1258.376700 | 419.458900 |

20050701 | 1 | 8 | -1 | 270 | 259 | 3 | 1 | 1258.376700 | 874.794000 |

20050701 | 1 | 8 | -1 | 270 | 230 | 3 | 1 | 1258.376700 | 183.938200 |

Now we have a set of products(B) that are sold in combination with product A. We have registered the sales quantity and gross profit. This gives us the ability to find the combinations that are the most profitable.

4. Save this result set.

5. Repeat step 2-4 for all products in the order.

6.Repeat steps 1-5 for all other orders.

7. Sum the Order Qty A, Order Qty B, Gross Profit A, Gross Profit B of the entire set grouped by order date, sales channel, special offer, product A, product B and save it as a final result into your affinity grouping model.

I wrote step 1-6 in T-SQL like this:

--Declare variables

DECLARE @FK_SalesOrderInfo AS INT

DECLARE @OrientationProductID AS INT

--Build a list of baskets that we will loop

SELECT DISTINCT FK_SalesOrderInfo

INTO #SalesOrderOrientation

FROM FactSales

--Every time the first loop iterates it will take one salesorder out of the collection

SET @FK_SalesOrderInfo = (SELECT TOP(1) FK_SalesOrderInfo FROM #SalesOrderOrientation)

DELETE TOP(1) FROM #SalesOrderOrientation

While (@FK_SalesOrderInfo IS NOT NULL)

Begin

--Get the data of this basket

SELECT * into #tempBasket FROM

(

SELECT FK_OrderDate, FK_SalesChannel, FK_SpecialOffer, FK_Product, FK_SalesOrderInfo, OrderQty, GrossProfit FROM FactSales

WHERE FK_SalesOrderInfo = @FK_SalesOrderInfo

) AS Basket

--Get a list of products that are inside this basket.

--We use this list of products to determine what products from the basket we still have to process.

SELECT FK_Product INTO #ProductOrientation

FROM #tempBasket

--Every time the second loops interates it will take one product out of the collection. The delete happens at the end of the loop.

SET @OrientationProductID = (SELECT TOP(1) FK_Product FROM #ProductOrientation)

While @OrientationProductID IS NOT NULL

Begin

--This query joins the product that is in the @OrientationProductID variable with all other products inside the basket.

--On this way it creates a set of correlated products. This set is inserted into a table

INSERT INTO ProductCorrelations (FK_OrderDate, FK_SalesChannel, FK_SpecialOffer, PO.FK_SalesOrderInfo, FK_ProductA, OrderQtyA, GrossProfitA, CP.FK_ProductB, CP.OrderQtyB, CP.GrossProfitB)

SELECT LeadingProduct.FK_OrderDate, LeadingProduct.FK_SalesChannel, LeadingProduct.FK_SpecialOffer, CorrelatedProducts.FK_SalesOrderInfo,

LeadingProduct.FK_Product AS FK_ProductA, LeadingProduct.OrderQty AS OrderQtyA, LeadingProduct.GrossProfit AS GrossProfitA,

CorrelatedProducts.FK_Product AS FK_Product, CorrelatedProducts.OrderQty AS OrderQtyB, CorrelatedProducts.GrossProfit AS GrossProfitB

FROM #tempBasket AS LeadingProduct INNER JOIN #tempBasket AS CorrelatedProducts

ON LeadingProduct.FK_SalesOrderInfo = CorrelatedProducts.FK_SalesOrderInfo

WHERE LeadingProduct.FK_Product = @OrientationProductID AND CorrelatedProducts.FK_Product != @OrientationProductID

DELETE FROM #ProductOrientation WHERE FK_Product = @OrientationProductID

SET @OrientationProductID = (SELECT TOP(1) FK_Product FROM #ProductOrientation)

End

SET @FK_SalesOrderInfo = (SELECT TOP(1) FK_SalesOrderInfo FROM #SalesOrderOrientation)

DELETE TOP(1) FROM #SalesOrderOrientation

DROP TABLE #tempBasket

DROP TABLE #ProductOrientation

END

DROP TABLE #SalesOrderOrientation

You now have have a set of all products that are sold together. You now just have to aggregate the results before inserting them into your final MarketBasketAnalysis table.

We do this with the following T-SQL statement:

TRUNCATE TABLE MarketBasketAnalysis

GO

INSERT INTO MarketBasketAnalysis ([FK_OrderDate], [FK_SalesChannel], [FK_SpecialOffer], [FK_ProductA], [FK_ProductB], [OrderQtyA], [OrderQtyB], [GrossProfitA], [GrossProfitB])

SELECT [FK_OrderDate]

,[FK_SalesChannel]

,[FK_SpecialOffer]

,[FK_ProductA]

,[FK_ProductB]

,SUM([OrderQtyA]) AS OrderQtyA

,SUM([OrderQtyB]) AS OrderQtyB

,SUM([GrossProfitA]) AS GrossProfitA

,SUM([GrossProfitB]) AS GrossProfitB

FROM [dbo].[ProductCorrelations]

GROUP BY

[FK_OrderDate]

,[FK_SalesChannel]

,[FK_SpecialOffer]

,[FK_ProductA]

,[FK_ProductB]

And we are done! We can now see what products combinations are selling most often together. This information can give you better judgment on what the effect of special offers will be, what products go well into packaged deals, how to arange your store layout, or how to advertise products on your webshop.

## Concerns about the algorithm

The algorithm took about 12 minutes to complete on my laptop, which is acceptable. But the product catalogue and sales data set of AdventureWorks is not that big, I can imagine that this algorithm might take to long for the data of example a supermarket. The AdventureWorks database has 500+ products and 120.000+ order lines. This resultat into 1.984.276 sales combinations. As you either increase the number of products or the number of order lines the number of sales combinations will increase significantly.

Performance gains can be reached by reducing disk I/O by keeping more data in memory, do incremental loads, reduce the amount of products you will evaluate (take the most profitable products) or make a smarter algorithm :-).

If you have any suggestions about this feel free to post a comment!