Tuesday, October 19, 2010

Majority Late Arriving Fact Lookups in SSIS

Usually when I load data into a data warehouse I retrieve only the changes.  Since changes are normally applied to the most recent records doing a lookup on the natural key of the current record and a partial lookup for any that are not associated with that record for type 2 works out well.  I recently had a situation where I needed to reprocess the entire table for every run.  We won’t go into why this was the case.  Needless to say it’s not good.  Consequently performance was horrendous because 70% of the lookups were partial. 
My solution was to use a Merge and Conditional Split to look at the entire dimension table.
image
Let’s start with the dimension (OLE_SRC Dimension).  We’ll use DimStudent as the dimension.  Here’s the query I used
Select StudentID, StudentNaturalKey, EffectiveStartDate, 
COALESCE((SELECT MIN(EffectiveStartDate) FROM DW.DIMstudent where 
EffectiveStartDate>s.EffectiveStartDAte and StudentNaturalKey=s.StudentNaturalKey),'12/31/2099') NextEffectiveStartDate
FROM DW.DimStudent s
ORDER BY StudentNaturalKey



I’m pulling the surrogatekey (StudentID), Natural Key (StudentNaturalKey) , EffectiveStartDate, and determining the NextEffectiveStartDate instead of using EffectiveEndDate because the data warehouse may have gaps or overlap in the dates.  I’m going to join on the NaturalKey in the Merge Transformation so I’m using it to order by.

This is the source import query

SELECT DISTINCT StudentNaturalKey, RecordDate
From Import.Student WITH (NOLOCK)
Order by StudentNaturalKey


I’m pulling back the NaturalKey and RecordDate from the source and ordering by StudentNaturalKey for the Merge Transformation.


Here’s the Merge Transformation joining on natural key


image


Next there’s the conditional split with the following condition to determine the correct record


ISNULL(RecordDate) || ISNULL(StudentID) || (RecordDate >= EffectiveStartDate && RecordDate < NextEffectiveStartDate)



If RecordDate is null then the source record has no date and consequentially there is no corresponding record in the dimension table.  If StudentID is null then there was no corresponding record in the dimension.  Otherwise it checks to see if the RecordDate is between the EffectiveStartDate and the NextEffectiveStartDate.

I then load the matching records into a cache connection manager.  This isn’t the only way but because of the complexity of the transformation dataflow I’d have to use the sort transformation for the merges so caching and then using the lookup transformation performed much better.

image

The cache consists of the natural key, record date, and StudentID.  I look up on the natural key and record date to get the surrogate key.  This allows me to keep the number of records to a minimum as records are often loaded in batches with the same record date.

No comments:

Post a Comment