Handling Begin and End Times in SQL Server Analysis Services 2008

Introduction

A common Analysis Services scenario is to handle events with a begin time and end time and then want to know what is happening at any point in time. For example, when people watch TV there is a time they turn it on and off. What broadcasters want to know is how many viewers are watching at any given time – often down to the minute level. There are many other examples:  issue tracking (opened date, closed date, assigned to, etc), flight arrival and departure. The list goes on, but we’ll stick to TV viewing.

There is a common way to handle this – chop up the events into slices at the lowest level of detail. But this can dramatically increase the amount of data. For example, someone watching TV for a couple hours generates over 100 transactions this way. Multiply this by millions of viewers over several months and concerns about handling large data volumes arise. This document discusses a slightly altered approach to handle the data scale issues. (Special thanks for Thomas Kejser contributing.)

Requirements and Assumptions

I’ll keep things simple and strip away everything not germane to the approach to outline the concept. We have original records with only a begin time and an end time and a box ID. The box ID identifies a customer’s receiver. There is much, much more information such as channel, demographic information and such but this extra information is not relevant to the scenario.

The requirements are straightforward. I want to know:

-          Distinct viewers watching at any point in time to the minute

-          Average viewers watching over any interval

And my users live in excel (I’ll get to why this is important later).

A distinct viewer means that if you turned the TV on once or more in some span of time, I want to count you once. The average number of viewers for a given period is computed by multiplying each viewing in that time period by its viewing time and dividing the result by the time period.

So, for example, if I had one viewer watch 2 hours of television over 24 hours, I have one distinct viewer and an average viewers of  (1 viewer * 2 hours) /( 24 hours) = 0.084 average viewers for that day.

I’m going to further assume that each viewing is on the order of an hour.

I’m also going to assume that large data volumes prevent me from creating transactions for each minute – it’s just not feasible to build a 10+ terabyte database. For smaller data volumes it may be simpler to create a transaction at the grain in time that requirements dictate.

The approach

Each record with a begin time and end time is chopped to an hour granularity and given an Interval ID and HourID. The HourID is simply the hour of the day from 1 to 24. The IntervalID maps to a value in a IntervalCombinations table that contains all possible minute intervals;  eg, 0-0, 0-1,0-2,…0-59, 1-1, 1-2,…, 1-59, … 59-59. For example, a single record:

Box Id

Begin Time

End Time

34

2:00

3:05

 

Generates the following records in the new fact table:

Box Id

Hour Id

IntervalId

34

2

59

34

3

5

 

The IntervalId is a foreign key into our Interval table. The schema is displayed below.

Figure 1 Schema

A separate dimension table, Minute, contains the minutes of the hour; it has 60 rows. A fact table maps the IntervalMinutes to each Minute. So, for example, the Interval representing minutes 7-10 has 4 rows in this fact table; one for each minute in the interval.

Then create a many-to-many dimension between the event fact table and the Minute table:

Figure 2 Dimension Usage

When browsing by minute, all events with intervals that span that minute are included.

Now I can browse data by the minute and my cube size is almost 2 orders of magnitude smaller than using the conventional approach of creating transactions for each minute.

I’ve separated out the MinuteInterval in its own dimension mainly for clarity, but the two dimensions can be merged (the granularity attribute would be a composite of IntervalId and HourID).

A distinct count column on BoxID gives us the distinct viewers. The calculation to get the average number of viewers for an arbitrary time period spanning is as follows:

create dynamic set currentcube.ExistingMinutes as

descendants([Minute Interval].[Hierarchy].currentmember,,leaves);

 

create member currentcube.measures.AverageViewers as

sum(ExistingMinutes,[Measures].[Box ID]) / count(ExistingMinutes);

 

The original expression for AverageViewers is a straightforward expression. The dynamic set and assignment is required because of how Excel deals with filter multi-selects. As Mosha explains here, query sets and dynamic sets are existed with subselects. So the expression with the dynamic set respects excel’s subselects and returns the expected results (this is a dark corner of MDX that is hopefully cleaned up in the next release).