
SQL - finding Sequence / Path

I have a big event occurrence table. It has the following columns:

  • UserId
  • EventId (Type of event)
  • TimeStamp (When this event occurred)

I would like to know all the users that performed some event sequence between a date range.

If I am looking for events sequence 1-2-3... then event 1 should occur before 2 and 2 should occur before 3.

Currently I am just iterating over the record set using a CLR stored procedure. This approach is slow. Is there a better way to do it in SQL?

I am using SQL Server 2008, and there can be duplicate EventIds per UserId.

The table holds around 3-4 billion rows, and a date range could contain about 1 billion rows. Performance is critical.

Thanks

If you know the sequence you're looking for in advance, and it's not too long, you can SELECT the subset of the table you want (restricted to the date range and to one event ID), join as many copies of that subset to each other as needed, and then keep the rows WHERE date(event1) < date(event2) AND date(event2) < date(event3), so that event 1 comes before event 2 and event 2 before event 3. It'd be a rather long query, which is why I'm not typing it out, but it should work without being too inefficient.

EDIT: Example:

SELECT a.userID, a.[date] AS date1, b.[date] AS date2, c.[date] AS date3 FROM
    (SELECT * FROM events WHERE [date] BETWEEN @date1 AND @date2 AND [type] = @type1) a
    INNER JOIN (SELECT * FROM events WHERE [date] BETWEEN @date1 AND @date2 AND [type] = @type2) b ON a.userID = b.userID
    INNER JOIN (SELECT * FROM events WHERE [date] BETWEEN @date1 AND @date2 AND [type] = @type3) c ON a.userID = c.userID
    WHERE a.[date] < b.[date] AND b.[date] < c.[date]  -- event 1 before event 2, event 2 before event 3

Assuming you know the exact sequence at the time of writing the query (either when you're coding it or when your caller code generates it dynamically), you can do this, as long as the sequence is not too long:

select *
from eventTable1 T1, eventTable1 T2, eventTable1 T3
where t1.theTime between '01/01/2000' and '01/01/2001'
  and t2.theTime between '01/01/2000' and '01/01/2001'
  and t3.theTime between '01/01/2000' and '01/01/2001'
  and t1.theTime <= t2.theTime
  and t2.theTime <= t3.theTime
  and t1.eventId = 1
  and t2.eventId = 2
  and t3.eventId = 3
  and t1.userId = t2.userId
  and t1.userId = t3.userId
  and t2.userId = t3.userId -- Needed for performance reasons

This will work fairly well if you have an index on (userId, theTime) and the number of rows is manageable for the given time period (e.g. you don't get the full billion rows across a set of users).

Please note that the above can (and probably SHOULD) be further optimized depending on your data set and the span of the timeframe, by first selecting ALL records for the given timespan into a temp table and then doing the above join on the temp table. This optimization works best if the number of rows in the timespan is manageable (e.g. <100k?) and there is an index on theTime.
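
To make the self-join idea concrete, here is a minimal runnable sketch. It is not SQL Server code: it uses Python's built-in sqlite3 module purely as a stand-in so the query can be executed anywhere, and the table layout, the sample rows, and the helper names make_demo_db and users_with_sequence are all made up for illustration. It also uses strict "later than" comparisons and returns DISTINCT user ids, which is one reading of the question rather than the only one.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE eventTable (userId INTEGER, eventId INTEGER, theTime TEXT)"
    )
    # Index on (userId, theTime), as suggested in the answer above.
    conn.execute("CREATE INDEX ix_user_time ON eventTable (userId, theTime)")
    rows = [
        # user 1: events 1, 2, 3 in order
        (1, 1, "2000-01-01"), (1, 2, "2000-02-01"), (1, 3, "2000-03-01"),
        # user 2: same events, reversed order
        (2, 3, "2000-01-01"), (2, 2, "2000-02-01"), (2, 1, "2000-03-01"),
        # user 3: duplicate event 1, still a valid 1-2-3 sequence
        (3, 1, "2000-01-01"), (3, 1, "2000-01-15"),
        (3, 2, "2000-02-01"), (3, 3, "2000-04-01"),
        # user 4: event 3 happens after the first date range used below
        (4, 1, "2000-01-01"), (4, 2, "2000-02-01"), (4, 3, "2002-01-01"),
    ]
    conn.executemany("INSERT INTO eventTable VALUES (?, ?, ?)", rows)
    return conn


def users_with_sequence(conn, start, end):
    """Return ids of users with events 1, 2, 3 in that time order in [start, end]."""
    sql = """
        SELECT DISTINCT t1.userId
        FROM eventTable t1
        JOIN eventTable t2 ON t2.userId = t1.userId AND t2.theTime > t1.theTime
        JOIN eventTable t3 ON t3.userId = t2.userId AND t3.theTime > t2.theTime
        WHERE t1.eventId = 1 AND t2.eventId = 2 AND t3.eventId = 3
          AND t1.theTime BETWEEN ? AND ?
          AND t2.theTime BETWEEN ? AND ?
          AND t3.theTime BETWEEN ? AND ?
        ORDER BY t1.userId
    """
    # The same (start, end) pair is bound once per table alias.
    return [row[0] for row in conn.execute(sql, (start, end) * 3)]


if __name__ == "__main__":
    demo = make_demo_db()
    print(users_with_sequence(demo, "2000-01-01", "2001-01-01"))  # prints [1, 3]
```

The dates are stored as ISO-8601 strings so that plain string comparison matches chronological order; on SQL Server you would use a proper datetime column instead. DISTINCT matters because duplicate events (user 3 above) otherwise produce one row per matching combination.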


Another approach may be to avoid the JOIN and simply retrieve ALL the events, ordered per user, and then do the "is this the correct sequence" check in the caller code:

SELECT * FROM eventTable
ORDER BY userId, theTime   -- works MUCH better if there is a covering index on these columns

And then in the caller code, you basically do a subsequence match on each user's events (seems trivial to me, but feel free to ask a separate question on SO if you aren't sure how).
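
For what it's worth, the per-user check can be sketched in a few lines. The caller language is not stated in the thread, so Python is an assumption here, and contains_sequence is a hypothetical helper name. It takes one user's event ids, already sorted by timestamp, and tests whether the pattern appears in order, with other events allowed in between.

```python
def contains_sequence(events, pattern):
    """Return True if `pattern` occurs in `events` in order (gaps allowed).

    `events` is an iterable of event ids for ONE user, already sorted by
    timestamp. `pattern` is a list such as [1, 2, 3].
    """
    if not pattern:
        return True  # an empty pattern is trivially matched
    wanted = 0  # index of the next pattern element still to be found
    for event_id in events:
        if event_id == pattern[wanted]:
            wanted += 1
            if wanted == len(pattern):
                return True  # every element was found, in order
    return False


if __name__ == "__main__":
    print(contains_sequence([1, 5, 1, 2, 2, 4, 3], [1, 2, 3]))  # prints True
    print(contains_sequence([3, 2, 1], [1, 2, 3]))  # prints False
```

Matching greedily on the earliest occurrence of each element is enough to decide whether any ordered match exists, so a single pass over the user's events suffices. How events with identical timestamps should be ordered is left open by the question and would need to be decided in the ORDER BY.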

Since this is pretty much per-user processing, you can avoid blowing out the memory by selecting chunks of users: approximate the number of events per user, then grab as many users as would be safe for your memory. For that to work fast, your SQL must support "TOP" or "LIMIT" syntax AND you should have a pre-built list of all users in a temp table.
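
A related way to keep memory bounded, different from the explicit user chunks described above, is to stream the ordered rows and keep only one user's state at a time. The sketch below assumes Python caller code and rows arriving as (userId, eventId) tuples already sorted by userId and then by time; matching_users is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def matching_users(rows, pattern):
    """Yield each userId whose time-ordered events contain `pattern` in order.

    `rows` is any iterable of (userId, eventId) tuples sorted by userId and
    then by timestamp, e.g. a database cursor over a query that ends in
    "ORDER BY userId, theTime". Rows are consumed lazily, one user at a time.
    """
    for user_id, user_rows in groupby(rows, key=itemgetter(0)):
        wanted = 0  # index of the next pattern element still to be found
        for _, event_id in user_rows:
            if wanted < len(pattern) and event_id == pattern[wanted]:
                wanted += 1
        if wanted == len(pattern):
            yield user_id


if __name__ == "__main__":
    sample = [
        (1, 1), (1, 2), (1, 3),          # in order
        (2, 3), (2, 2), (2, 1),          # reversed
        (3, 1), (3, 1), (3, 2), (3, 4), (3, 3),  # duplicates and noise
    ]
    print(list(matching_users(sample, [1, 2, 3])))  # prints [1, 3]
```

This relies on the input really being sorted by user, since groupby only groups adjacent rows. Whether streaming a billion rows to the client is acceptable at all depends on the network and driver, which the thread does not cover.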

Something like this?


select userid, eventid, theTime 
from eventTable 
where theTime between '01/01/2000' and '01/01/2001'
order by theTime DESC 

