
Adding a new column to an Athena (Presto) table, calculated as the difference between two rows

Over the past few weeks, I've written a pipeline that picks up all the clickstream data broadcast from a website. The pipeline uses AWS as follows: S3 > EC2 (for transforms) > Athena (scanning a clean, partitioned S3). New data comes into the pipeline every 24 hours and this works great: my clickstream data is easily queryable. However, I now need to add some additional columns, e.g. time spent on each page. This can be achieved by sorting by user ID and timestamp, and then taking the difference between the timestamp column of row_n1 and row_n2. So my questions are:

1) How can I do this via an SQL query? I'm struggling to get it to work, but my thinking is that once I do, I can trigger this query every 24 hours to run on the new clickstream data that's coming into Athena.

2) Is this a reasonable way to add additional columns or new aggregate tables? For example, building a query that runs every 24 hours on new data and appends the result to a new table.

Ideally, I don't want to touch any of the source code that's been written to do the "core" ETL pipeline.

For reference, my table looks similar to the following (with the new column, time spent on page):

| userID     | eventNum | Category | Time            | ...... | timeSpentOnPage |
| '103-1023' | '3'      | 'View'   | '12-10-2019...' | ...... | 3s              |

Thanks for any direction/advice that can be provided.

I'm not entirely sure what you are asking, and some example data and expected output would be helpful. For example, I don't quite understand what you mean by row_n1 and row_n2.

I'm going to guess that you mean something like calculating the difference between the timestamps of consecutive rows. That can be achieved with a query like:

SELECT
  userID,
  -- time elapsed since the same user's previous event (NULL for their first event)
  timestamp - LAG(timestamp, 1) OVER (PARTITION BY userID ORDER BY timestamp) AS timeSpentOnPage
FROM events

The LAG window function returns the value from a previous row (the 1 means one row back) within the window defined by the OVER clause: in this case all rows with the same userID, sorted by timestamp. It's a bit like GROUP BY, except that it produces a value for every row instead of collapsing each group into a single row. For the first row of each user there is no previous row, so the result is NULL.
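To make the window behaviour concrete, here is a small sketch that runs the same LAG expression over a few inline rows instead of the real table. The sample values and the column name event_time are made up for illustration, and the sketch assumes the column is a TIMESTAMP (a string column would need parsing first, for example with date_parse). It also shows date_diff, which returns the gap as a plain number of seconds rather than an interval.

```sql
-- Inline sample rows; user IDs and times are invented for illustration.
WITH sample_events AS (
  SELECT *
  FROM (
    VALUES
      ('103-1023', TIMESTAMP '2019-10-12 10:00:00'),
      ('103-1023', TIMESTAMP '2019-10-12 10:00:03'),
      ('103-1023', TIMESTAMP '2019-10-12 10:01:00'),
      ('207-0001', TIMESTAMP '2019-10-12 10:00:30')
  ) AS t (userID, event_time)
)
SELECT
  userID,
  event_time,
  -- Gap since this user's previous event, as an interval.
  event_time
    - LAG(event_time, 1) OVER (PARTITION BY userID ORDER BY event_time)
    AS sincePrevious,
  -- The same gap as a whole number of seconds.
  date_diff(
    'second',
    LAG(event_time, 1) OVER (PARTITION BY userID ORDER BY event_time),
    event_time
  ) AS secondsSincePrevious
FROM sample_events
ORDER BY userID, event_time
```

For user '103-1023' the secondsSincePrevious column should come out as NULL, 3 and 57; the single row for '207-0001' gets NULL because that user has no earlier event.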

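This wouldn't quite give you the time spent on each page. Some page views would look very long when in fact there was just no activity between them: say someone browsed a bit, went to lunch, and browsed some more; the last page view before lunch would look like it spanned the whole lunch. Also note that LAG puts the gap on the later of the two rows; to put it on the row of the page that was actually being viewed, use LEAD, which looks at the next row instead.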
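A sketch of one way to soften this problem: compare each row with the same user's next event using LEAD, and treat gaps above a cutoff as inactivity rather than reading time. The 30-minute cutoff is an arbitrary assumption, not something from the question, and the query assumes "timestamp" is a TIMESTAMP column of the events table.

```sql
-- Sketch only: the 30-minute cutoff is an arbitrary assumption, and
-- "timestamp" is assumed to be a TIMESTAMP column of the events table.
WITH gaps AS (
  SELECT
    userID,
    "timestamp",
    -- Seconds until the same user's next event; NULL for the user's last event.
    date_diff(
      'second',
      "timestamp",
      LEAD("timestamp", 1) OVER (PARTITION BY userID ORDER BY "timestamp")
    ) AS secondsToNext
  FROM events
)
SELECT
  userID,
  "timestamp",
  -- Gaps longer than 30 minutes are treated as inactivity and left NULL.
  CASE WHEN secondsToNext <= 30 * 60 THEN secondsToNext END AS timeSpentOnPage
FROM gaps
```

Leaving long gaps as NULL rather than capping them keeps the guesswork out of any later averages; whether that is the right choice depends on how the numbers will be used.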


There is no way to do the equivalent of UPDATE in Athena. The closest thing is a "CTAS" (CREATE TABLE AS SELECT) query that creates a new table (which, with some automation, can be turned into creating new partitions for existing tables).
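As a rough sketch of what that could look like: the first statement below creates a derived table once with CTAS, and the second appends one day of rows on each later run using INSERT INTO ... SELECT, which Athena also supports. The table name events_with_time_spent, the S3 path, the dt partition column and the date literal are all placeholders, not details from the question. The gap is stored as seconds via date_diff, since an interval cannot be written to the output table.

```sql
-- Statement 1, run once: create the derived table from the existing data.
-- Table name, S3 location and the dt partition column are placeholders.
CREATE TABLE events_with_time_spent
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/events_with_time_spent/',
  partitioned_by = ARRAY['dt']
) AS
SELECT
  userID,
  "timestamp" AS eventTime,
  date_diff(
    'second',
    LAG("timestamp", 1) OVER (PARTITION BY userID ORDER BY "timestamp"),
    "timestamp"
  ) AS timeSpentOnPage,
  dt  -- partition columns must come last in the SELECT list
FROM events

-- Statement 2, run daily as a separate query: append the newest day's rows.
INSERT INTO events_with_time_spent
SELECT
  userID,
  "timestamp" AS eventTime,
  date_diff(
    'second',
    LAG("timestamp", 1) OVER (PARTITION BY userID ORDER BY "timestamp"),
    "timestamp"
  ) AS timeSpentOnPage,
  dt
FROM events
WHERE dt = '2019-10-12'  -- placeholder: the partition that just arrived
```

The two statements have to be submitted as separate queries. Two caveats with this sketch: a single CTAS or INSERT INTO query can only write a limited number of partitions (100 per query, per the Athena documentation), so a large backfill may need to be split up, and the daily run only sees events within that day, so a user's first event of the day gets NULL even if they were active just before midnight.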

If you provide some more information about your data I can revise this answer with other suggestions.


 