Most efficient way to join two time series
Imagine I have a table like this:
CREATE TABLE time_series (
    snapshot_date DATE,
    sales INTEGER,
    PRIMARY KEY (snapshot_date));
With values like these:
INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;
I want to be able to do this:
SELECT a.snapshot_date,
       AVG(b.sales) AS sales_avg,
       COUNT(*) AS count
FROM time_series AS a
JOIN time_series AS b
    ON a.snapshot_date > b.snapshot_date
GROUP BY a.snapshot_date;
This produces results like the following:
*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
| 2017-01-02    | 10.0      | 1     |
| 2017-01-03    | 7.0       | 2     |
| 2017-01-04    | 9.0       | 3     |
| 2017-01-05    | 8.5       | 4     |
| 2017-01-06    | 9.8       | 5     |
*---------------*-----------*-------*
With as few rows as in this example, the query runs very quickly. The problem is that I have to do this over millions of rows, and on Redshift (whose syntax is similar to Postgres) my query takes days to run. It is painfully slow, yet this is one of my most common query patterns. I suspect the problem is that the work grows as O(n^2) in the data, rather than the preferable O(n).
My O(n) implementation in Python would be something like this:
rows = [('2017-01-01', 10),
        ('2017-01-02', 4),
        ('2017-01-03', 13),
        ('2017-01-04', 7),
        ('2017-01-05', 15),
        ('2017-01-06', 8)]

sales_total_previous = 0
count = 0
for index, row in enumerate(rows):
    snapshot_date = row[0]
    sales = row[1]
    if index == 0:
        sales_total_previous += sales
        continue
    count += 1
    sales_avg = sales_total_previous / count
    print((snapshot_date, sales_avg, count))
    sales_total_previous += sales
With results like these (the same as the SQL query):
('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)
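For rows that fit in memory, the same O(n) running average can also be expressed with pandas (a sketch, not part of the original question): an expanding mean shifted down one row gives the average of all *previous* days.

```python
import pandas as pd

rows = [('2017-01-01', 10), ('2017-01-02', 4), ('2017-01-03', 13),
        ('2017-01-04', 7), ('2017-01-05', 15), ('2017-01-06', 8)]
df = pd.DataFrame(rows, columns=['snapshot_date', 'sales'])
df = df.sort_values('snapshot_date')

# Running mean of all previous rows: cumulative (expanding) mean, shifted by one.
df['sales_avg'] = df['sales'].expanding().mean().shift(1)
# Number of previous rows contributing to each average.
df['count'] = range(len(df))

# The first row has no previous rows, so its average is NaN; drop it.
result = df.dropna(subset=['sales_avg'])
print(result)
```

This is a single vectorized pass, so it scales linearly like the hand-written loop.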
I'm considering switching to Apache Spark so that I can run that Python logic as a query, but a few million rows isn't really that big (3-4 GB at most), and using a Spark cluster with 100 GB of RAM seems like overkill. Is there an efficient and readable way to get O(n) efficiency in SQL, preferably in Postgres/Redshift?
You seem to want:
SELECT ts.snapshot_date,
       AVG(ts.sales) OVER (ORDER BY ts.snapshot_date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sales_avg,
       ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) - 1 AS count
FROM time_series ts;
The ROWS frame excludes the current row, matching the semantics of your original self-join (the average of strictly earlier dates). The first row comes back with a NULL average and a count of 0, and can be filtered out if needed.
You will find the window-function version far more efficient: it needs only a sort over the data instead of an O(n^2) self-join.
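As a quick sanity check (not part of the original answer), the windowed query can be run against an in-memory SQLite database, assuming a SQLite build with window-function support (3.25+); the `ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING` frame, which excludes the current row, works identically in Postgres/Redshift:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE time_series (snapshot_date TEXT PRIMARY KEY, sales INTEGER)")
conn.executemany("INSERT INTO time_series VALUES (?, ?)",
                 [('2017-01-01', 10), ('2017-01-02', 4), ('2017-01-03', 13),
                  ('2017-01-04', 7), ('2017-01-05', 15), ('2017-01-06', 8)])

# Running average of all previous days via a window frame that stops one row
# before the current one; ROW_NUMBER() - 1 counts the previous rows.
rows = conn.execute("""
    SELECT snapshot_date,
           AVG(sales) OVER (ORDER BY snapshot_date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sales_avg,
           ROW_NUMBER() OVER (ORDER BY snapshot_date) - 1 AS count
    FROM time_series
""").fetchall()

for r in rows[1:]:  # skip the first row, whose running average is NULL
    print(r)
```

The printed rows match both the original self-join query and the Python loop.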
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you need to repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.