
What's my best approach re: creating calculated tables based on scraped data

I have a few spiders running on my VPS to scrape data each day, and the data is stored in MySQL.

I need to build a pretty complicated time series model on the data from various data sources.

Here I run into an issue:

I need to create a new calculated table based on my scraped data. The model is quite complicated as it involves historical raw data and calculated data. I was going to write a Python script to do this, but it does not seem efficient enough.

I then realized that I can just create a view within MySQL and write my model in the form of a nested SQL query. That said, I want the view to be materialized (which is not supported by MySQL now), so that the view can be refreshed each day when new data comes in.

I know there is a third-party plugin called flex***, but I searched online and it does not seem easy to install and maintain.

What is my best approach here?

Thanks for the help.

=========================================================================

To add some clarification, the time series model I made is very complicated. It involves:

  • a rolling average on the raw data
  • a rolling average on the rolling-averaged data above

So it depends on both the raw data and previously calculated data.
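To make that dependency concrete, here is a minimal Python sketch of a rolling average computed on top of another rolling average. The function name `rolling_mean`, the window sizes and the sample values are assumptions chosen only for illustration; they are not the actual model.

```python
from collections import deque


def rolling_mean(values, window):
    """Trailing moving average; emits a value only once the window is full."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is evicted by the append below
        buf.append(v)
        total += v
        if len(buf) == window:
            out.append(total / window)
    return out


raw = [1, 2, 3, 4, 5, 6]                   # hypothetical raw metric
first_pass = rolling_mean(raw, 3)          # rolling average on raw data
second_pass = rolling_mean(first_pass, 2)  # rolling average on the averages
```

Each value in `second_pass` can only be computed after the corresponding values in `first_pass` exist, which is why the calculated table depends on earlier calculated rows as well as on raw rows.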

The timestamp solution does not really solve the complexity of the issue.

I'm just not sure about the best way to do this.

Leaving aside whether you should use a dedicated time-series tool such as rrdtool or carbon, MySQL provides the functionality you need to implement a semi-materialized view, e.g. given data batch-consolidated by date:

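SELECT DATE(event_time), SUM(number_of_events) AS events
, SUM(metric) AS total
, SUM(metric)/SUM(number_of_events) AS average
FROM (
  -- rows that have already been rolled up by date
  SELECT pc.date AS event_time, pc.events AS number_of_events
  , pc.total AS metric
  FROM pre_consolidated pc
  UNION ALL  -- a plain UNION would silently drop identical rows
  -- raw events that have not been consolidated yet
  SELECT rd.timestamp, 1
  , rd.metric
  FROM raw_data rd
  WHERE rd.timestamp>@LAST_CONSOLIDATED_TIMESTAMP
) AS combined  -- MySQL requires an alias on a derived table
GROUP BY DATE(event_time);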

(Note that although you could create this as a view and access that, in my experience MySQL is not the best at optimizing queries involving views, and you might be better off using the equivalent of the above as a template for building your queries.)

The most flexible way to maintain an accurate record of @LAST_CONSOLIDATED_TIMESTAMP would be to add a state column to the raw_data table (to avoid locking and having to use transactions to ensure consistency) and an index on the timestamp of the event, then, periodically:

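-- 1. Claim every unconsolidated row. The state filter already excludes
--    consolidated rows, so >= also picks up late rows that share the last timestamp.
UPDATE raw_data
SET state='PROCESSING'
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state IS NULL;

-- 2. Roll the claimed rows up by date.
INSERT INTO pre_consolidated (date, events, total)
SELECT DATE(rd.timestamp), COUNT(*), SUM(rd.metric)
FROM raw_data rd
WHERE rd.timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND rd.state='PROCESSING'
GROUP BY DATE(rd.timestamp);

-- 3. Remember the newest claimed timestamp (keep the old value if nothing was claimed).
SELECT @NEXT_CONSOLIDATED_TIMESTAMP := COALESCE(MAX(timestamp), @LAST_CONSOLIDATED_TIMESTAMP)
FROM raw_data
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';

-- 4. Mark the claimed rows as done.
UPDATE raw_data
SET state='CONSOLIDATED'
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';

-- 5. Advance the watermark.
SELECT @LAST_CONSOLIDATED_TIMESTAMP := @NEXT_CONSOLIDATED_TIMESTAMP;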

(You should think of a way to persist @LAST_CONSOLIDATED_TIMESTAMP between DBMS sessions.)
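One possible way to persist it is a small one-row-per-job state table, sketched below in Python. Everything here is an assumption for illustration: the table name `consolidation_state`, the helper names, and the use of the standard-library `sqlite3` module as a stand-in for a MySQL connection. On MySQL the upsert would be written as `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` rather than SQLite's `INSERT OR REPLACE`.

```python
import sqlite3


def get_watermark(conn, name="raw_data"):
    """Return the stored timestamp for this job, or None if never set."""
    row = conn.execute(
        "SELECT last_ts FROM consolidation_state WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def set_watermark(conn, ts, name="raw_data"):
    """Store (or overwrite) the timestamp for this job."""
    conn.execute(
        "INSERT OR REPLACE INTO consolidation_state (name, last_ts) VALUES (?, ?)",
        (name, ts),
    )
    conn.commit()


# In-memory database for the demo only; real persistence needs the actual server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE consolidation_state (name TEXT PRIMARY KEY, last_ts TEXT)"
)
set_watermark(conn, "2024-01-01 23:59:59")
set_watermark(conn, "2024-01-02 23:59:59")  # overwrites the earlier value
saved = get_watermark(conn)
```

The consolidation job would read the stored value at the start of each run and write the new one back after the last step.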

Hence the base query (to allow for more than one event with the same timestamp) should be:

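SELECT DATE(event_time), SUM(number_of_events) AS events
, SUM(metric) AS total
, SUM(metric)/SUM(number_of_events) AS average
FROM (
  -- rows that have already been rolled up by date
  SELECT pc.date AS event_time, pc.events AS number_of_events
  , pc.total AS metric
  FROM pre_consolidated pc
  UNION ALL  -- a plain UNION would silently drop identical rows
  -- raw events not yet consolidated; the state filter prevents double counting,
  -- so >= can include late events that share the last consolidated timestamp
  SELECT rd.timestamp, 1
  , rd.metric
  FROM raw_data rd
  WHERE rd.timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
  AND rd.state IS NULL
) AS combined  -- MySQL requires an alias on a derived table
GROUP BY DATE(event_time);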
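As a rough end-to-end check of the idea, the following Python sketch runs one consolidation pass and then compares the combined query before and after. It is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for MySQL, the sample rows are made up, the function names are hypothetical, and for brevity it relies on the state column alone and leaves out the @LAST_CONSOLIDATED_TIMESTAMP filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_data (timestamp TEXT, metric REAL, state TEXT);
CREATE TABLE pre_consolidated (date TEXT, events INTEGER, total REAL);
INSERT INTO raw_data (timestamp, metric) VALUES
  ('2024-01-01 08:00:00', 10), ('2024-01-01 09:00:00', 20),
  ('2024-01-02 08:00:00', 30);
""")


def consolidate(conn):
    """Claim unconsolidated rows, roll them up by date, mark them done."""
    conn.execute("UPDATE raw_data SET state='PROCESSING' WHERE state IS NULL")
    conn.execute("""
        INSERT INTO pre_consolidated (date, events, total)
        SELECT DATE(timestamp), COUNT(*), SUM(metric)
        FROM raw_data WHERE state='PROCESSING'
        GROUP BY DATE(timestamp)""")
    conn.execute(
        "UPDATE raw_data SET state='CONSOLIDATED' WHERE state='PROCESSING'")
    conn.commit()


def daily_summary(conn):
    """Combine rolled-up rows with the not-yet-consolidated raw tail."""
    return conn.execute("""
        SELECT day, SUM(number_of_events), SUM(metric),
               SUM(metric) / SUM(number_of_events)
        FROM (
          SELECT date AS day, events AS number_of_events, total AS metric
          FROM pre_consolidated
          UNION ALL
          SELECT DATE(timestamp), 1, metric
          FROM raw_data WHERE state IS NULL
        ) AS combined
        GROUP BY day ORDER BY day""").fetchall()


before = daily_summary(conn)   # everything still comes from raw_data
consolidate(conn)
# A new event arrives after the consolidation pass.
conn.execute(
    "INSERT INTO raw_data (timestamp, metric) VALUES ('2024-01-02 10:00:00', 50)")
after = daily_summary(conn)    # rolled-up rows plus the new raw event
```

The totals for 2024-01-01 are identical before and after the pass, and the late event on 2024-01-02 is picked up from raw_data without recomputing the rows that were already rolled up.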

Adding the state column to the timestamp index will likely slow down the overall performance of the updates, and it should not be necessary as long as you are applying the consolidation reasonably frequently.

