What's my best approach re: creating calculated tables based on scraped data

I have a few spiders running on my vps to scrape data each day and the data is stored in MySQL.

I need to build a pretty complicated time series model on the data from various data sources.

Here I run into an issue which is that:

I need to create a new calculated table based on my scraped data. The model is quite complicated as it involves historical raw data and calculated data. I was going to write a Python script to do this, but that does not seem efficient enough.

I then realized that I could just create a view within MySQL and write my model as a nested SQL query. However, I want the view to be materialized (which MySQL does not currently support) and refreshed each day when new data comes in.

I know there is a third-party plugin called flex***, but I searched online and it does not seem easy to install and maintain.

What is my best approach here?

Thanks for the help.

=========================================================================

To add some clarification, the time series model I made is very complicated. It involves:

  • rolling average on raw data
  • rolling average on the rolling averaged data above

So it depends on both the raw data and previously calculated data.
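
To make that dependency concrete, here is a minimal Python sketch of a rolling average computed on top of another rolling average. The window size of 2, the sample values, and the helper name `rolling_mean` are assumptions for illustration only, not part of the actual model.

```python
from collections import deque

def rolling_mean(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    buf, total, out = deque(), 0.0, []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the value that left the window
        out.append(total / len(buf))
    return out

raw = [2.0, 4.0, 6.0, 8.0]        # hypothetical raw data
first = rolling_mean(raw, 2)      # rolling average on raw data
second = rolling_mean(first, 2)   # rolling average on the averaged data
print(first)   # [2.0, 3.0, 5.0, 7.0]
print(second)  # [2.0, 2.5, 4.0, 6.0]
```

Each value of the second series can only be computed once the first series has been computed up to the same point, which is why the calculated table has to be built in stages.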

The timestamp solution does not really solve the complexity of the issue.

I'm just not sure about the best way to.

Leaving aside whether you should use a dedicated time-series tool such as rrdtool or carbon, MySQL provides the functionality you need to implement a semi-materialized view. For example, given data that is batch-consolidated by date:

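SELECT DATE(event_time), SUM(number_of_events) AS events
, SUM(metric) AS total
, SUM(metric)/SUM(number_of_events) AS average
FROM (
  -- rows that have already been consolidated per day
  SELECT pc.date AS event_time, events AS number_of_events
  , total AS metric
  FROM pre_consolidated pc
  UNION ALL  -- plain UNION would drop identical raw rows
  -- raw rows that arrived after the last consolidation
  SELECT rd.timestamp, 1
  , rd.metric
  FROM raw_data rd
  WHERE rd.timestamp>@LAST_CONSOLIDATED_TIMESTAMP
) AS combined  -- MySQL requires an alias on a derived table
GROUP BY DATE(event_time)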

(note that although you could create this as a view and access that, IME, MySQL is not the best at optimizing queries involving views and you might be better using the equivalent of the above as a template for building your queries around)
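
As a rough way to try the query above without a MySQL server, the sketch below runs an equivalent statement through Python's built-in `sqlite3` module. SQLite is only a stand-in here, the sample rows are made up, and the bound parameter replaces the `@LAST_CONSOLIDATED_TIMESTAMP` user variable. It uses `UNION ALL` so that two raw events with the same timestamp and metric are both counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL database
conn.executescript("""
    CREATE TABLE pre_consolidated (date TEXT, events INTEGER, total REAL);
    CREATE TABLE raw_data (timestamp TEXT, metric REAL);
    INSERT INTO pre_consolidated VALUES ('2024-01-01', 2, 30.0);
    INSERT INTO raw_data VALUES
        ('2024-01-01 09:00:00', 10.0),  -- already consolidated
        ('2024-01-01 18:00:00', 20.0),  -- already consolidated
        ('2024-01-02 09:00:00', 5.0),
        ('2024-01-02 09:00:00', 5.0),   -- identical row on purpose
        ('2024-01-02 12:00:00', 8.0);
""")

last_consolidated = "2024-01-01 18:00:00"  # hypothetical high-water mark

rows = conn.execute("""
    SELECT DATE(event_time) AS day,
           SUM(number_of_events) AS events,
           SUM(metric) AS total,
           SUM(metric) * 1.0 / SUM(number_of_events) AS average
    FROM (
        SELECT pc.date AS event_time, pc.events AS number_of_events,
               pc.total AS metric
        FROM pre_consolidated pc
        UNION ALL
        SELECT rd.timestamp, 1, rd.metric
        FROM raw_data rd
        WHERE rd.timestamp > ?
    ) AS combined
    GROUP BY DATE(event_time)
    ORDER BY day
""", (last_consolidated,)).fetchall()

for row in rows:
    print(row)
# ('2024-01-01', 2, 30.0, 15.0)
# ('2024-01-02', 3, 18.0, 6.0)
```

The first day comes entirely from the pre-consolidated table and the second entirely from raw rows, yet both are returned by the same query.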

The most flexible way to maintain an accurate record of @LAST_CONSOLIDATED_TIMESTAMP would be to add a state column to the raw_data table (so that you do not have to rely on locking and transactions to ensure consistency) and an index on the timestamp of the event. Then, periodically:

-- 1. claim the rows that have not been consolidated yet
UPDATE raw_data 
SET state='PROCESSING' 
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state IS NULL;

-- 2. roll the claimed rows up by day
INSERT INTO pre_consolidated (date, events, total)
SELECT DATE(rd.timestamp), COUNT(*), SUM(rd.metric)
FROM raw_data rd
WHERE rd.timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND rd.state='PROCESSING'
GROUP BY DATE(rd.timestamp);

-- 3. remember the newest timestamp that was rolled up
SELECT @NEXT_CONSOLIDATED_TIMESTAMP := MAX(timestamp)
FROM raw_data
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';

-- 4. mark the claimed rows as done
UPDATE raw_data
SET state='CONSOLIDATED'
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';

SELECT @LAST_CONSOLIDATED_TIMESTAMP := @NEXT_CONSOLIDATED_TIMESTAMP;

(you should think of a way to persist LAST_CONSOLIDATED_TIMESTAMP between DBMS sessions)
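
MySQL user variables such as @LAST_CONSOLIDATED_TIMESTAMP only live for the current session, so one possible way to persist the value is a one-row state table. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table `consolidation_state`, its column `last_ts`, and the function `consolidate` are hypothetical names. It selects rows by `timestamp >= last_ts` together with the state column, so events that share the last consolidated timestamp are not lost.

```python
import sqlite3

def consolidate(conn):
    """Run one consolidation pass and persist the new high-water mark."""
    (last_ts,) = conn.execute(
        "SELECT last_ts FROM consolidation_state WHERE id = 1").fetchone()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE raw_data SET state = 'PROCESSING' "
            "WHERE timestamp >= ? AND state IS NULL", (last_ts,))
        conn.execute(
            "INSERT INTO pre_consolidated (date, events, total) "
            "SELECT DATE(timestamp), COUNT(*), SUM(metric) FROM raw_data "
            "WHERE timestamp >= ? AND state = 'PROCESSING' "
            "GROUP BY DATE(timestamp)", (last_ts,))
        (next_ts,) = conn.execute(
            "SELECT MAX(timestamp) FROM raw_data "
            "WHERE timestamp >= ? AND state = 'PROCESSING'",
            (last_ts,)).fetchone()
        conn.execute(
            "UPDATE raw_data SET state = 'CONSOLIDATED' "
            "WHERE timestamp >= ? AND state = 'PROCESSING'", (last_ts,))
        if next_ts is not None:  # nothing new: keep the old mark
            conn.execute(
                "UPDATE consolidation_state SET last_ts = ? WHERE id = 1",
                (next_ts,))
    return next_ts or last_ts

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL database
conn.executescript("""
    CREATE TABLE raw_data (timestamp TEXT, metric REAL, state TEXT);
    CREATE TABLE pre_consolidated (date TEXT, events INTEGER, total REAL);
    CREATE TABLE consolidation_state (id INTEGER PRIMARY KEY, last_ts TEXT);
    INSERT INTO consolidation_state VALUES (1, '1970-01-01 00:00:00');
    INSERT INTO raw_data (timestamp, metric) VALUES
        ('2024-01-01 09:00:00', 10.0),
        ('2024-01-01 18:00:00', 20.0),
        ('2024-01-02 09:00:00', 5.0);
""")

new_mark = consolidate(conn)
print(new_mark)  # 2024-01-02 09:00:00
```

Because the mark is read from and written back to a table, a later session (for example a daily cron job) picks up where the previous one stopped.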

Hence the base query (to allow for more than one event with the same timestamp) should be:

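SELECT DATE(event_time), SUM(number_of_events) AS events
, SUM(metric) AS total
, SUM(metric)/SUM(number_of_events) AS average
FROM (
  SELECT pc.date AS event_time, events AS number_of_events
  , total AS metric
  FROM pre_consolidated pc
  UNION ALL  -- plain UNION would drop identical raw rows
  SELECT rd.timestamp, 1
  , rd.metric
  FROM raw_data rd
  -- >= plus the state check picks up unconsolidated rows that share the last timestamp
  WHERE rd.timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
  AND rd.state IS NULL
) AS combined  -- MySQL requires an alias on a derived table
GROUP BY DATE(event_time)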

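Adding the state column to the timestamp index will likely slow down the overall performance of the update as long as you are applying the consolidation reasonably frequently: the timestamp range already narrows the scan to a small number of rows, while every state change would then also have to update the index.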


 