MySQL time series database to track quantity/price/data history: how to insert a new row only if the new value is different from the previous one?

I am trying to build a time series product database that tracks product stock quantities (100k+ products). It will be updated from a CSV file every 30 minutes, and I only want to insert a new record if AvailQuant or AvailNextQuant has changed. Every new source CSV file has a new date and time on every row. Some stock quantities might change only once per month, so there is no point in inserting a duplicate row every 30 minutes when only the time is different. There must be some easy, obvious way to do this, as I would think it is quite a common thing to do (price history tracking sites, for example, update only when the price changes).

Columns are as follows: ProductID, AvailQuant, AvailDate, AvailTime, AvailNextQuant, AvailNextDate.

I first thought of using 3 separate tables: tmp1, tmp2 and a final time series table. First LOAD DATA INFILE REPLACE into the tmp1 table, then INSERT only new products and UPDATE existing products whose stock value changed into tmp2, and after that INSERT IGNORE from tmp2 into the final time series table, where the unique index is ProductID + Date + Time. I am not sure how to achieve this, or whether it is anywhere near a correct approach. Now I also think that with LOAD DATA INFILE I should only need one tmp table?

PS. I'm a total newbie with MySQL, so if someone knows how to do this, a little explanation with example code would be highly appreciated.

Set ProductID, AvailQuant and AvailNextQuant as the primary key, then use an INSERT with ON DUPLICATE KEY.
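To see what a composite key on (ProductID, AvailQuant, AvailNextQuant) would and would not do, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`; the sample rows are made up for illustration and the table name `time_series` is borrowed from the later answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE time_series (
        ProductID      INTEGER NOT NULL,
        AvailQuant     INTEGER NOT NULL,
        AvailDate      TEXT    NOT NULL,
        AvailTime      TEXT    NOT NULL,
        AvailNextQuant INTEGER NOT NULL,
        AvailNextDate  TEXT,
        PRIMARY KEY (ProductID, AvailQuant, AvailNextQuant)
    )
""")

# Hypothetical feed rows, in the column order listed in the question.
feed = [
    (1, 10, "2024-01-01", "10:00:00", 5, "2024-02-01"),
    (1, 10, "2024-01-01", "10:30:00", 5, "2024-02-01"),  # only the time changed
    (1, 8,  "2024-01-01", "11:00:00", 5, "2024-02-01"),  # quantity changed
    (1, 10, "2024-01-01", "11:30:00", 5, "2024-02-01"),  # quantity back to 10
]

# Rows that collide with an existing key are silently skipped.
conn.executemany("INSERT OR IGNORE INTO time_series VALUES (?, ?, ?, ?, ?, ?)", feed)

rows = conn.execute(
    "SELECT AvailQuant, AvailTime FROM time_series ORDER BY AvailTime"
).fetchall()
print(rows)
```

In this sketch the second row is skipped as intended, but so is the fourth: when a quantity returns to a value it had before, the key already exists and the change never reaches the history. Whether that matters depends on how complete the history needs to be.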

On duplicate key ignore?

So this is what I have come up with so far. I am not 100% sure that it works correctly and doesn't skip any rows, but it looked OK when I tested it. If someone knows a better and simpler way, please let us know (there must be an easier way). This method is not perfect, as discontinued products will not be deleted from the temp tables. I am also not sure how to test the integrity of the data and the code, as there are 100k+ rows in each file that gets loaded every 30 minutes.

I have set up 3 tables with identical columns: tmp1, tmp2 and time_series.

Step 1, tmp1: Primary key = ProductID (CSV gets imported here)

Step 2, tmp2: Primary key = ProductID (Cleans the unwanted rows)

Final, time_series: Primary key = ProductID, AvailDate, AvailTime (Holds the time series data history)

Columns are as follows: ProductID, AvailQuant, AvailDate, AvailTime, AvailNextQuant, AvailNextDate.

Step 1: First we need to get the data from the tab-delimited CSV file into the database, using LOAD DATA INFILE to load the file into tmp1. The REPLACE keyword, with ProductID as the primary key, replaces products that already exist and inserts new ones that are not yet in the table. Discontinued products will not be deleted from tmp1. We only want the latest data, which is why REPLACE is used.

sql1 = "LOAD DATA LOCAL INFILE 'csv_file.txt'
       REPLACE
       INTO TABLE tmp1
       FIELDS TERMINATED BY '\t'
       ENCLOSED BY ''
       LINES TERMINATED BY '\n'
       IGNORE 1 ROWS";
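The load-and-replace behaviour of step 1 can be tried on a small scale without a MySQL server. The sketch below is an approximation in Python with sqlite3, not the LOAD DATA statement itself: `load_feed` is a hypothetical helper, the feed contents are invented, and `INSERT OR REPLACE` stands in for the REPLACE keyword.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tmp1 (
        ProductID      INTEGER PRIMARY KEY,
        AvailQuant     INTEGER NOT NULL,
        AvailDate      TEXT    NOT NULL,
        AvailTime      TEXT    NOT NULL,
        AvailNextQuant INTEGER NOT NULL,
        AvailNextDate  TEXT
    )
""")

def load_feed(conn, text):
    """Load tab-delimited text into tmp1, replacing rows with the same ProductID."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header line, like IGNORE 1 ROWS
    rows = [
        (int(r[0]), int(r[1]), r[2], r[3], int(r[4]), r[5])
        for r in reader
        if r  # tolerate blank lines
    ]
    conn.executemany("INSERT OR REPLACE INTO tmp1 VALUES (?, ?, ?, ?, ?, ?)", rows)
    return len(rows)

HEADER = "ProductID\tAvailQuant\tAvailDate\tAvailTime\tAvailNextQuant\tAvailNextDate\n"
first_feed = HEADER + (
    "1\t10\t2024-01-01\t10:00:00\t5\t2024-02-01\n"
    "2\t0\t2024-01-01\t10:00:00\t20\t2024-01-15\n"
)
# Product 2 is missing from the second feed (discontinued in this made-up example).
second_feed = HEADER + "1\t10\t2024-01-01\t10:30:00\t5\t2024-02-01\n"

load_feed(conn, first_feed)
load_feed(conn, second_feed)

snapshot = conn.execute(
    "SELECT ProductID, AvailTime FROM tmp1 ORDER BY ProductID"
).fetchall()
print(snapshot)
```

After the second load, product 1 carries the newer time while product 2 is still present with its old row, which matches the remark that discontinued products are not deleted from tmp1.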

Step 2: Then we compare ProductID, AvailQuant and AvailNextQuant in tmp1 against tmp2, and copy only the changed rows from tmp1 into tmp2. Again, the REPLACE command with ProductID as the primary key replaces old rows with new ones, and new rows (products) that did not exist before are inserted into tmp2 as well. Discontinued products will not be deleted from tmp2. Without step 2, tmp1 would pass along rows where only the date and time differ, and the time series would end up with duplicate rows that differ only in date and time. After this step, tmp2 is ready for the time series table: it contains the newly changed rows plus the existing rows that did not change. The unchanged rows keep their old date and time, so they will be ignored on the final insert.

sql2 = "REPLACE INTO tmp2
       SELECT tmp1.*
       FROM tmp1 LEFT OUTER JOIN tmp2
       ON tmp1.ProductID=tmp2.ProductID
       AND tmp1.AvailQuant=tmp2.AvailQuant
       AND tmp1.AvailNextQuant=tmp2.AvailNextQuant
       WHERE tmp2.ProductID IS NULL";
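The anti-join in step 2 is easy to check on three rows. This is a small sketch using Python's sqlite3, which accepts the same `REPLACE INTO ... SELECT` form; the sample data is invented, and the quantity columns are assumed to be NOT NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("tmp1", "tmp2"):
    conn.execute(f"""
        CREATE TABLE {table} (
            ProductID      INTEGER PRIMARY KEY,
            AvailQuant     INTEGER NOT NULL,
            AvailDate      TEXT    NOT NULL,
            AvailTime      TEXT    NOT NULL,
            AvailNextQuant INTEGER NOT NULL,
            AvailNextDate  TEXT
        )
    """)

# tmp2 holds the state left by the previous run (made-up values).
conn.executemany("INSERT INTO tmp2 VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 10, "2024-01-01", "10:00:00", 5, "2024-02-01"),
    (2, 0,  "2024-01-01", "10:00:00", 20, "2024-01-15"),
])
# tmp1 holds the newest feed.
conn.executemany("INSERT INTO tmp1 VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 10, "2024-01-01", "10:30:00", 5, "2024-02-01"),   # unchanged quantities
    (2, 3,  "2024-01-01", "10:30:00", 20, "2024-01-15"),  # AvailQuant changed
    (3, 7,  "2024-01-01", "10:30:00", 0, None),           # new product
])

# Rows of tmp1 with no exact match in tmp2 are either changed or new.
conn.execute("""
    REPLACE INTO tmp2
    SELECT tmp1.*
    FROM tmp1 LEFT OUTER JOIN tmp2
      ON  tmp1.ProductID      = tmp2.ProductID
      AND tmp1.AvailQuant     = tmp2.AvailQuant
      AND tmp1.AvailNextQuant = tmp2.AvailNextQuant
    WHERE tmp2.ProductID IS NULL
""")

state = conn.execute(
    "SELECT ProductID, AvailQuant, AvailTime FROM tmp2 ORDER BY ProductID"
).fetchall()
print(state)
```

Product 1 keeps its old 10:00:00 timestamp, product 2 is replaced, and product 3 is added. One caveat if the quantity columns can be NULL: a plain `=` comparison never matches NULL, so such a row would look changed on every run. MySQL's NULL-safe `<=>` operator is one way around that.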

Finally, we can INSERT IGNORE from tmp2 into the time_series table. Because the primary key is (ProductID, AvailDate, AvailTime), IGNORE skips the duplicate-key errors for rows that are already in the time series table and have not changed in tmp2.

sql3 = "INSERT IGNORE INTO time_series
       SELECT * FROM tmp2";
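On the question of how to test the integrity of the result, one possible check is to walk the history in time order and count rows whose quantities equal those of the previous row for the same product. The sketch below, again in Python with sqlite3 standing in for MySQL, runs the final insert over three simulated cycles and then applies that check. `final_insert` and `redundant_rows` are hypothetical helper names, the data is invented, and the ordering assumes dates and times are stored in a sortable format such as YYYY-MM-DD and HH:MM:SS.

```python
import sqlite3

COLUMNS = """
    ProductID      INTEGER NOT NULL,
    AvailQuant     INTEGER NOT NULL,
    AvailDate      TEXT    NOT NULL,
    AvailTime      TEXT    NOT NULL,
    AvailNextQuant INTEGER NOT NULL,
    AvailNextDate  TEXT
"""
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE tmp2 ({COLUMNS}, PRIMARY KEY (ProductID))")
conn.execute(
    f"CREATE TABLE time_series ({COLUMNS}, "
    "PRIMARY KEY (ProductID, AvailDate, AvailTime))"
)

def final_insert(conn):
    """Copy tmp2 into the history; rows with an existing key are skipped."""
    conn.execute("INSERT OR IGNORE INTO time_series SELECT * FROM tmp2")

def redundant_rows(conn):
    """Count history rows that repeat the previous quantities of the same product."""
    previous = {}
    redundant = 0
    cursor = conn.execute(
        "SELECT ProductID, AvailQuant, AvailNextQuant FROM time_series "
        "ORDER BY ProductID, AvailDate, AvailTime"
    )
    for product_id, quant, next_quant in cursor:
        if previous.get(product_id) == (quant, next_quant):
            redundant += 1
        previous[product_id] = (quant, next_quant)
    return redundant

# Run 1: the product appears for the first time.
conn.execute("REPLACE INTO tmp2 VALUES (1, 10, '2024-01-01', '10:00:00', 5, '2024-02-01')")
final_insert(conn)
# Run 2: step 2 found no change, so tmp2 is untouched and nothing new is inserted.
final_insert(conn)
# Run 3: the quantity changed, so step 2 replaced the row in tmp2.
conn.execute("REPLACE INTO tmp2 VALUES (1, 8, '2024-01-01', '11:00:00', 5, '2024-02-01')")
final_insert(conn)

history = conn.execute(
    "SELECT AvailQuant, AvailTime FROM time_series ORDER BY AvailDate, AvailTime"
).fetchall()
clean = redundant_rows(conn)

# Deliberately add a row that differs only in time, to show the check firing.
conn.execute("INSERT INTO time_series VALUES (1, 8, '2024-01-01', '11:30:00', 5, '2024-02-01')")
flagged = redundant_rows(conn)
print(history, clean, flagged)
```

A count of zero only shows that no redundant rows were stored; it does not prove that no genuine change was missed, so it is worth pairing with a spot check of a few products against the source files.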

