
How to store data where source is in seconds resolution

I've got some financial tick data that is only at second resolution. The data itself is ordered in chronological order.

Date        Time      Bid     Ask
06/07/2015  19:09:29  0.7623  0.76262
06/07/2015  19:09:29  0.7623  0.76271
06/07/2015  19:09:29  0.7623  0.76276

I'm looking to do some analysis that requires me to know the sub-second order. So my initial thought was to "fake" the milliseconds. When there are multiple ticks (datapoints) within the same second, I'll just assume that they are equally distributed within that second. So since there were 3 ticks within the same second, I'd assume that the first happens at the start of the second, the second at .333, and the third at .666.

Two questions.

  1. Is this approach (faking the milliseconds) the best way to store the actual order?

  2. I have a fairly large amount of data to import, so I was thinking of a two-phase process: upload the raw data, then insert into my destination table with a select that does the millisecond calculation. Any help with the query/approach below would be appreciated.


CREATE TEMPORARY TABLE dataload (
    id serial
  , dt date
  , tm time
  , bid numeric(10,5)
  , ask numeric(10,5)
);

COPY dataload (dt, tm, bid, ask) FROM '/path/to/data.csv' WITH CSV HEADER;

-- INSERT INTO actual_table
SELECT
    dt
  , tm
  , (dt||tm)::timestamp -- Need to hack the milliseconds here
from dataload
group by dt, tm;

If you cannot get improved data, it is better to model it in a way that matches the reality behind it.

So adding fake milliseconds is probably not a good idea, as it will obscure the real situation.

Instead, you can add a field holding the rank of each tick within its second. This can easily be done with row_number() over (partition by date, time), ordered by a column that preserves the original load order.
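A minimal sketch of that idea against the question's dataload staging table. It assumes the serial id column increases in the order the rows appear in the file, and the alias tick_sequence is a name chosen here for illustration:

```sql
-- Number the ticks within each second, in load order.
-- Assumption: dataload.id (serial) reflects the order of rows in the CSV.
SELECT
    dt
  , tm
  , row_number() OVER (PARTITION BY dt, tm ORDER BY id) AS tick_sequence
  , bid
  , ask
FROM dataload
ORDER BY dt, tm, tick_sequence;
```

The sequence restarts at 1 for every distinct (dt, tm) pair, so the three sample rows at 19:09:29 would be numbered 1, 2 and 3.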

You can always convert that number into a fraction of the second later, if that helps to reduce the size of the data.
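If the evenly spaced timestamps from the question are still wanted, they can be derived from the rank at query time rather than stored. This is only a sketch under the same assumption that dataload.id reflects file order. The names seq, ticks_in_second and approx_ts are illustrative:

```sql
-- Derive an approximate sub-second timestamp from the rank within each second.
-- Assumption: dataload.id (serial) reflects the order of rows in the CSV.
WITH ranked AS (
    SELECT
        dt
      , tm
      , bid
      , ask
      , row_number() OVER (PARTITION BY dt, tm ORDER BY id) AS seq
      , count(*)     OVER (PARTITION BY dt, tm)             AS ticks_in_second
    FROM dataload
)
SELECT
    -- date + time yields a timestamp; the offset spreads ticks evenly over the second
    (dt + tm)
      + ((seq - 1)::double precision / ticks_in_second) * interval '1 second' AS approx_ts
  , bid
  , ask
FROM ranked
ORDER BY approx_ts;
```

With three ticks in one second, the offsets are 0, 1/3 and 2/3 of a second, which matches the spacing described in the question while leaving the stored data untouched.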

I'm generally not a fan of faking up data. It would be more accurate to store exactly what you're given, and use a second column to indicate ordering. It will cost some extra storage, but the lack of confusion will probably more than make up for that.

I also wouldn't store time separate from date unless you have a really good reason to.

It would be nice to use file_fdw instead of an intermediate table, to avoid the additional copy step. The problem is that file_fdw doesn't give you row numbers or any other means of ordering to allow you to use row_number() to create the sub-second ordering. So we need to use a function.

CREATE TABLE tick(
  tick_timestamp      timestamptz      NOT NULL
  , tick_sequence     smallint         NOT NULL
  , CONSTRAINT tick__pk_tick_timestamp__tick_sequence PRIMARY KEY( tick_timestamp, tick_sequence )
  , bid               numeric(10,5)    NOT NULL
  , ask               numeric(10,5)    NOT NULL
);
CREATE FOREIGN TABLE tick_csv( tick_date date, tick_time time, bid numeric, ask numeric ) ...;
CREATE FUNCTION get_tick() RETURNS SETOF tick_csv LANGUAGE sql AS $$SELECT * FROM tick_csv$$;
COMMENT ON FUNCTION get_tick() IS $$This function is necessary because file_fdw provides no means of obtaining row numbers, which we need for safe ordering. This function allows use of WITH ORDINALITY to get a row number.$$;
INSERT INTO tick( tick_timestamp, tick_sequence, bid, ask )
  SELECT tick_timestamp
      -- Generate a series that resets for each second
      , row_number() OVER( PARTITION BY tick_timestamp ORDER BY ordinality )
      , bid
      , ask
    FROM (
      SELECT *, ( tick_date + tick_time ) AT TIME ZONE 'America/New_York' AS tick_timestamp -- CHANGE TO CORRECT TIMEZONE
        FROM get_tick() WITH ORDINALITY
    ) raw
;

You might be tempted to omit the function and use something like a temporary sequence to generate a row number. The problem is that this doesn't protect against the planner creating a plan that re-orders the output of the foreign table. That shouldn't happen with something simple that just does a sequential scan, but if you decided to join to some other table, you could easily get out-of-order rows. This SQL function is safe because it won't be inlined: the query uses WITH ORDINALITY, and the function is not defined as STABLE (which you probably could do).
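To see what WITH ORDINALITY adds, here is a standalone illustration that uses the built-in unnest() in place of get_tick(), so it runs without the foreign table. The alias t and the column name val are chosen here for illustration:

```sql
-- WITH ORDINALITY appends a bigint column numbering the rows
-- in the order the set-returning function emits them.
SELECT val, ordinality
FROM unnest(ARRAY['first', 'second', 'third']) WITH ORDINALITY AS t(val, ordinality);
```

Each row comes back paired with its position (1, 2, 3), which is the value the INSERT above uses in ORDER BY ordinality.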

