
SQL/Python: Transform data from a CSV into a table with a different schema, with conditions

So, I have a csv file containing data like this:

id       type      sum_cost         date_time
--------------------------------------------------
a1        pound     500        2019-04-21T10:50:06    
b1        euro      100        2019-04-21T10:40:00    
c1        pound     650        2019-04-21T11:00:00    
d1        usd       410        2019-04-21T00:30:00     

What I want to do is insert this data into a database table whose schema differs from the CSV; the table's columns look like this:

_id, start_time, end_time, pound_cost, euro_cost, count

where I insert from the CSV into this table such that _id = id, start_time is date_time - 1 hour, and end_time is date_time - 30 minutes. For pound_cost and euro_cost: if type is pound, insert the value from sum_cost into pound_cost and put 0 in euro_cost, and likewise for euro. Finally, put 1 in the count column.

So, the result of the table will be like this:

_id   start_time           end_time              pound_cost  euro_cost  count
-----------------------------------------------------------------------------
 a1  2019-04-21T09:50:06  2019-04-21T10:20:06      500           0        1
 b1  2019-04-21T09:40:00  2019-04-21T10:10:00       0           100       1
 c1  2019-04-21T10:00:00  2019-04-21T10:30:00      650           0        1
 d1  2019-04-20T23:30:00  2019-04-21T00:00:00       0           410       1

So, how should I insert the data into the table given these transformations? This is my first time using PostgreSQL and I have not used SQL much, so I wonder if there is a function that can do this. If not, how can I use Python to transform the data and insert it into the table?

Thank you.

As discussed in the comments, you can easily accomplish this using the COPY command and a temporary table to hold the data from the file.

Create a temporary table with the structure of your CSV; note that all columns are of text datatypes. This makes the copy faster because validations are minimised.

CREATE TEMP TABLE temptable
      ( id        TEXT,
        type      TEXT,
        sum_cost  TEXT,
        date_time TEXT );

Use COPY to load the file into this table. If the file lives on the database server, use COPY; if it is on a client machine, use psql's \COPY. Change the delimiter if your file uses a different one.

\COPY temptable from '/somepath/mydata.csv'  with delimiter ',' CSV HEADER;
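
If you are scripting the load from Python instead of psql, psycopg2's copy_expert can perform the same client-side copy. A minimal sketch, assuming placeholder connection parameters and the file path above:

import psycopg2

# Placeholder connection parameters -- replace with your own.
conn = psycopg2.connect(host='localhost', dbname='mydb',
                        user='me', password='secret')
with conn, conn.cursor() as cur:
    # A TEMP table lives only for this session, so create it on the
    # same connection that will run the INSERT later.
    cur.execute("""CREATE TEMP TABLE temptable
                   (id TEXT, type TEXT, sum_cost TEXT, date_time TEXT)""")
    with open('/somepath/mydata.csv') as f:
        # copy_expert streams the client-side file via COPY ... FROM STDIN
        cur.copy_expert(
            "COPY temptable FROM STDIN WITH (FORMAT csv, HEADER true)", f)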

Now, simply run an INSERT INTO .. SELECT using expressions for the various transformations.

INSERT INTO maintable
       ( _id, start_time, end_time, pound_cost, euro_cost, count )
SELECT id,
       date_time::timestamp - INTERVAL '1 HOUR',
       date_time::timestamp - INTERVAL '30 MINUTES',
       CASE type WHEN 'pound' THEN sum_cost::numeric
            ELSE 0 END,
       CASE type WHEN 'euro' THEN sum_cost::numeric  -- you have not specified what
            ELSE 0 END,                              -- happens to usd; adjust as required.
       1 AS count  -- hardcoded based on your description, not sure what it actually means
FROM temptable t;

Now the data is in your main table:

select * from maintable;

 _id |     start_time      |      end_time       | pound_cost | euro_cost | count
-----+---------------------+---------------------+------------+-----------+-------
 a1  | 2019-04-21 09:50:06 | 2019-04-21 10:20:06 |        500 |         0 |     1
 b1  | 2019-04-21 09:40:00 | 2019-04-21 10:10:00 |          0 |       100 |     1
 c1  | 2019-04-21 10:00:00 | 2019-04-21 10:30:00 |        650 |         0 |     1
 d1  | 2019-04-20 23:30:00 | 2019-04-21 00:00:00 |          0 |         0 |     1

Here's how you might be able to reshape the data to your specification:

import os
import pandas as pd
import datetime as dt

csv_dir = r'C:\..\..'  # renamed so it does not shadow the built-in dir()
csv_name = 'my_raw_data.csv'
full_path = os.path.join(csv_dir, csv_name)
data = pd.read_csv(full_path)

def process_df(dataframe=data):
    df1 = dataframe.copy(deep=True)
    df1['date_time'] = pd.to_datetime(df1['date_time'])
    df1['count'] = 1

    ### Maybe get unique types to list for future needs
    _types = df1['type'].unique().tolist()

    ### Process time-series shifts
    df1['start_time'] = df1['date_time'] - dt.timedelta(hours=1)
    df1['end_time'] = df1['date_time'] - dt.timedelta(minutes=30)  # spec: date_time - 30 minutes

    ### Create conditional masks for the dataframe
    pound_type = df1['type'] == 'pound'
    euro_type = df1['type'] == 'euro'

    ### Subsection the dataframe by currency; concatenate the results
    ### (rows of any other type, e.g. usd, are dropped here)
    df_p = df1[pound_type]
    df_e = df1[euro_type]
    df = pd.concat([df_p, df_e]).reset_index(drop=True)

    ### Add conditional columns: take sum_cost where the type matches, else 0
    df['pound_cost'] = [c if t == 'pound' else 0 for t, c in zip(df['type'], df['sum_cost'])]
    df['euro_cost'] = [c if t == 'euro' else 0 for t, c in zip(df['type'], df['sum_cost'])]

    ### Manually input desired field arrangement
    fin_cols = [
        'id',
        'start_time',
        'end_time',
        'pound_cost',
        'euro_cost',
        'count',
        ]
    ### Return formatted dataframe
    return df.reindex(columns=fin_cols).copy(deep=True)

data1 = process_df()

Output:

   id          start_time            end_time  pound_cost  euro_cost  count
0  a1 2019-04-21 09:50:06 2019-04-21 10:20:06         500          0      1
1  c1 2019-04-21 10:00:00 2019-04-21 10:30:00         650          0      1
2  b1 2019-04-21 09:40:00 2019-04-21 10:10:00           0        100      1
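
As an aside, the two list comprehensions can be replaced with a vectorized numpy.where, which takes the matching sum_cost or 0 in one shot. A sketch of that same step inside process_df:

import numpy as np

# Vectorized equivalents of the conditional columns above:
# take sum_cost where the type matches, otherwise 0.
df['pound_cost'] = np.where(df['type'] == 'pound', df['sum_cost'], 0)
df['euro_cost'] = np.where(df['type'] == 'euro', df['sum_cost'], 0)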

To load into the main SQL table, you'd have to open a connection with SQLAlchemy or pyodbc. Then, assuming the data types match, you can use pandas.DataFrame.to_sql() with if_exists='append' to add the data (DataFrame.append() only appends rows to another dataframe, not to a database).
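
A minimal sketch with SQLAlchemy, assuming a placeholder connection string and that maintable already exists:

from sqlalchemy import create_engine

# Placeholder connection string -- substitute your own credentials.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# Rename 'id' to match the table's '_id' column, then append the rows.
data1.rename(columns={'id': '_id'}).to_sql(
    'maintable', engine, if_exists='append', index=False)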
