
SQL/Python: Transform data from a CSV into a table with a different schema, with conditions

So, I have a csv file containing data like this:

id       type      sum_cost         date_time
--------------------------------------------------
a1        pound     500        2019-04-21T10:50:06    
b1        euro      100        2019-04-21T10:40:00    
c1        pound     650        2019-04-21T11:00:00    
d1        usd       410        2019-04-21T00:30:00     

What I want to do is insert this data into a database table whose schema differs from the CSV; the table's columns look like this:

_id, start_time, end_time, pound_cost, euro_cost, count

where I insert from the CSV into this table such that _id = id, start_time is date_time - 1 hour, and end_time is date_time - 30 minutes. For pound_cost and euro_cost: if type is pound, insert the value from sum_cost into pound_cost and put 0 in euro_cost, and likewise for euro. Finally, put 1 in the count column.

So, the result of the table will be like this:

_id   start_time           end_time              pound_cost  euro_cost  count
-----------------------------------------------------------------------------
 a1  2019-04-21T09:50:06  2019-04-21T10:20:06      500           0        1
 b1  2019-04-21T09:40:00  2019-04-21T10:10:00       0           100       1
 c1  2019-04-21T10:00:00  2019-04-21T10:30:00      650           0        1
 d1  2019-04-20T23:30:00  2019-04-21T00:00:00       0           410       1

So, how should I insert the data into the table given these transformations? This is my first time using PostgreSQL and I have not used SQL much, so I wonder if there is a function that can do this. If not, how can I use Python to transform the data and insert it into the table?

Thank you.

As discussed in the comments, you can easily accomplish this using the COPY command and a temporary table to hold the data from the file.

Create a temporary table with the structure of your CSV; note that all columns are of text datatypes. This makes the copy faster because validations are minimised.

CREATE TEMP TABLE temptable
      ( id        TEXT,
        type      TEXT,
        sum_cost  TEXT,
        date_time TEXT );

Use COPY to load the file into this table. If the file lives on the database server, use COPY; if it is on a client machine, use psql's \COPY. Change the delimiter if your file uses a different one.

\COPY temptable from '/somepath/mydata.csv'  with delimiter ',' CSV HEADER;
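
If you are scripting the load from Python instead of psql, psycopg2's copy_expert can perform the same client-side copy. A minimal sketch, assuming placeholder connection parameters and the file path above:

import psycopg2

# Placeholder connection parameters -- replace with your own.
conn = psycopg2.connect(host='localhost', dbname='mydb',
                        user='me', password='secret')
with conn, conn.cursor() as cur:
    # A TEMP table lives only for this session, so create it on the
    # same connection that will run the INSERT later.
    cur.execute("""CREATE TEMP TABLE temptable
                   (id TEXT, type TEXT, sum_cost TEXT, date_time TEXT)""")
    with open('/somepath/mydata.csv') as f:
        # copy_expert streams the client-side file via COPY ... FROM STDIN
        cur.copy_expert(
            "COPY temptable FROM STDIN WITH (FORMAT csv, HEADER true)", f)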

Now, simply run an INSERT INTO .. SELECT using expressions for the various transformations.

INSERT INTO maintable
       ( _id, start_time, end_time, pound_cost, euro_cost, count )
SELECT id,
       date_time::timestamp - INTERVAL '1 HOUR',
       date_time::timestamp - INTERVAL '30 MINUTES',
       CASE type WHEN 'pound' THEN sum_cost::numeric
            ELSE 0 END,
       CASE type WHEN 'euro' THEN sum_cost::numeric  -- you have not specified what
            ELSE 0 END,                              -- happens to usd; adjust as required.
       1 AS count  -- hardcoded based on your description, not sure what it actually means
FROM temptable t;

Now the data is in your main table:

select * from maintable;

 _id |     start_time      |      end_time       | pound_cost | euro_cost | count
-----+---------------------+---------------------+------------+-----------+-------
 a1  | 2019-04-21 09:50:06 | 2019-04-21 10:20:06 |        500 |         0 |     1
 b1  | 2019-04-21 09:40:00 | 2019-04-21 10:10:00 |          0 |       100 |     1
 c1  | 2019-04-21 10:00:00 | 2019-04-21 10:30:00 |        650 |         0 |     1
 d1  | 2019-04-20 23:30:00 | 2019-04-21 00:00:00 |          0 |         0 |     1

Here's how you might be able to reshape the data to your specification:

import os
import pandas as pd
import datetime as dt

csv_dir = r'C:\..\..'  # renamed so it does not shadow the built-in dir()
csv_name = 'my_raw_data.csv'
full_path = os.path.join(csv_dir, csv_name)
data = pd.read_csv(full_path)

def process_df(dataframe=data):
    df1 = dataframe.copy(deep=True)
    df1['date_time'] = pd.to_datetime(df1['date_time'])
    df1['count'] = 1

    ### Maybe get unique types to list for future needs
    _types = df1['type'].unique().tolist()

    ### Process time-series shifts
    df1['start_time'] = df1['date_time'] - dt.timedelta(hours=1)
    df1['end_time'] = df1['date_time'] - dt.timedelta(minutes=30)  # spec: date_time - 30 minutes

    ### Create conditional masks for the dataframe
    pound_type = df1['type'] == 'pound'
    euro_type = df1['type'] == 'euro'

    ### Subsection the dataframe by currency; concatenate the results
    ### (rows of any other type, e.g. usd, are dropped here)
    df_p = df1[pound_type]
    df_e = df1[euro_type]
    df = pd.concat([df_p, df_e]).reset_index(drop=True)

    ### Add conditional columns: take sum_cost where the type matches, else 0
    df['pound_cost'] = [c if t == 'pound' else 0 for t, c in zip(df['type'], df['sum_cost'])]
    df['euro_cost'] = [c if t == 'euro' else 0 for t, c in zip(df['type'], df['sum_cost'])]

    ### Manually input desired field arrangement
    fin_cols = [
        'id',
        'start_time',
        'end_time',
        'pound_cost',
        'euro_cost',
        'count',
        ]
    ### Return formatted dataframe
    return df.reindex(columns=fin_cols).copy(deep=True)

data1 = process_df()

Output:

   id          start_time            end_time  pound_cost  euro_cost  count
0  a1 2019-04-21 09:50:06 2019-04-21 10:20:06         500          0      1
1  c1 2019-04-21 10:00:00 2019-04-21 10:30:00         650          0      1
2  b1 2019-04-21 09:40:00 2019-04-21 10:10:00           0        100      1
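
As an aside, the two list comprehensions can be replaced with a vectorized numpy.where, which takes the matching sum_cost or 0 in one shot. A sketch of that same step inside process_df:

import numpy as np

# Vectorized equivalents of the conditional columns above:
# take sum_cost where the type matches, otherwise 0.
df['pound_cost'] = np.where(df['type'] == 'pound', df['sum_cost'], 0)
df['euro_cost'] = np.where(df['type'] == 'euro', df['sum_cost'], 0)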

To load into the main SQL table, you'd have to open a connection with SQLAlchemy or pyodbc. Then, assuming the data types match, you can use pandas.DataFrame.to_sql() with if_exists='append' to add the data (DataFrame.append() only appends rows to another dataframe, not to a database).
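
A minimal sketch with SQLAlchemy, assuming a placeholder connection string and that maintable already exists:

from sqlalchemy import create_engine

# Placeholder connection string -- substitute your own credentials.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# Rename 'id' to match the table's '_id' column, then append the rows.
data1.rename(columns={'id': '_id'}).to_sql(
    'maintable', engine, if_exists='append', index=False)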
