
Pandas to_sql - append vs replace

I'm trying to understand how to adapt the to_sql function to my needs. Here's the DataFrame df_interface :

| YEAR | QUARTER | USER_ACCOUNT | BYTES         | USER_CODE |
|------|---------|--------------|---------------|-----------|
| 2020 | 2       | SHtte34      | 7392577516389 | 2320885   |
| 2020 | 2       | Exel2        | 441712306685  | 348995    |

I'm trying to insert this into a table USER_USAGE (via cx_Oracle and SQLAlchemy). The contents of this table before the insert are:

| YEAR | QUARTER | USER_ACCOUNT | BYTES         | USER_CODE |
|------|---------|--------------|---------------|-----------|
| 2020 | 1       | SHtte34      | 34560         | 2320885   |
| 2020 | 1       | Exel2        | 5478290       | 348995    |

I would like a new row inserted only in the case of a new quarter AND account. Basically I would like this after the insert:

| YEAR | QUARTER | USER_ACCOUNT | BYTES         | USER_CODE |
|------|---------|--------------|---------------|-----------|
| 2020 | 1       | SHtte34      | 34560         | 2320885   |
| 2020 | 1       | Exel2        | 5478290       | 348995    |
| 2020 | 2       | SHtte34      | 7392577516389 | 2320885   |
| 2020 | 2       | Exel2        | 441712306685  | 348995    |

Here's the code with "replace":

conn = create_engine('oracle+cx_oracle://{}:{}@{}/?service_name={}'.format(s_uid,s_pwd,s_db,s_service))

df_interface.to_sql(
    'USER_USAGE',
    conn,
    if_exists='replace',
    dtype={
        'USER_ACCOUNT': types.String(df_interface.USER_ACCOUNT.str.len().max()),
        'USER_CODE': types.String(df_interface.USER_CODE.str.len().max()),
    },
    index=False,
)

This seems to remove the previous quarter (1) values as well. Output after replace:

| YEAR | QUARTER | USER_ACCOUNT | BYTES         | USER_CODE |
|------|---------|--------------|---------------|-----------|
| 2020 | 2       | SHtte34      | 7392577516389 | 2320885   |
| 2020 | 2       | Exel2        | 441712306685  | 348995    |

Append is closer to what I want; however, if I accidentally run the program twice, I see duplicated rows:

| YEAR | QUARTER | USER_ACCOUNT | BYTES         | USER_CODE |
|------|---------|--------------|---------------|-----------|
| 2020 | 1       | SHtte34      | 34560         | 2320885   |
| 2020 | 1       | Exel2        | 5478290       | 348995    |
| 2020 | 2       | SHtte34      | 7392577516389 | 2320885   |
| 2020 | 2       | Exel2        | 441712306685  | 348995    |
| 2020 | 1       | SHtte34      | 34560         | 2320885   |
| 2020 | 1       | Exel2        | 5478290       | 348995    |
| 2020 | 2       | SHtte34      | 7392577516389 | 2320885   |
| 2020 | 2       | Exel2        | 441712306685  | 348995    |

How do I use the "append" but also prevent duplicates from getting created in case of an inadvertent run?

The if_exists argument refers to the table as a whole, not to individual rows within it. if_exists="replace" means "if the table exists then drop it and create a new one with the rows in the DataFrame", whereas if_exists="append" means "append the DataFrame rows to the existing table".
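A minimal sketch of the difference, using an in-memory SQLite engine purely for illustration (the Oracle engine in the question behaves the same way):

```python
import pandas as pd
import sqlalchemy as sa

# In-memory SQLite stand-in for the Oracle engine in the question.
engine = sa.create_engine("sqlite://")

old = pd.DataFrame({"YEAR": [2020, 2020], "QUARTER": [1, 1],
                    "BYTES": [34560, 5478290]})
old.to_sql("USER_USAGE", engine, index=False)

new = pd.DataFrame({"YEAR": [2020, 2020], "QUARTER": [2, 2],
                    "BYTES": [7392577516389, 441712306685]})

# "replace" drops the whole table first -- only the new rows survive.
new.to_sql("USER_USAGE", engine, if_exists="replace", index=False)
print(pd.read_sql("SELECT COUNT(*) AS n FROM USER_USAGE", engine))  # n == 2

# "append" keeps existing rows -- but every re-run adds the rows again.
old.to_sql("USER_USAGE", engine, if_exists="append", index=False)
print(pd.read_sql("SELECT COUNT(*) AS n FROM USER_USAGE", engine))  # n == 4
```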

If you may want to insert only some (or none) of the rows into an existing table, then you can't use to_sql to insert them directly. Instead, you could:

• Create a temporary table (e.g., USER_USAGE_TEMP) with the same structure as the main USER_USAGE table.

• Use to_sql to upload the DataFrame to the temp table (with if_exists="append").

• Execute an INSERT statement like

INSERT INTO USER_USAGE (YEAR, QUARTER, USER_ACCOUNT, BYTES, USER_CODE)
SELECT YEAR, QUARTER, USER_ACCOUNT, BYTES, USER_CODE FROM USER_USAGE_TEMP
WHERE NOT EXISTS (
    SELECT * FROM USER_USAGE UU
    WHERE UU.YEAR = USER_USAGE_TEMP.YEAR
      AND UU.QUARTER = USER_USAGE_TEMP.QUARTER
      AND UU.USER_ACCOUNT = USER_USAGE_TEMP.USER_ACCOUNT
    )
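The three steps above can be sketched end to end. This uses an in-memory SQLite engine and made-up sample rows so it is runnable as-is, but the same to_sql-then-INSERT ... WHERE NOT EXISTS pattern applies with the Oracle engine. Because the NOT EXISTS filter skips rows whose key combination already exists, re-running the program inserts nothing the second time:

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")  # SQLite stand-in for the Oracle engine

# Existing data (quarter 1) already in the main table.
existing = pd.DataFrame({"YEAR": [2020], "QUARTER": [1],
                         "USER_ACCOUNT": ["SHtte34"],
                         "BYTES": [34560], "USER_CODE": ["2320885"]})
existing.to_sql("USER_USAGE", engine, index=False)

# Incoming data: one row that already exists (quarter 1), one new row.
df_interface = pd.DataFrame({"YEAR": [2020, 2020], "QUARTER": [1, 2],
                             "USER_ACCOUNT": ["SHtte34", "SHtte34"],
                             "BYTES": [111, 7392577516389],
                             "USER_CODE": ["2320885", "2320885"]})

# Steps 1 + 2: load the DataFrame into the staging table.
df_interface.to_sql("USER_USAGE_TEMP", engine, if_exists="replace", index=False)

# Step 3: insert only rows whose (YEAR, QUARTER, USER_ACCOUNT) is not present.
insert_sql = sa.text("""
    INSERT INTO USER_USAGE (YEAR, QUARTER, USER_ACCOUNT, BYTES, USER_CODE)
    SELECT YEAR, QUARTER, USER_ACCOUNT, BYTES, USER_CODE FROM USER_USAGE_TEMP T
    WHERE NOT EXISTS (
        SELECT * FROM USER_USAGE UU
        WHERE UU.YEAR = T.YEAR AND UU.QUARTER = T.QUARTER
          AND UU.USER_ACCOUNT = T.USER_ACCOUNT
    )
""")
with engine.begin() as con:
    con.execute(insert_sql)

# Quarter-1 row is untouched; only the quarter-2 row was added.
print(pd.read_sql("SELECT * FROM USER_USAGE ORDER BY QUARTER", engine))
```

On Oracle specifically, a MERGE statement is another common way to express this kind of conditional insert.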

This problem can also be solved by first selecting the existing table from the database, combining it with your DataFrame via pd.concat() , and then writing the combined DataFrame back with to_sql() .

import pandas as pd
from sqlalchemy import text

def query(sql: str) -> pd.DataFrame:
    with get_engine().connect() as con:
        return pd.read_sql(text(sql), con)

# The new data you want to append
new_df = pd.DataFrame(...)

df_from_database = query('SELECT * FROM table_name')

# Combine both, then drop rows that already exist. Note that
# verify_integrity in pd.concat only raises on duplicate *index* labels,
# so drop_duplicates() on the data columns is the more reliable check.
combined = pd.concat([df_from_database, new_df]).drop_duplicates()

combined.to_sql('table_name', get_engine(), if_exists='replace', index=False)
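A pandas-side alternative that avoids rewriting the whole table is an anti-join: read the existing key columns, keep only the new rows whose key combination is absent, and append just those. A sketch with an in-memory SQLite engine and the question's key columns (swap in your own engine and keys):

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")  # illustration engine; use your own

existing = pd.DataFrame({"YEAR": [2020], "QUARTER": [1],
                         "USER_ACCOUNT": ["Exel2"], "BYTES": [5478290]})
existing.to_sql("USER_USAGE", engine, index=False)

new_df = pd.DataFrame({"YEAR": [2020, 2020], "QUARTER": [1, 2],
                       "USER_ACCOUNT": ["Exel2", "Exel2"],
                       "BYTES": [5478290, 441712306685]})

keys = ["YEAR", "QUARTER", "USER_ACCOUNT"]
in_db = pd.read_sql(f"SELECT {', '.join(keys)} FROM USER_USAGE", engine)

# Anti-join: keep only rows whose key combination is not already present.
merged = new_df.merge(in_db, on=keys, how="left", indicator=True)
to_insert = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

to_insert.to_sql("USER_USAGE", engine, if_exists="append", index=False)
```

Re-running this is safe: on the second run every key is already in the database, so nothing is appended.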
