
Sqlalchemy: add new rows from a pandas dataframe into a MySQL table, if they don't already exist in the table

I created a table by inserting data that I fetched from an API and stored in a pandas dataframe, using SQLAlchemy. I need to query the API every 4 hours to get new data. The problem is that the API returns not only the new data but also the old data that is already imported into MySQL. How can I import just the new data into the MySQL table?

I retrieved the data from the API, stored it in a pandas object, created the connection to the MySQL database, and created a fresh table.

import requests
import json
import pandas as pd
myToken = 'xxx'
myUrl = 'somewebsite'
head = {'Authorization': 'token {}'.format(myToken)}
response = requests.get(myUrl, headers=head)
data=response.json()
#print(json.dumps(data, indent=4, sort_keys=True))
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0)
results=pd.json_normalize(data['results'])
results.rename(columns={'datastream.name': 'datastream_name',                    
                        'datastream.url':'datastream_url',
                        'datastream.datastream_type_id':'datastream_id',
                        'start':'error_date'}, inplace=True)

results_final=pd.DataFrame([results.datastream_name,
                            results.datastream_url, 
                            results.error_date, 
                            results.datastream_id,
                            results.message,
                            results.type_label]).transpose()

from sqlalchemy import create_engine
from sqlalchemy import exc
engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()
results_final.to_sql(name='error',con=con,if_exists='replace')
con.close()
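
For reference, here is a minimal sketch of the flattening step on a made-up payload. The payload shape, the URLs, and the values are assumptions for illustration only, not the real API response. It uses pd.json_normalize (pandas 1.0 or later) and selects the wanted columns by name, which keeps each column's own dtype.

```python
import pandas as pd

# Hypothetical payload, shaped like the 'results' list the question refers to.
payload = {
    "results": [
        {"datastream": {"name": "sales", "url": "https://example.com/ds/1",
                        "datastream_type_id": 7},
         "start": "2019-05-01T08:00:00Z", "message": "timeout",
         "type_label": "error"},
        {"datastream": {"name": "stock", "url": "https://example.com/ds/2",
                        "datastream_type_id": 9},
         "start": "2019-05-01T12:00:00Z", "message": "bad gateway",
         "type_label": "error"},
    ]
}

# Nested keys are flattened into dotted column names such as 'datastream.name'.
results = pd.json_normalize(payload["results"])
results = results.rename(columns={
    "datastream.name": "datastream_name",
    "datastream.url": "datastream_url",
    "datastream.datastream_type_id": "datastream_id",
    "start": "error_date",
})

# Select the columns of interest by name, in the order wanted for the table.
wanted = ["datastream_name", "datastream_url", "error_date",
          "datastream_id", "message", "type_label"]
results_final = results[wanted]
```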

The end goal is to insert into the table only the rows coming from the API that are not already there.

You could pull the results already in the database into a new dataframe and then compare the two dataframes. After that you would insert only the rows that are not in the table. Not knowing the format of your table or data, I'm just using a generic SELECT statement here.

from sqlalchemy import create_engine
engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()
# Generic query: replace table_name with your table (the question uses 'error')
sql = "SELECT * FROM table_name"
old_results = pd.read_sql(sql, con)
# The outer merge tags each row as left_only, right_only, or both
df = pd.merge(old_results, results_final, how='outer', indicator=True)
# right_only rows exist in the API data but not yet in the table
new_results = df[df['_merge']=='right_only'][results_final.columns]
new_results.to_sql(name='error',con=con,if_exists='append')
con.close()

You also need to change if_exists to append, because when it is set to replace, to_sql drops the existing table and recreates it from the values in the pandas dataframe.
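
To try the compare-then-append approach end to end without a MySQL server, here is a small sketch against an in-memory SQLite database. The helper name rows_not_in_table, the sample rows, and the use of sqlite3 are assumptions for illustration; with MySQL you would pass the SQLAlchemy connection instead.

```python
import sqlite3
import pandas as pd

def rows_not_in_table(old: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `incoming` that have no exact match in `old`."""
    # With no `on` argument, merge joins on every column the frames share.
    merged = pd.merge(old, incoming, how="outer", indicator=True)
    only_new = merged.loc[merged["_merge"] == "right_only", incoming.columns]
    return only_new.reset_index(drop=True)

con = sqlite3.connect(":memory:")

# First API call: create the table.
first_batch = pd.DataFrame({
    "datastream_id": [1, 2],
    "message": ["timeout", "bad gateway"],
})
first_batch.to_sql("error", con, index=False, if_exists="replace")

# Later API call: returns the old rows again plus one new row.
second_batch = pd.DataFrame({
    "datastream_id": [1, 2, 3],
    "message": ["timeout", "bad gateway", "disk full"],
})

old_results = pd.read_sql("SELECT * FROM error", con)
new_results = rows_not_in_table(old_results, second_batch)

# Append only the unseen rows; 'replace' here would wipe the existing table.
new_results.to_sql("error", con, index=False, if_exists="append")

row_count = int(pd.read_sql("SELECT COUNT(*) AS n FROM error", con)["n"].iloc[0])
con.close()
```

Note that the merge compares every shared column, so a row counts as already present only if all of its values match. Values that change type on the round trip through the database may need casting before the merge.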

I developed this function to handle both cases: new values, and source and target tables whose columns are not equal.

import pandas as pd
from sqlalchemy import create_engine, text

def load_data(df):
    engine = create_engine('mysql+pymysql://root:pass@localhost/dw', echo_pool=True, pool_size=10, max_overflow=20)
    # 'table' is a placeholder; substitute the real table name.
    with engine.connect() as conn, conn.begin():
        try:
            df_old = pd.read_sql('SELECT * FROM table', conn)

            # Check whether there are new rows to be inserted
            # (the original referred to an undefined df_bulbr; df is assumed here)
            if len(df) > len(df_old) and df.disconnected_time.all() != df_old.disconnected_time.all():
                print("There are new rows to be inserted.")

                # Keep only the rows that appear in df but not in the table
                df_merged = pd.merge(df_old, df, how='outer', indicator=True)
                df_final = df_merged[df_merged['_merge'] == 'right_only'][df.columns]
                df_final.to_sql(name='table', con=conn, index=False, if_exists='append')

        except Exception as err:
            print(str(err))

        else:
            # Handles the case where df has more columns than the target table
            if df.shape[1] > df_old.shape[1]:
                data = pd.read_sql('SELECT * FROM table', conn)
                df2 = pd.concat([df, data])
                df2.to_sql('table', conn, index=False, if_exists='replace')

        # text() keeps the raw SQL string usable on SQLAlchemy 1.4 and 2.0
        outcome = conn.execute(text("select count(1) from table"))
        countRow = outcome.first()[0]

    print(f"Total of {countRow} rows loaded.")
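
The column-mismatch case can be tried in isolation with the sketch below, which uses an in-memory SQLite database. The helper name append_or_rebuild, the events table, and the sample rows are hypothetical. Treating rows that agree on the shared columns as the same record (and keeping the wider, incoming version) is my assumption, not something the function above does.

```python
import sqlite3
import pandas as pd

def append_or_rebuild(df: pd.DataFrame, table: str, con) -> int:
    """Append unseen rows, or rebuild the table if df brings new columns.

    Hypothetical helper; returns the table's row count afterwards.
    """
    old = pd.read_sql(f'SELECT * FROM "{table}"', con)
    shared = [c for c in df.columns if c in old.columns]
    extra = [c for c in df.columns if c not in old.columns]

    if extra:
        # Incoming data is wider: rewrite the table with the union of columns.
        # Assumption: rows equal on the shared columns are the same record,
        # so the incoming (wider) version is the one kept.
        combined = pd.concat([old, df], ignore_index=True)
        combined = combined.drop_duplicates(subset=shared, keep="last")
        combined.to_sql(table, con, index=False, if_exists="replace")
    else:
        # Same columns: anti-join and append only the unseen rows.
        merged = pd.merge(old, df, how="outer", indicator=True)
        new_rows = merged.loc[merged["_merge"] == "right_only", df.columns]
        new_rows.to_sql(table, con, index=False, if_exists="append")

    count = pd.read_sql(f'SELECT COUNT(*) AS n FROM "{table}"', con)
    return int(count["n"].iloc[0])

con = sqlite3.connect(":memory:")
pd.DataFrame({"id": [1, 2], "msg": ["a", "b"]}).to_sql("events", con, index=False)

# Incoming frame has an extra 'severity' column, so the table is rebuilt.
wider = pd.DataFrame({"id": [2, 3], "msg": ["b", "c"], "severity": ["low", "high"]})
first_count = append_or_rebuild(wider, "events", con)

# Same columns as the rebuilt table, so only the unseen row is appended.
same = pd.DataFrame({"id": [3, 4], "msg": ["c", "d"], "severity": ["high", "low"]})
second_count = append_or_rebuild(same, "events", con)

final_columns = list(pd.read_sql('SELECT * FROM "events"', con).columns)
con.close()
```

Rebuilding with if_exists='replace' rewrites the whole table, so on a large MySQL table it may be preferable to add the missing columns with ALTER TABLE and keep appending.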


 