
Pandas append and drop_duplicates mess up the index

I am scraping some data from a basketball site and the plan is to automatically update it when new data is added.

I get the data

import pandas as pd

# Grab every table on the page; the player stats table has a two-row header
stats = pd.read_html('URL', header=[0, 1])
player_stats = stats[4]

player_stats.to_csv('stats.csv')

Append it

# Append the fresh rows; header=False avoids repeating the header rows,
# but the index is still written for every appended row
with open('stats.csv', 'a') as f:
    player_stats.to_csv(f, header=False)

Remove duplicates (method 1)

# read_csv with the default header only uses the first header row,
# hence the odd flattened 'Unnamed: ...' column names
old_data = pd.read_csv('stats.csv')
data = old_data.drop_duplicates(subset='Unnamed: 1_level_0', keep='last')
data.to_csv('stats.csv')

Remove duplicates (method 2)

old_data = pd.read_csv('stats.csv')
# Mark every occurrence of a duplicated key except the last one
bool_series = old_data["Unnamed: 1_level_0"].duplicated(keep='last')
data = old_data[~bool_series]
data.to_csv('stats.csv')

The problem I face is that after the new data is appended to the original, the duplicate-removal step messes up the structure of the file. Future appends and duplicate removals then stop working, because duplicates are no longer recognized as such.
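A minimal sketch of what appears to happen on each read/write round trip; the toy frame below is made up for illustration:

import pandas as pd

# Stand-in for the scraped stats (made-up data)
df = pd.DataFrame({'Player': ['A', 'B'], 'PTS': [10, 20]})

df.to_csv('stats.csv')                 # writes the index as an unnamed first column
old = pd.read_csv('stats.csv')         # that column comes back as a data column
print(old.columns.tolist())            # ['Unnamed: 0', 'Player', 'PTS']

old.to_csv('stats.csv')                # writes yet another index column
print(pd.read_csv('stats.csv').columns.tolist())
# ['Unnamed: 0', 'Unnamed: 0.1', 'Player', 'PTS']

Every save/load cycle prepends one more column like this, so rows that were duplicates in the original data no longer compare equal.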

Why is a new index column added, and how do I fix that?

Instead of appending data directly to the file, use pandas' concat(). Be aware of the axis argument. The extra index columns appear because to_csv writes the index by default and read_csv then reads it back as an ordinary 'Unnamed: 0' column; pass index_col=0 when reading (or index=False when writing) so they stop accumulating.
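A minimal sketch of that approach, assuming the old file is re-read with the same two-row header it was written with; the dedupe key tuple mirrors the question and is a guess, so check it against data.columns:

import pandas as pd

# Fresh scrape, as in the question
stats = pd.read_html('URL', header=[0, 1])
player_stats = stats[4]

# Read the old file back the same way it was written: header=[0, 1]
# restores the two header rows, and index_col=0 swallows the saved
# index instead of letting it reappear as an 'Unnamed: 0' data column
old_data = pd.read_csv('stats.csv', header=[0, 1], index_col=0)

# axis=0 stacks the rows vertically (axis=1 would glue columns side by side)
data = pd.concat([old_data, player_stats], axis=0, ignore_index=True)

# With a two-level header the key is a (level 0, level 1) tuple;
# ('Unnamed: 1_level_0', 'Player') is an assumed example key
data = data.drop_duplicates(subset=[('Unnamed: 1_level_0', 'Player')], keep='last')

data.to_csv('stats.csv')

Because the whole file is rewritten in one pass, the index never gets duplicated into a data column, and the duplicate check keeps working on later runs.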
