
Pandas append and drop_duplicates mess up the index

I am scraping some data from a basketball site and the plan is to automatically update it when new data is added.

I get the data

import pandas as pd

# Grab every table on the page; the player stats table has a two-row header
stats = pd.read_html('URL', header=[0, 1])
player_stats = stats[4]

player_stats.to_csv('stats.csv')

Append it

# Append the fresh rows; header=False avoids repeating the header rows,
# but the index is still written for every appended row
with open('stats.csv', 'a') as f:
    player_stats.to_csv(f, header=False)

Remove duplicates (method 1)

# read_csv with the default header only uses the first header row,
# hence the odd flattened 'Unnamed: ...' column names
old_data = pd.read_csv('stats.csv')
data = old_data.drop_duplicates(subset='Unnamed: 1_level_0', keep='last')
data.to_csv('stats.csv')

Remove duplicates (method 2)

old_data = pd.read_csv('stats.csv')
# Mark every occurrence of a duplicated key except the last one
bool_series = old_data["Unnamed: 1_level_0"].duplicated(keep='last')
data = old_data[~bool_series]
data.to_csv('stats.csv')

The problem I face is that after the new data is appended to the original, the duplicate-removal step messes up the structure of the file. Future appends and duplicate removals then stop working, because duplicates are no longer recognized as such.
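A minimal sketch of what appears to happen on each read/write round trip; the toy frame below is made up for illustration:

import pandas as pd

# Stand-in for the scraped stats (made-up data)
df = pd.DataFrame({'Player': ['A', 'B'], 'PTS': [10, 20]})

df.to_csv('stats.csv')                 # writes the index as an unnamed first column
old = pd.read_csv('stats.csv')         # that column comes back as a data column
print(old.columns.tolist())            # ['Unnamed: 0', 'Player', 'PTS']

old.to_csv('stats.csv')                # writes yet another index column
print(pd.read_csv('stats.csv').columns.tolist())
# ['Unnamed: 0', 'Unnamed: 0.1', 'Player', 'PTS']

Every save/load cycle prepends one more column like this, so rows that were duplicates in the original data no longer compare equal.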

Why is a new index column added, and how do I fix that?

Instead of appending data directly to the file, use pandas' concat(). Be aware of the axis argument. The extra index columns appear because to_csv writes the index by default and read_csv then reads it back as an ordinary 'Unnamed: 0' column; pass index_col=0 when reading (or index=False when writing) so they stop accumulating.
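A minimal sketch of that approach, assuming the old file is re-read with the same two-row header it was written with; the dedupe key tuple mirrors the question and is a guess, so check it against data.columns:

import pandas as pd

# Fresh scrape, as in the question
stats = pd.read_html('URL', header=[0, 1])
player_stats = stats[4]

# Read the old file back the same way it was written: header=[0, 1]
# restores the two header rows, and index_col=0 swallows the saved
# index instead of letting it reappear as an 'Unnamed: 0' data column
old_data = pd.read_csv('stats.csv', header=[0, 1], index_col=0)

# axis=0 stacks the rows vertically (axis=1 would glue columns side by side)
data = pd.concat([old_data, player_stats], axis=0, ignore_index=True)

# With a two-level header the key is a (level 0, level 1) tuple;
# ('Unnamed: 1_level_0', 'Player') is an assumed example key
data = data.drop_duplicates(subset=[('Unnamed: 1_level_0', 'Player')], keep='last')

data.to_csv('stats.csv')

Because the whole file is rewritten in one pass, the index never gets duplicated into a data column, and the duplicate check keeps working on later runs.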
