How to update a CSV file with pandas without adding duplicates

Question

I'm trying to get some data off the web, and it's taking a while. In case anything goes wrong, I've been periodically saving the data to a CSV file.

However, each save just appends a new copy of the whole dataframe to the CSV file, so the file ends up with loads of duplicate rows.

df.to_csv('data.csv', mode='a', header=False)

is the command I'm using to save my progress.

Thanks for reading.

Answer 1

IIUC, you have a single dataframe that you append to over time and that you want to back up periodically.

There are multiple approaches you could try:

If writing the file is fast, don't append; just write the complete dataframe every time (writing the header could be useful in this case):

df.to_csv('data.csv', header=False)  # or header=True
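
As a rough sketch of this first approach (not part of the original answer), the full rewrite can go through a temporary file so that an interruption in the middle of a save does not leave a half-written data.csv behind. The helper name save_snapshot and the temporary-file step are assumptions added for illustration; the plain to_csv call above does the same job when that risk does not matter.

```python
import os
import tempfile

import pandas as pd


def save_snapshot(df: pd.DataFrame, path: str) -> None:
    """Overwrite `path` with the complete dataframe (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=directory)
    os.close(fd)
    try:
        df.to_csv(tmp_path, header=True)
        # Swap the finished file into place; on POSIX systems this
        # rename is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if the save did not complete.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the file is replaced rather than appended to, calling save_snapshot(df, 'data.csv') repeatedly never duplicates rows; the cost is rewriting everything on each save.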

Keep track of which rows you have already written and append only the new ones:

# (i) First time write the complete dataframe
df.to_csv('data.csv', header=False)  # or header=True

# (ii) store the length of the dataframe at that point
lines_written = len(df.index)

# More data is being added to the dataframe from the web

# (iii) append new lines to CSV file
df.iloc[lines_written:].to_csv('data.csv', mode='a', header=False)

# (iv) update the line counter
lines_written = len(df.index)

# repeat steps (iii) and (iv)
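
Steps (i) to (iv) could be wrapped in a small helper, sketched below. The function name append_new_rows is hypothetical. It also assumes that rows are only ever added at the end of the dataframe and that already-saved rows are never edited or reordered; otherwise the counter no longer matches the file.

```python
import pandas as pd


def append_new_rows(df: pd.DataFrame, path: str, lines_written: int) -> int:
    """Save rows not yet written to `path` and return the new row count."""
    new_rows = df.iloc[lines_written:]
    if lines_written == 0:
        # First save: create (or overwrite) the file with everything so far.
        new_rows.to_csv(path, mode="w", header=False)
    elif not new_rows.empty:
        # Later saves: append only the rows added since the last save.
        new_rows.to_csv(path, mode="a", header=False)
    # The caller keeps this value and passes it back in next time.
    return len(df.index)
```

Start with lines_written = 0 and call lines_written = append_new_rows(df, 'data.csv', lines_written) after each batch of new data. The counter lives only in memory, so after a restart it would have to be restored, for example by counting the rows already in the file.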

Answer 1
0 accepted 2020-07-05 21:29:32
