Compare List against CSV file
I have an RSS feed I want to grab data from, manipulate, and then save to a CSV file. The RSS feed's refresh rate is a big window, 1 minute to several hours, and it only holds 100 items at a time. So to capture everything, I'm looking to have my script run every minute. The problem with this is that if the script runs before the feed updates, I will be grabbing past data, which leads to duplicate rows being added to the CSV.
I tried using the examples mentioned here, but they kept erroring out.
Data flow: RSS Feed --> Python script --> CSV file
Sample data and code below:
Sample data from CSV:
gandcrab,acad5fc7ebe8c6979d98cb8537e3a247,18bb2c3b82649314dfd45a379058869804954276,bf0ac94c6ae6f1ecfcccc049ae2373bfc659b2efb2e48e824e2e78fb43b6ebef,54,C
Sample data from list:
zeus,186e84c5fd7da7331a62f1f13b1f4608,3c34aee767859fd75eb0c8c701716cbfd5655437,05c8e4f01ec8d4e6f4595db93bbcc0f85386c9f1b82b5833d983c9092640573a,49,C
Code for comparing:
if trends_f.is_file():
    with open('trendsv3.csv', 'r+', newline='') as csv_file:
        h_reader = csv.reader(csv_file)
        next(h_reader)  # skip the CSV header row
        # Should I load the CSV into a list and then compare it against the other list?
        # Or is there an easier, faster, more efficient way?
I would recommend downloading everything into a CSV, and then deduplicating in batches (e.g. nightly) to generate a new "clean" CSV for whatever you're working on.
To dedup, load the data with the pandas library and then use the drop_duplicates function on the data.
http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
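A minimal sketch of that batch dedup step. The column names here are assumptions based on the sample rows above (malware family, three hashes, a score, and a grade); in the real script you would read the CSV with pd.read_csv and write the result back out with to_csv.

```python
import pandas as pd

# Assumed column names, inferred from the sample rows; adjust to the real feed.
columns = ["family", "md5", "sha1", "sha256", "score", "grade"]
rows = [
    ["gandcrab", "acad5fc7", "18bb2c3b", "bf0ac94c", 54, "C"],
    ["zeus",     "186e84c5", "3c34aee7", "05c8e4f0", 49, "C"],
    ["gandcrab", "acad5fc7", "18bb2c3b", "bf0ac94c", 54, "C"],  # duplicate row
]
df = pd.DataFrame(rows, columns=columns)

# Keep the first occurrence of each unique hash triple, dropping the rest.
deduped = df.drop_duplicates(subset=["md5", "sha1", "sha256"], keep="first")
print(len(deduped))  # 2 unique rows remain
```

Passing subset= lets you deduplicate on just the hash columns, so rows that differ only in score or grade are still treated as the same sample.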
Adding the ID from the feed seemed to make things easiest to check against. Thanks @blhsing for mentioning that. I ended up reading the IDs from the CSV into a list and checking the new data's IDs against that. There may be a faster, more efficient way, but this works for me.
Code to check the CSV before saving to it:
if trends_f.is_file():
    csv_list = []
    with open('trendsv3.csv', 'r') as csv_file:
        h_reader = csv.reader(csv_file, delimiter=',')
        next(h_reader, None)  # skip the header row
        for row in h_reader:
            csv_list.append(row[6])  # column 6 holds the feed ID
    with open('trendsv3.csv', 'a', newline='') as csv_file:
        h_writer = csv.writer(csv_file)
        for entry in data_list:
            if entry[6].strip() not in csv_list:
                print(entry[6], 'is not in the list, saving it')
                h_writer.writerow(entry)
            else:
                print(entry[6], 'is in the list')
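On the "faster, more efficient way" point: membership tests against a list scan every element, so one easy win is to collect the seen IDs into a set instead, which gives constant-time lookups. A self-contained sketch under the same assumptions as the snippet above (ID in column 6; data_list stands in for the parsed feed entries, with hypothetical placeholder values):

```python
import csv
import os

CSV_PATH = 'trendsv3.csv'

# Stand-in for the parsed feed entries; hashes and IDs are placeholders.
data_list = [
    ['zeus', 'md5', 'sha1', 'sha256', '49', 'C', 'id-002'],
    ['gandcrab', 'md5', 'sha1', 'sha256', '54', 'C', 'id-001'],
]

# Collect every ID already in the CSV into a set for O(1) lookups.
seen_ids = set()
if os.path.isfile(CSV_PATH):
    with open(CSV_PATH, 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        seen_ids = {row[6] for row in reader}

# Append only entries whose ID has not been seen yet.
with open(CSV_PATH, 'a', newline='') as f:
    writer = csv.writer(f)
    for entry in data_list:
        entry_id = entry[6].strip()
        if entry_id not in seen_ids:
            writer.writerow(entry)
            seen_ids.add(entry_id)  # also guards against dupes within this batch
```

Adding each new ID back into the set as it is written also catches duplicates that appear twice within the same feed pull, which the list version above would let through.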