How to filter a large csv file with Python 3
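I have this csv file, but I only need to keep part of its data, using Python 3.
In short, the csv holds a lot of aggregated covid19 data, and I need only a small part of it: for every date, the new deaths per million for Italy, Sweden, Germany and France, and nothing else.
Then I want to create another CSV with this layout:
Date,Italy,Sweden,Germany,France
(e.g., 01-Apr-2020,13.35,4.52,1.22,6.74)
My code is as follows: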
from datetime import datetime

cases_by_day = dict()
location = {'Italy': 0.0, 'Sweden': 0.0, 'France': 0.0, 'Germany': 0.0}
with open("data.csv") as f:
    v = f.readlines()
    for line in v:
        elements = line.split(",")
        # print(elements)
        date = datetime.strptime(elements[3], "%Y-%m-%d")
        cases_by_day[str(elements[3])] = location
with open("data.csv") as h:
    for line in h:
        a = line.split(",")
        if "Italy" in a[2]:
            u = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[u]["Italy"] = 0.0
            else:
                # print(float(a[15]))
                # print(u)
                cases_by_day[u]["Italy"] = float(a[15])
                # print(cases_by_day[u]["Italy"])
        elif "Sweden" in a[2]:
            i = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[i]["Sweden"] = 0.0
            else:
                cases_by_day[i]["Sweden"] = float(a[15])
        elif "France" in a[2]:
            o = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[o]["France"] = 0.0
            else:
                cases_by_day[o]["France"] = float(a[15])
        elif "Germany" in a[2]:
            p = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[p]["Germany"] = 0.0
            else:
                cases_by_day.get(p)["Germany"] = float(a[15])
print(cases_by_day)
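However, at the end of the process every date key holds the same nested dictionary, and I don't know why.
Edit: the header row of data.csv looks like this: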
iso_code, continent, location, date, total_cases, new_cases, new_cases_smoothed, total_deaths, new_deaths, new_deaths_smoothed, total_cases_per_million, new_cases_per_million, new_cases_smoothed_per_million, total_deaths_per_million, new_deaths_per_million, new_deaths_smoothed_per_million, 16 reproduction_rate, icu_patients, icu_patients_per_million, hosp_patients, hosp_patients_per_million, weekly_icu_admissions, weekly_icu_admissions_per_million, weekly_hosp_admissions, weekly_hosp_admissions_per_million , total_tests, new_tests, total_tests_per_thousand, new_tests_per_thousand, new_tests_smoothed, new_tests_smoothed_per_thousand, positive_rate, tests_per_case, tests_units, stringency_index, population, population_density, median_age, aged_65_older, aged_70_older, gdp_per_capita, extreme_poverty, cardiovasc_death_rate, diabetes_prevalence, female_smokers, male_smokers, handwashing_facilities, hospital_beds_per_thousand, life_expectancy, human_development_index
The columns I'm interested in are 2, 3 and 15 (counting from zero). I don't want data from any other country.
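In your code you create only one instance of the "nested" part of the dictionary, and every entry in your cases_by_day dictionary then points to that same instance. So you do not have separate dictionaries per date, only many references to the same one. This is the problem line: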
cases_by_day[str(elements[3])] = location
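To make the aliasing concrete, here is a minimal sketch with made-up dates and values. The helper names build_shared and build_independent are hypothetical and exist only for this illustration.

```python
def build_shared(days):
    """Store the SAME dict object under every key (the bug from the question)."""
    location = {"Italy": 0.0, "Sweden": 0.0}
    return {day: location for day in days}


def build_independent(days):
    """Create a fresh dict for every key, so writes stay local to one day."""
    return {day: {"Italy": 0.0, "Sweden": 0.0} for day in days}


days = ["2020-04-01", "2020-04-02"]

shared = build_shared(days)
shared["2020-04-01"]["Italy"] = 13.35
# Both keys refer to one object, so the write is visible under the other day too.
leaked_value = shared["2020-04-02"]["Italy"]

independent = build_independent(days)
independent["2020-04-01"]["Italy"] = 13.35
# Each day owns its dict, so the other day keeps its zero.
untouched_value = independent["2020-04-02"]["Italy"]
```

Assignment in Python binds a name or key to an existing object; it never copies. That is why the last row processed for each country ends up visible under every date.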
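I would suggest a couple of things. If you want to keep the data[day][country] layout with a "zero" representation, just create a new dictionary of zeros each time you come across a new date. Then you only need to read the file once. You were close.
Depending on what you want to do with the data, a pandas solution may help. If you want dictionary access, keep going with the fix above, and comment back if you get stuck!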
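The single-pass fix described above might look roughly like the sketch below. It is not the asker's code: the function name parse_deaths and its parameters are made up for illustration, the column positions 2, 3 and 15 are taken from the question, and it assumes the file has one header row and that fields may carry a space after each comma (as the posted header does).

```python
import csv

COUNTRIES = ("Italy", "Sweden", "Germany", "France")


def parse_deaths(lines, countries=COUNTRIES,
                 country_col=2, date_col=3, value_col=15):
    """Build {date: {country: value}} in one pass over an iterable of CSV lines."""
    cases_by_day = {}
    # skipinitialspace drops the blank that follows each comma in this file
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # assumption: the first line is the header row
    for row in reader:
        if len(row) <= value_col:
            continue  # skip short or malformed rows
        country = row[country_col].strip()
        if country not in countries:
            continue  # ignore every other country
        day = row[date_col].strip()
        if day not in cases_by_day:
            # a NEW dict per date, so no two dates share state
            cases_by_day[day] = {c: 0.0 for c in countries}
        raw = row[value_col].strip()
        cases_by_day[day][country] = float(raw) if raw else 0.0
    return cases_by_day


# Possible usage with the file from the question:
# with open("data.csv", newline="") as f:
#     cases_by_day = parse_deaths(f)
```

Using the csv module instead of line.split(",") also copes with quoted fields that contain commas, which a plain split would break apart.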
I would look into the pandas module:
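import pandas as pd

df = pd.read_csv('data.csv')
# The column names keep the leading space that follows each comma in the header
cols = [' continent', ' location', ' new_deaths_per_million']
# Placeholder: replace with the countries you need
subset = ['list of countries needed']
dff = df.loc[df[' location'].isin(subset)]
dff[cols].to_csv('nameofyourfile.csv')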
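Building on the pandas idea, the Date,Italy,Sweden,Germany,France layout from the question can be reached by pivoting the filtered rows. The following is only a sketch: pivot_deaths is a hypothetical helper, it assumes each (date, country) pair occurs at most once, and it passes skipinitialspace=True so that column names and values lose the space after each comma. Going by the header posted in the question, index 15 is new_deaths_smoothed_per_million, so pass that as value_col if it is the column you actually want.

```python
import pandas as pd

COUNTRIES = ["Italy", "Sweden", "Germany", "France"]


def pivot_deaths(source, countries=COUNTRIES,
                 value_col="new_deaths_per_million"):
    """Return a date-by-country table; `source` is a path or file-like object."""
    df = pd.read_csv(source, skipinitialspace=True,
                     usecols=["location", "date", value_col])
    df = df[df["location"].isin(countries)].copy()
    df["date"] = pd.to_datetime(df["date"])
    # one row per date, one column per country
    wide = df.pivot(index="date", columns="location", values=value_col)
    # fix the column order and add any country that had no rows at all
    wide = wide.reindex(columns=countries).fillna(0.0)
    # render dates like 01-Apr-2020, as in the requested output
    wide.index = wide.index.strftime("%d-%b-%Y")
    wide.index.name = "Date"
    return wide


# Possible usage with the file from the question:
# pivot_deaths("data.csv").to_csv("filtered.csv")
```

Reading only the three needed columns with usecols keeps memory use low on a large file.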
I got it working with the following edit to the code (only the important part):
italy = dict()
sweden = dict()
germany = dict()
france = dict()
cases_by_day = dict()
with open("data.csv") as f:
    v = f.readlines()
    for line in v:
        elements = line.split(",")
        # print(elements)
        date = datetime.strptime(elements[3], "%Y-%m-%d")
        italy[str(elements[3])] = 0.0
        sweden[str(elements[3])] = 0.0
        germany[str(elements[3])] = 0.0
        france[str(elements[3])] = 0.0
with open("data.csv") as h:
    for line in h:
        a = line.split(",")
        if "Italy" in a[2]:
            u = str(a[3])
            if len(a[15]) == 0:
                italy[u] = float(0.0)
            else:
                # print(cases_by_day[u]["Italy"])
                # print(u)
                italy[u] = float(a[15])
                # print(cases_by_day[u]["Italy"])
        elif "Sweden" in a[2]:
            i = str(a[3])
            if len(a[15]) == 0:
                sweden[i] = 0.0
            else:
                sweden[i] = float(a[15])
        elif "France" in a[2]:
            o = str(a[3])
            if len(a[15]) == 0:
                france[o] = 0.0
            else:
                france[o] = float(a[15])
        elif "Germany" in a[2]:
            p = str(a[3])
            if len(a[15]) == 0:
                germany[p] = 0.0
            else:
                germany[p] = float(a[15])
So I basically split my dict into one dict per country.
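The thread stops before the output file is written. One way to turn per-country dicts like these into the layout requested in the question is sketched below. write_wide_csv is a hypothetical helper, not part of the original code; it assumes the keys are clean YYYY-MM-DD strings and treats a date missing for a country as 0.0.

```python
import csv
from datetime import datetime


def write_wide_csv(out, per_country):
    """Write one row per date and one column per country to a file-like object.

    `per_country` maps a country name to its {date_string: value} dict.
    """
    countries = list(per_country)
    # ISO dates sort chronologically when sorted as plain strings
    all_days = sorted({day for series in per_country.values() for day in series})
    writer = csv.writer(out)
    writer.writerow(["Date", *countries])
    for day in all_days:
        # 2020-04-01 becomes 01-Apr-2020, as in the requested output
        label = datetime.strptime(day, "%Y-%m-%d").strftime("%d-%b-%Y")
        values = [per_country[c].get(day, 0.0) for c in countries]
        writer.writerow([label, *values])


# Possible usage with the four dicts built above:
# with open("filtered.csv", "w", newline="") as out:
#     write_wide_csv(out, {"Italy": italy, "Sweden": sweden,
#                          "Germany": germany, "France": france})
```

The order of the keys in the mapping decides the order of the columns, so pass the countries in the order you want them to appear.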