How to filter a large csv file with Python 3

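I have this csv file, but I need to filter out only the data I need, using Python 3.

In short, the csv holds a lot of aggregated covid19 data, but I only need part of it: every date, and the new deaths per million for Italy, Sweden, Germany and France only, nothing else.

Then I want to create another CSV in this format:

Date,Italy,Sweden,Germany,France

(e.g. 01-Apr-2020,13.35,4.52,1.22,6.74)

My code is as follows: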

from datetime import datetime

cases_by_day = dict()
location = {'Italy': 0.0, 'Sweden': 0.0, 'France': 0.0, 'Germany': 0.0}

with open("data.csv") as f:
    v = f.readlines()
    for line in v:
        elements = line.split(",")
        # print(elements)
        date = datetime.strptime(elements[3], "%Y-%m-%d")
        cases_by_day[str(elements[3])] = location

with open("data.csv") as h:
    for line in h:
        a = line.split(",")
        if "Italy" in a[2]:
            u = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[u]["Italy"] = 0.0
            else:
                # print(float(a[15]))
                # print(u)
                cases_by_day[u]["Italy"] = float(a[15])
            # print(cases_by_day[u]["Italy"])
        elif "Sweden" in a[2]:
            i = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[i]["Sweden"] = 0.0
            else:
                cases_by_day[i]["Sweden"] = float(a[15])
        elif "France" in a[2]:
            o = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[o]["France"] = 0.0
            else:
                cases_by_day[o]["France"] = float(a[15])
        elif "Germany" in a[2]:
            p = str(a[3])
            if len(a[15]) == 0:
                cases_by_day[p]["Germany"] = 0.0
            else:
                cases_by_day.get(p)["Germany"] = float(a[15])

print(cases_by_day)

However, at the end of the process every date key holds the same nested dictionary, and I don't know why.

Edit: the header of data.csv is as follows:

iso_code, continent, location, date, total_cases, new_cases, new_cases_smoothed, total_deaths, new_deaths, new_deaths_smoothed, total_cases_per_million, new_cases_per_million, new_cases_smoothed_per_million, total_deaths_per_million, new_deaths_per_million, new_deaths_smoothed_per_million, reproduction_rate, icu_patients, icu_patients_per_million, hosp_patients, hosp_patients_per_million, weekly_icu_admissions, weekly_icu_admissions_per_million, weekly_hosp_admissions, weekly_hosp_admissions_per_million, total_tests, new_tests, total_tests_per_thousand, new_tests_per_thousand, new_tests_smoothed, new_tests_smoothed_per_thousand, positive_rate, tests_per_case, tests_units, stringency_index, population, population_density, median_age, aged_65_older, aged_70_older, gdp_per_capita, extreme_poverty, cardiovasc_death_rate, diabetes_prevalence, female_smokers, male_smokers, handwashing_facilities, hospital_beds_per_thousand, life_expectancy, human_development_index

The columns I'm interested in are 2, 3 and 15 (counting from zero). But I don't want the data from other countries.

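In your code, you create only one instance of the "nested" dictionary, and every entry in your cases_by_day dictionary points to that same instance. So you just have multiple references to the same object, not separate copies. This is the problem line: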

cases_by_day[str(elements[3])] = location
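
To make the aliasing visible, here is a minimal sketch with made-up dates and values (the names `defaults`, `shared` and `independent` are only for illustration and do not appear in the original code). It contrasts storing the same dict under every key with storing a fresh copy per key.

```python
days = ['2020-04-01', '2020-04-02']

# Buggy pattern: every date key refers to the very same dict object.
defaults = {'Italy': 0.0, 'Sweden': 0.0}
shared = {day: defaults for day in days}
shared['2020-04-01']['Italy'] = 13.35
same_object = shared['2020-04-01'] is shared['2020-04-02']
leaked_value = shared['2020-04-02']['Italy']  # also changed, although never assigned

# Fixed pattern: dict(...) builds a separate shallow copy for each date.
fresh_defaults = {'Italy': 0.0, 'Sweden': 0.0}
independent = {day: dict(fresh_defaults) for day in days}
independent['2020-04-01']['Italy'] = 13.35
untouched_value = independent['2020-04-02']['Italy']  # still the default

print(same_object, leaked_value, untouched_value)
```

A shallow copy is enough here because the nested values are plain floats.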

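I would suggest a couple of things. If you want to keep the data[day][country] format and have a "zero" representation, just make a new (empty) dictionary on the fly each time you find a new date. Then you only need to read the file once. You were close.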
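
The single-pass idea could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: the function name `collect_by_day` and the helper `sample_line` are hypothetical, the column positions 2, 3 and 15 are taken from the question, and the standard `csv` module replaces the manual `split(",")`.

```python
import csv
import io

COUNTRIES = ('Italy', 'Sweden', 'Germany', 'France')

def collect_by_day(rows, countries=COUNTRIES, value_col=15):
    """Build {date: {country: value}} from an iterable of CSV rows (lists of str)."""
    cases_by_day = {}
    for row in rows:
        if len(row) <= value_col:
            continue  # skip short or empty lines
        country = row[2].strip()
        if country not in countries:
            continue  # skips other countries and the header row
        date = row[3].strip()
        if date not in cases_by_day:
            # A brand-new dict for each date, so no two dates share one object.
            cases_by_day[date] = {name: 0.0 for name in countries}
        raw = row[value_col].strip()
        cases_by_day[date][country] = float(raw) if raw else 0.0
    return cases_by_day

def sample_line(country, date, value):
    """Hypothetical helper: a 16-column CSV line with only columns 2, 3, 15 filled."""
    fields = [''] * 16
    fields[2], fields[3], fields[15] = country, date, value
    return ','.join(fields)

# Tiny in-memory stand-in for data.csv with made-up values.
sample = io.StringIO('\n'.join([
    sample_line('Italy', '2020-04-01', '13.35'),
    sample_line('Sweden', '2020-04-01', '4.52'),
    sample_line('Spain', '2020-04-01', '9.99'),
    sample_line('Italy', '2020-04-02', ''),
]))
result = collect_by_day(csv.reader(sample))
print(result)
```

With the real file, the same function would be fed `csv.reader(f)` inside `with open("data.csv", newline="") as f:`. Because the reader yields one row at a time, the whole file never has to sit in memory, which matters for a large csv.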

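Depending on what you want to do with the data, a pandas solution may help. If you want to keep working with dictionaries, go with the fix above, and reply in a comment if you get stuck!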

I would look into the pandas module:

import pandas as pd

df = pd.read_csv('data.csv')
# the column names carry a leading space because the posted header has a space after each comma
cols = [' continent', ' location', ' new_deaths_per_million']
subset = ['list of countries needed']
dff = df.loc[df[' location'].isin(subset)]
dff[cols].to_csv('nameofyourfile.csv')
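
To get from the filtered rows to the date-by-country layout asked for in the question, one option is to pivot the frame. The sketch below is an assumption-laden illustration: it uses a tiny in-memory sample with made-up values instead of data.csv, passes `skipinitialspace=True` so that column names lose the space after each comma, and assumes each (date, location) pair occurs only once, which `pivot` requires.

```python
import io
import pandas as pd

countries = ['Italy', 'Sweden', 'Germany', 'France']

# Tiny stand-in for data.csv; note the space after each comma, as in the posted header.
sample = io.StringIO(
    "location, date, new_deaths_per_million\n"
    "Italy, 2020-04-01, 13.35\n"
    "Sweden, 2020-04-01, 4.52\n"
    "Spain, 2020-04-01, 9.99\n"
    "France, 2020-04-01, 6.74\n"
    "Italy, 2020-04-02, 12.6\n"
)

# skipinitialspace=True drops the blank after each delimiter.
df = pd.read_csv(sample, skipinitialspace=True)

# Keep only the wanted countries, then turn countries into columns.
wanted = df[df['location'].isin(countries)]
wide = wanted.pivot(index='date', columns='location',
                    values='new_deaths_per_million')

# Fix the column order and fill dates or countries with no value with 0.0.
wide = wide.reindex(columns=countries).fillna(0.0)

csv_text = wide.to_csv()
print(csv_text)
```

Writing to disk would be `wide.to_csv('nameofyourfile.csv')` instead of capturing the text.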

I got it working with the following code edit (only the important part):

italy = dict()
sweden = dict()
germany = dict()
france = dict()
cases_by_day = dict()

with open("data.csv") as f:
    v = f.readlines()
    for line in v:
        elements = line.split(",")
        # print(elements)
        date = datetime.strptime(elements[3], "%Y-%m-%d")
        italy[str(elements[3])] = 0.0
        sweden[str(elements[3])] = 0.0
        germany[str(elements[3])] = 0.0
        france[str(elements[3])] = 0.0

with open("data.csv") as h:
    for line in h:
        a = line.split(",")
        if "Italy" in a[2]:
            u = str(a[3])
            if len(a[15]) == 0:
                italy[u] = float(0.0)
            else:
                # print(cases_by_day[u]["Italy"])
                # print(u)
                italy[u] = float(a[15])
            # print(cases_by_day[u]["Italy"])
        elif "Sweden" in a[2]:
            i = str(a[3])
            if len(a[15]) == 0:
                sweden[i] = 0.0
            else:
                sweden[i] = float(a[15])
        elif "France" in a[2]:
            o = str(a[3])
            if len(a[15]) == 0:
                france[o] = 0.0
            else:
                france[o] = float(a[15])
        elif "Germany" in a[2]:
            p = str(a[3])
            if len(a[15]) == 0:
                germany[p] = 0.0
            else:
                germany[p] = float(a[15])

So I basically split my dict into one dict per country.
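
The post stops before the output step, so the following is only a sketch of how the four per-country dicts might be written out in the Date,Italy,Sweden,Germany,France layout from the question. The function name `write_wide_csv` and the sample values are hypothetical, and dates are assumed to be in `YYYY-MM-DD` form so that sorting them as strings gives chronological order.

```python
import csv
import io

def write_wide_csv(out, per_country):
    """Write one row per date with one column per country to a file-like object."""
    names = list(per_country)
    # Union of all dates seen for any country, in chronological order.
    dates = sorted({d for values in per_country.values() for d in values})
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['date'] + names)
    for d in dates:
        # A date missing for one country falls back to 0.0.
        writer.writerow([d] + [per_country[n].get(d, 0.0) for n in names])

# Made-up sample values standing in for the dicts filled from data.csv.
italy = {'2020-04-01': 13.35, '2020-04-02': 12.6}
sweden = {'2020-04-01': 4.52}
germany = {'2020-04-01': 1.22}
france = {'2020-04-01': 6.74}

buffer = io.StringIO()
write_wide_csv(buffer, {'Italy': italy, 'Sweden': sweden,
                        'Germany': germany, 'France': france})
output = buffer.getvalue()
print(output)
```

To write a real file, pass a handle opened with `open('filtered.csv', 'w', newline='')` (the file name is a placeholder) instead of the `StringIO` buffer.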
