How to convert a CSV file to a Python dictionary?

I have a huge CSV file containing information about COVID-19 cases and deaths for every county in the United States.

To give you a general idea of the information contained in this file, here are its first 10 lines:

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alabama,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alabama,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alabama,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alabama,01017,3460,113,1720,73,1740,40

I want to create a Python dictionary for this data, with each key being a tuple of the state and county names and each value being a list of two integers: the first is the number of confirmed cases and the second is the number of confirmed deaths.

Basically, I want output like this:

dic = {("state","county"):[confirmed_cases, confirmed_deaths]}

Please make sure to exclude the header.

How would I generate a Python dictionary like the one above for all the counties in the CSV file? Please use csv.reader.

Additionally, I need to find the sum of all the confirmed deaths for a particular state. How would I, for example, sum up the values in 'confirmed deaths' for all of the rows where 'state' is 'Alabama'?

EDIT: I came up with a solution for the first part of the problem:

mydict = {}

with open(file_path, mode='r') as inp:
    reader = csv.reader(inp)
    next(reader,None)
    mydict = {tuple(row[1:3]):list(row[6:8]) for row in reader}

return mydict

Can you help me figure out how to sum up confirmed deaths in a certain state based on this dictionary?

Try:

import csv

mydict = dict()
with open("test.csv") as inp:
    reader = csv.reader(inp)
    next(reader, None) #skip header
    mydict = {tuple(row[2:0:-1]): list(map(int, row[6:8])) for row in reader}

# total of all confirmed deaths in Alabama
>>> sum(v[1] for k, v in mydict.items() if k[0] == "Alabama")
890

test.csv:
date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alabama,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alabama,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alabama,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alabama,01017,3460,113,1720,73,1740,40
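
The same csv.reader approach can also be sketched as a pair of small functions. This is only an illustration: `load_counts`, `state_deaths`, and `SAMPLE` are hypothetical names, the sample holds just three rows copied from the question, and the code assumes every row has numeric values in the two confirmed columns. Column positions are looked up from the header instead of being hard-coded.

```python
import csv
import io

# Hypothetical in-memory sample: the header plus three rows from the question.
SAMPLE = """date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
"""


def load_counts(fileobj):
    """Return {(state, county): [confirmed_cases, confirmed_deaths]}."""
    reader = csv.reader(fileobj)
    header = next(reader)  # consume the header row so it is excluded
    # Find column positions by name rather than relying on fixed indices.
    i_state = header.index("state")
    i_county = header.index("county")
    i_cases = header.index("confirmed_cases")
    i_deaths = header.index("confirmed_deaths")
    counts = {}
    for row in reader:
        # Assumption: both fields are non-empty integers in every row.
        counts[(row[i_state], row[i_county])] = [int(row[i_cases]), int(row[i_deaths])]
    return counts


def state_deaths(counts, state):
    """Sum confirmed deaths over every county of one state."""
    return sum(deaths for (st, _county), (_cases, deaths) in counts.items() if st == state)


counts = load_counts(io.StringIO(SAMPLE))
alabama_deaths = state_deaths(counts, "Alabama")
print(alabama_deaths)
```

For a real file, `load_counts(open(path, newline=""))` inside a `with` block would replace the `io.StringIO` call.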

I think pandas is the most appropriate solution:

import pandas as pd

df = pd.read_csv(file_path)
# Key each row by (state, county), as the question asks; avoid shadowing the built-in dict
result = df.set_index(['state', 'county'])[['confirmed_cases', 'confirmed_deaths']].apply(tuple, axis=1).to_dict()
print(result)

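EDIT

For the sum part:

# Select several columns with a list (double brackets); recent pandas versions reject a bare tuple of names
totals = df.groupby('state', as_index=False)[['confirmed_cases', 'confirmed_deaths']].sum()
print(totals)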

I would actually do it a different way, which is a bit more verbose but more readable for someone looking at the code.

import csv
from collections import namedtuple

County = namedtuple("County", ["name", "cases", "deaths"])
states = {}  # maps state name -> list of County records

# file_path is the path to the CSV file, as in the question
with open(file_path, newline="") as data:
    reader = csv.DictReader(data)
    for row in reader:
        state = row["state"]
        county = row["county"]
        record = County(county, int(row["confirmed_cases"]), int(row["confirmed_deaths"]))
        if state in states:
            states[state].append(record)
        else:
            states[state] = [record]

{'Alabama': [County(name='Autauga', cases=5557, deaths=85),
  County(name='Baldwin', cases=14329, deaths=220),
  County(name='Barbour', cases=1225, deaths=37),
  County(name='Bibb', cases=2031, deaths=35),
  County(name='Blount', cases=4901, deaths=109),
  County(name='Bullock', cases=1059, deaths=29),
  County(name='Butler', cases=1888, deaths=60),
  County(name='Calhoun', cases=10608, deaths=242),
  County(name='Chambers', cases=1720, deaths=73)]}

sum(county.deaths for county in states["Alabama"])
>> 890

It will be easier to manage your code if you keep the key simple, in this case just the state. This will also be quicker if your data is larger, since we won't have to iterate over tuple keys in the dictionary to grab the state we want.
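
The difference between the two key layouts can be sketched as follows. This is a minimal illustration, not a benchmark: `by_county` and `by_state` are hypothetical names, the two Alabama entries come from the sample rows, and the third entry is a made-up placeholder.

```python
from collections import defaultdict

# Tuple-keyed dictionary in the shape the question asks for.
by_county = {
    ("Alabama", "Autauga"): [5557, 85],
    ("Alabama", "Baldwin"): [14329, 220],
    ("OtherState", "OtherCounty"): [10, 1],  # made-up placeholder, not real data
}

# Tuple keys: every key in the dictionary is inspected to find one state.
scan_total = sum(v[1] for k, v in by_county.items() if k[0] == "Alabama")

# State keys: regroup once, then each state is a single dictionary lookup.
by_state = defaultdict(list)
for (state, county), (cases, deaths) in by_county.items():
    by_state[state].append((county, cases, deaths))

lookup_total = sum(deaths for _county, _cases, deaths in by_state["Alabama"])
print(scan_total, lookup_total)
```

Both totals agree; the second form only touches the rows of the requested state once the grouping exists.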

I would suggest a dictionary of dictionaries. It then becomes easier to get totals by state or county. I used 2 dictionaries: dd is the dict of dicts, and deaths is a dictionary built from dd to get the deaths by state.

import csv
from collections import defaultdict

dd = defaultdict(dict)

with open('csv_11_09_01.csv', 'r') as f:
    reader = csv.DictReader(f)
    for record in reader:
        dd[ record['state'] ][ record['county'] ] = \
           [
               int(record['confirmed_cases']),
               int(record['confirmed_deaths'])
            ]

deaths = defaultdict(int)
for state, county in dd.items():
    deaths[state] += sum((confirmed_deaths
                          for _, confirmed_deaths in county.values()))

print('Alabama confirmed_deaths:', deaths['Alabama'])

total_confirmed = 0
for county in dd.values():
    for confirmed_cases, _ in county.values():
        if 1000 < confirmed_cases < 5000:
            total_confirmed += 1
    
print('number of counties in US with', 
      'confirmed cases between 1000 and 5000:', total_confirmed)
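
As a rough sketch of the "totals by state or county" point, the nested layout can be queried as shown here. The names `SAMPLE`, `autauga_cases`, `autauga_deaths`, and `totals` are illustrative assumptions, and the data is limited to three rows from the question, read from memory rather than from a file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory sample: the header plus three rows from the question.
SAMPLE = """date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
"""

# Build the same state -> county -> [cases, deaths] structure as above.
dd = defaultdict(dict)
for record in csv.DictReader(io.StringIO(SAMPLE)):
    dd[record["state"]][record["county"]] = [
        int(record["confirmed_cases"]),
        int(record["confirmed_deaths"]),
    ]

# One county: two plain dictionary lookups.
autauga_cases, autauga_deaths = dd["Alabama"]["Autauga"]

# Every state: [total confirmed cases, total confirmed deaths].
totals = {
    state: [
        sum(cases for cases, _ in counties.values()),
        sum(deaths for _, deaths in counties.values()),
    ]
    for state, counties in dd.items()
}
print(autauga_cases, autauga_deaths, totals["Alabama"])
```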
