How to convert a csv file into a Python dictionary and sum/operate on values based on certain conditions?

I have a huge csv file containing information about COVID-19 cases and deaths for every single county in the United States.

To give you a general idea of the information contained in this file, here are the first 10 lines of it:

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alabama,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alabama,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alabama,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alabama,01017,3460,113,1720,73,1740,40

Anyway, I want to create a Python dictionary for this data, with each key being a tuple consisting of the state and county names and each value being a list of two integers: the first representing the number of confirmed cases and the second representing the number of confirmed deaths.

Basically, I want output like this:

dic = {("state", "county"): [confirmed_cases, confirmed_deaths]}

How would I generate a Python dictionary like the one above for all the counties in the csv file? Please use csv.reader.

Additionally, I need to find the sum of all the confirmed deaths for a particular state. How would I, for example, sum up the values in 'confirmed_deaths' for all of the rows where 'state' is 'Alabama'?

I also want to find the number of counties in the entire file where the number of confirmed cases falls within a certain interval. How would I go about doing that in Python?

Additionally, I have another csv file with all the population values for each state for the past couple of years. To get an idea of the info this file contains, here are the first 10 lines:

table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts),,,,,,,,,,,,

.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749

The most recent population estimate is represented by the last value for each state. For example, for Alabama, '4903185' is the most recent population estimate.

I want to create a Python dictionary with the keys being strings representing each state name and the values being integers representing only the last number for each state in the csv file.

Basically, {"state": population}

On top of that, I would like to calculate the death density rate for each state by dividing the number of confirmed deaths in each state by the most recent population figure for that state. How would I program that?

And finally, I want to calculate the US death density rate (entire country) by adding all of the values in the confirmed deaths column and dividing that sum by the sum of all of the values in the latest population column. How would I do that?

Thank you!

I highly recommend checking out pandas. It is a wonderful library built around tabular data. It has functions to read a csv and then convert it to JSON, which makes the data easy to manipulate in Python.

It would go something like:

import pandas as pd

df = pd.read_csv("filename.csv")
json_variable = df.to_json()
print(json_variable)

Of course, you can format this with pretty print, but if you already have the CSV, all you should need to do now is access that "json_variable".
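
Since the question asks specifically for csv.reader, here is a minimal sketch of that route without pandas. The helper names (to_int, load_counties, state_confirmed_deaths) are made up for illustration, the sample rows are copied from the question, and treating an empty field as 0 is only an assumption about how missing values should be handled.

```python
import csv
import io

# Sample rows copied from the question. For a real file, pass
# open("filename.csv", newline="") instead (the file name is a placeholder).
sample = """date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
"""


def to_int(value):
    # Assumption: an empty field means "not reported" and is counted as 0.
    return int(value) if value.strip() else 0


def load_counties(lines):
    """Build {(state, county): [confirmed_cases, confirmed_deaths]}."""
    reader = csv.reader(lines)
    header = next(reader)
    # Map column names to positions so the code does not rely on column order.
    col = {name: i for i, name in enumerate(header)}
    result = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        key = (row[col["state"]], row[col["county"]])
        result[key] = [
            to_int(row[col["confirmed_cases"]]),
            to_int(row[col["confirmed_deaths"]]),
        ]
    return result


def state_confirmed_deaths(counties, state):
    """Sum confirmed deaths over all counties of one state."""
    return sum(v[1] for (s, _), v in counties.items() if s == state)


counties = load_counties(io.StringIO(sample))
print(counties[("Alabama", "Autauga")])
print(state_confirmed_deaths(counties, "Alabama"))
```

If the file holds several dates per county, a later row overwrites an earlier one here, so the dictionary ends up with the values from the last row read for each county.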

Better use pandas.DataFrame instead of a dictionary; it will be more useful.

I use io.StringIO only to simulate a file in memory.

I changed a few states in the example data so that it contains more than one state.

Using groupby(['state', 'county']) I could get the expected values and create a dictionary, but it is more useful to use the values directly.

text = '''date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alaska,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alaska,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alaska,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alaska,01017,3460,113,1720,73,1740,40'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text))
#df = pd.read_csv("filename.csv")

for name, group in df.groupby(['state', 'county']):
    print('\n---', name, '---\n')
    #print(group)
    print('confirmed_cases:', group['confirmed_cases'].values[-1])
    print('confirmed_deaths:', group['confirmed_deaths'].values[-1])

Result:

--- ('Alabama', 'Autauga') ---

confirmed_cases: 5557
confirmed_deaths: 85

--- ('Alabama', 'Baldwin') ---

confirmed_cases: 14329
confirmed_deaths: 220

--- ('Alabama', 'Barbour') ---

confirmed_cases: 1225
confirmed_deaths: 37

--- ('Alabama', 'Bibb') ---

confirmed_cases: 2031
confirmed_deaths: 35

--- ('Alabama', 'Blount') ---

confirmed_cases: 4901
confirmed_deaths: 109

--- ('Alaska', 'Bullock') ---

confirmed_cases: 1059
confirmed_deaths: 29

--- ('Alaska', 'Butler') ---

confirmed_cases: 1888
confirmed_deaths: 60

--- ('Alaska', 'Calhoun') ---

confirmed_cases: 10608
confirmed_deaths: 242

--- ('Alaska', 'Chambers') ---

confirmed_cases: 1720
confirmed_deaths: 73
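
If the plain dictionary from the question is still needed, the same values can be collected from the DataFrame. This is only a sketch on a few of the sample rows; it assumes the two columns have no missing values (int() would fail on NaN), and a later row for the same county overwrites an earlier one.

```python
import io

import pandas as pd

# A few rows from the example data above.
text = """date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Bullock,Alaska,01011,1193,39,1059,29,134,10
"""

df = pd.read_csv(io.StringIO(text))

# Keys are (state, county) tuples, values are [confirmed_cases, confirmed_deaths].
# int() converts the NumPy integers to plain Python ints.
dic = {
    (row.state, row.county): [int(row.confirmed_cases), int(row.confirmed_deaths)]
    for row in df.itertuples(index=False)
}

print(dic)
```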

In the same way you can groupby('state') and calculate the sum() of confirmed_deaths:

for name, group in df.groupby('state'):
    print('\n---', name, '---\n')
    print('confirmed_deaths:', group['confirmed_deaths'].sum())

Result:

--- Alabama ---

confirmed_deaths: 486

--- Alaska ---

confirmed_deaths: 404

Or shorter:

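# numeric_only=True keeps only the numeric columns; without it, newer pandas
# versions also "sum" (concatenate) the text columns such as date and county
df_state = df.groupby('state').sum(numeric_only=True)
print(df_state)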

And you have a new DataFrame

         fips  cases  deaths  ...  confirmed_deaths  probable_cases  probable_deaths
state                         ...                                                   
Alabama  5025  37828     632  ...               486            9785              146
Alaska   4056  20859     519  ...               404            5584              115

which uses state as the row index, so you can use it to join with the other CSV.

text2 = '''.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749
'''

import pandas as pd
import io

df2 = pd.read_csv(io.StringIO(text2), header=None)

# convert first column into index - and remove the leading `.`
# regex=False treats '.' as a literal dot instead of "any character"
df2.index = df2[0].str.replace('.', '', regex=False)

# get name of last column
last_column = df2.columns[-1]

print(df2[[last_column]])

Result:

                            12
0                             
Alabama                4903185
Alaska                  731545
Arizona                7278717
Arkansas               3017804
California            39512223
Colorado               5758736
Connecticut            3565287
Delaware                973764
District of Columbia    705749

And join both:

df3 = df_state.join(df2[[last_column]])

print(df3)

to get

         fips  cases  deaths  ...  probable_cases  probable_deaths       12
state                         ...                                          
Alabama  5025  37828     632  ...            9785              146  4903185
Alaska   4056  20859     519  ...            5584              115   731545

And then I can calculate the density for every state:

density = df3['confirmed_deaths']/df3.iloc[:,-1]

print('density:', density)

Result:

Alabama    0.000099
Alaska     0.000552

And the same for the whole country:

density = df3['confirmed_deaths'].sum() / df3.iloc[:,-1].sum()

print('density:', density)

Result:

density: 0.00015794900554241286

With pandas.DataFrame I could do all of this without for-loops.
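
The question also asks how many counties have a confirmed case count within some interval. That can be done without a loop as well; here is a rough sketch. The bounds 1000 and 5000 are arbitrary example values, and it assumes the file has one row per county (with several dates per county you would first reduce to one row each).

```python
import io

import pandas as pd

# A few rows from the example data above.
text = """date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
"""

df = pd.read_csv(io.StringIO(text))

low, high = 1000, 5000  # arbitrary example bounds

# Series.between() includes both end points by default and returns booleans;
# summing the booleans counts the rows that match.
mask = df["confirmed_cases"].between(low, high)
count = int(mask.sum())

print(count)
print(df.loc[mask, ["state", "county", "confirmed_cases"]])
```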


EDIT:

Full example with small changes

text1 = '''date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alaska,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alaska,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alaska,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alaska,01017,3460,113,1720,73,1740,40'''

text2 = '''.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749
'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text1))

# ---

for name, group in df.groupby(['state', 'county']):
    print('\n---', name, '---\n')
    #print(group)
    print('confirmed_cases:', group['confirmed_cases'].values[0])
    print('confirmed_deaths:', group['confirmed_deaths'].values[0])

for name, group in df.groupby('state'):
    print('\n---', name, '---\n')
    print('confirmed_deaths:', group['confirmed_deaths'].sum())

print('\n--- df_state ---\n')
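# numeric_only=True avoids concatenating the text columns in newer pandas versions
df_state = df.groupby('state').sum(numeric_only=True)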
print(df_state)

# ---

df_population = pd.read_csv(io.StringIO(text2), header=None)

# ---

df_population.index = df_population[0].str.replace('.', '', regex=False)
# get last column
last_column = df_population.columns[-1]
# rename last column
df_population = df_population.rename(columns={last_column: 'last_population'})

# ---

print('\n--- df_population ---\n')
print(df_population['last_population'])

df_joined = df_state.join(df_population['last_population'])
print('\n--- df_joined ---\n')
print(df_joined)

density_states = df_joined['confirmed_deaths']/df_joined['last_population']
print('\n--- density states ---\n')
print(density_states)

density_country = df_joined['confirmed_deaths'].sum() / df_joined['last_population'].sum()
print('\n--- density country ---\n')
print(density_country)
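
For completeness, the population dictionary {"state": population} and both density figures can also be built with csv.reader only. This is a sketch under a few assumptions: state rows are taken to be exactly the rows whose first cell starts with a dot, the numbers contain no thousands separators, and the confirmed_deaths totals are simply the two sums computed above for the modified sample data. The function name load_population is made up for illustration.

```python
import csv
import io

# Header line plus two state rows from the question's population file.
population_text = """table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts),,,,,,,,,,,,
.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
"""


def load_population(lines):
    """Build {state: most recent population} from the last column."""
    populations = {}
    for row in csv.reader(lines):
        # Assumption: only state rows start with "."; everything else is skipped.
        if not row or not row[0].startswith("."):
            continue
        populations[row[0].lstrip(".")] = int(row[-1])
    return populations


populations = load_population(io.StringIO(population_text))

# Confirmed deaths per state, taken from the sums shown earlier in this answer.
confirmed_deaths = {"Alabama": 486, "Alaska": 404}

# Density per state, only for states present in both dictionaries.
density_states = {
    state: deaths / populations[state]
    for state, deaths in confirmed_deaths.items()
    if state in populations
}

# Density for all states that appear in both dictionaries together.
density_country = sum(confirmed_deaths[s] for s in density_states) / sum(
    populations[s] for s in density_states
)

print(populations)
print(density_states)
print(density_country)
```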
