Reading a CSV file into a dictionary?

First, I want to start off by saying I am NOT asking you to write code. I only want to discuss and get feedback on the best way to go about writing this program, because I am stuck on figuring out how to break down the problem.

My program is supposed to open a CSV file which contains 7 columns:

Name of the state,Crop,Crop title,Variety,Year,Unit,Value. 

Here is part of the file:

Indiana,Corn,Genetically engineered (GE) corn,Stacked gene varieties,2012,Percent of all corn planted,60
Indiana,Corn,Genetically engineered (GE) corn,Stacked gene varieties,2013,Percent of all corn planted,73
Indiana,Corn,Genetically engineered (GE) corn,Stacked gene varieties,2014,Percent of all corn planted,78
Indiana,Corn,Genetically engineered (GE) corn,Stacked gene varieties,2015,Percent of all corn planted,76
Indiana,Corn,Genetically engineered (GE) corn,Stacked gene varieties,2016,Percent of all corn planted,75
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2000,Percent of all corn planted,11
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2001,Percent of all corn planted,12
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2002,Percent of all corn planted,13
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2003,Percent of all corn planted,16
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2004,Percent of all corn planted,21
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2005,Percent of all corn planted,26
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2006,Percent of all corn planted,40
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2007,Percent of all corn planted,59
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2008,Percent of all corn planted,78
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2009,Percent of all corn planted,79
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2010,Percent of all corn planted,83
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2011,Percent of all corn planted,85
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2012,Percent of all corn planted,84
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2013,Percent of all corn planted,85
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2014,Percent of all corn planted,88
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2015,Percent of all corn planted,88
Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2016,Percent of all corn planted,86

Then read each line into a dictionary. There are many, many lines in this text file; the only lines I want/need are the lines whose Variety column reads "All GE varieties". Please note each state also has multiple lines. The next step is to take a crop as user input and examine only the data for that crop. The final step is to figure out, for each state, the max and min value and the corresponding year, and print them.

The way I was thinking of going about this was possibly creating a set for each line, checking whether "All GE varieties" is in the set, and if it is, adding that line to a dictionary. And then do something similar for the crop?

My biggest dilemma is probably that 1.) I don't know how to go about ignoring lines that don't contain "All GE varieties". Do I do that before or after I create the dictionary? And 2.) I know how to create a dictionary with one value and one key, but how would I go about adding the rest of the values to the key? Do you do that with sets? Or lists?

Figuring out if "All GE varieties" is in the string is relatively straightforward - use the in keyword:

with open(datafile, 'r') as infile:
    for line in infile:
        if "All GE varieties" in line:
            # put the line into a data structure
            pass

For the data structure, I'm partial to lists of dictionaries, where each dictionary has a defined set of keys:

myList = [ {}, {}, {}, ... ]

The problem in this case is I'm not sure what you would use as the key, if each field is a value. Also remember the split() method can help:

varieties = []
with open(datafile, 'r') as infile:
    for line in infile:
        if "All GE varieties" in line:
            # strip() drops the trailing newline before splitting into fields
            varieties.append(line.strip().split(','))

This would give you a list (varieties) containing lists, each of which holds the individual fields from one line.

Something like this:

varieties = [['Indiana','Corn','Genetically engineered (GE) corn','All GE varieties','2000','Percent of all corn planted','11'], ['Indiana','Corn','Genetically engineered (GE) corn','All GE varieties','2001','Percent of all corn planted','12'], ... ]

From here it would be fairly easy to pick out the state, year, etc. by indexing into the nested list (as with a 2D array).
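As a small sketch of that indexing, assuming the column order given in the question: the two sample rows and the names FIELDS, YEAR and VALUE are only illustrative and not part of the original answer. The last step shows one possible answer to the "what would you use as the key" question, namely using the column names as keys.

```python
# Two rows in the same shape as the `varieties` list above (illustrative sample).
varieties = [
    ['Indiana', 'Corn', 'Genetically engineered (GE) corn', 'All GE varieties',
     '2000', 'Percent of all corn planted', '11'],
    ['Indiana', 'Corn', 'Genetically engineered (GE) corn', 'All GE varieties',
     '2001', 'Percent of all corn planted', '12'],
]

# Column names and positions, assumed from the header order in the question.
FIELDS = ['Name', 'Crop', 'Title', 'Variety', 'Year', 'Unit', 'Value']
YEAR, VALUE = FIELDS.index('Year'), FIELDS.index('Value')

# Pick one column out of every row.
years = [row[YEAR] for row in varieties]

# Fields are strings after split(), so convert before comparing numerically.
highest = max(varieties, key=lambda row: int(row[VALUE]))
lowest = min(varieties, key=lambda row: int(row[VALUE]))

# Optional: turn each inner list into a dictionary keyed by column name.
records = [dict(zip(FIELDS, row)) for row in varieties]

print(years)
print('max', highest[VALUE], highest[YEAR])
print('min', lowest[VALUE], lowest[YEAR])
print(records[0]['Name'], records[0]['Year'], records[0]['Value'])
```

Indexing by position is compact, while the dictionary form is easier to read once there are several columns to keep track of.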

As previously mentioned, you can use the csv module to read in the csv file. I wasn't exactly sure how you wanted the data structured after the state key, but I thought it might be nicer to be able to look up each particular crop_title and then access the value for each year separately.

In[33]: from collections import defaultdict
   ...: from csv import reader
   ...: 
   ...: crops = defaultdict(lambda: defaultdict(dict))
   ...: with open('hmm.csv', 'r') as csvfile:
   ...:     cropreader = reader(csvfile)
   ...:     for row in cropreader:
   ...:         state, crop_type, crop_title, variety, year, unit, value = row
   ...:         if variety == 'All GE varieties':
   ...:             crops[state][crop_title][year] = value
   ...: 
In[34]: crops
Out[34]: 
defaultdict(<function __main__.<lambda>>,
            {'Indiana': defaultdict(dict,
                         {'Genetically engineered (GE) corn': {'2000': '11',
                           '2001': '12',
                           '2002': '13',
                           '2003': '16',
                           '2004': '21',
                           '2005': '26',
                           '2006': '40',
                           '2007': '59',
                           '2008': '78',
                           '2009': '79',
                           '2010': '83',
                           '2011': '85',
                           '2012': '84',
                           '2013': '85',
                           '2014': '88',
                           '2015': '88',
                           '2016': '86'}})})
In[35]: crops['Indiana']['Genetically engineered (GE) corn']['2000']
Out[35]: '11'
In[36]: crops['Indiana']['Genetically engineered (GE) corn']['2015']
Out[36]: '88'

You could also convert year and value into integers like this: crops[state][crop_title][int(year)] = int(value), which would allow you to make calls like this (where the return value is an integer):

In[38]: crops['Indiana']['Genetically engineered (GE) corn'][2015]
Out[38]: 88
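With integer years and values, the remaining steps from the question (restricting to one crop and finding each state's minimum and maximum) could be sketched roughly as follows. This is not part of the original answer: the inlined sample rows, the wanted_crop variable standing in for user input, and the results dictionary are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# A few rows in the question's format, inlined so the sketch runs on its own.
sample = io.StringIO(
    "Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2000,Percent of all corn planted,11\n"
    "Indiana,Corn,Genetically engineered (GE) corn,All GE varieties,2014,Percent of all corn planted,88\n"
    "Indiana,Corn,Genetically engineered (GE) corn,Stacked gene varieties,2014,Percent of all corn planted,78\n"
)

wanted_crop = 'Corn'  # hypothetical stand-in for the crop typed by the user

crops = defaultdict(lambda: defaultdict(dict))
for state, crop_type, crop_title, variety, year, unit, value in csv.reader(sample):
    # Keep only the requested crop and the 'All GE varieties' rows.
    if variety == 'All GE varieties' and crop_type == wanted_crop:
        crops[state][crop_title][int(year)] = int(value)

results = {}
for state, titles in crops.items():
    for title, by_year in titles.items():
        # by_year maps year -> value, so by_year.get ranks the years by value.
        min_year = min(by_year, key=by_year.get)
        max_year = max(by_year, key=by_year.get)
        results[state, title] = (by_year[min_year], min_year,
                                 by_year[max_year], max_year)
        print(state, title)
        print('min', by_year[min_year], min_year)
        print('max', by_year[max_year], max_year)
```

Passing by_year.get as the key makes min() and max() return the year whose value is smallest or largest, so the year and its value stay paired.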

I put your data into a file named "crop_data.csv". Here's some code that uses the standard csv module to read each line into its own dictionary. We use a simple if test to make sure we only keep lines where 'Variety' == 'All GE varieties', and we store the data for each state in all_data, which is a dictionary of lists, one list per state. Since the state 'Name' is used as the key in all_data, we don't need to keep it in the row dict; similarly, we can discard the 'Variety', since we don't need that info anymore.

After all the data is gathered we can print it nicely using the json module.

Then we loop over all_data, state by state, and find each state's minimum and maximum.

import csv
from collections import defaultdict
import json

filename = 'crop_data.csv'

fieldnames = 'Name,Crop,Title,Variety,Year,Unit,Value'.split(',')

all_data = defaultdict(list)

with open(filename) as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    for row in reader:
        # We only want 'All GE varieties'
        if row['Variety'] == 'All GE varieties':
            state = row['Name']
            # Get rid of unneeded fields
            del row['Name'], row['Variety']
            # Store it as a plain dict
            all_data[state].append(dict(row))

# Show all the data
print(json.dumps(all_data, indent=4))

#Find minimums & maximums

# Extract the 'Value' field from dict d and convert it to a number
def value_key(d):
    return int(d['Value'])

for state, data in all_data.items():
    print(state)
    row = min(data, key=value_key)
    print('min', row['Value'], row['Year'])

    row = max(data, key=value_key)
    print('max', row['Value'], row['Year'])

output

{
    "Indiana": [
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2000",
            "Unit": "Percent of all corn planted",
            "Value": "11"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2001",
            "Unit": "Percent of all corn planted",
            "Value": "12"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2002",
            "Unit": "Percent of all corn planted",
            "Value": "13"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2003",
            "Unit": "Percent of all corn planted",
            "Value": "16"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2004",
            "Unit": "Percent of all corn planted",
            "Value": "21"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2005",
            "Unit": "Percent of all corn planted",
            "Value": "26"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2006",
            "Unit": "Percent of all corn planted",
            "Value": "40"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2007",
            "Unit": "Percent of all corn planted",
            "Value": "59"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2008",
            "Unit": "Percent of all corn planted",
            "Value": "78"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2009",
            "Unit": "Percent of all corn planted",
            "Value": "79"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2010",
            "Unit": "Percent of all corn planted",
            "Value": "83"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2011",
            "Unit": "Percent of all corn planted",
            "Value": "85"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2012",
            "Unit": "Percent of all corn planted",
            "Value": "84"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2013",
            "Unit": "Percent of all corn planted",
            "Value": "85"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2014",
            "Unit": "Percent of all corn planted",
            "Value": "88"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2015",
            "Unit": "Percent of all corn planted",
            "Value": "88"
        },
        {
            "Crop": "Corn",
            "Title": "Genetically engineered (GE) corn",
            "Year": "2016",
            "Unit": "Percent of all corn planted",
            "Value": "86"
        }
    ]
}
Indiana
min 11 2000
max 88 2014

Note that in this data there are 2 years with the value of 88; max returns the first one it encounters, which is why 2014 is printed. You could use a fancier key function than value_key if you want to break ties by year. Or you can use value_key to sort the whole state data list, so you can easily extract all the lowest and highest records. E.g., inside that for state, data loop, do

data.sort(key=value_key)
print(json.dumps(data, indent=4))

and it will print all the records for that state in numerical order.
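For the tie-breaking idea, one possible "fancier" key is a tuple of value and year. The sketch below is only an illustration under assumptions: value_then_year is a hypothetical name, and the short data list stands in for one state's list from all_data.

```python
# Illustrative records in the same shape as one state's list in all_data.
data = [
    {'Year': '2013', 'Value': '85'},
    {'Year': '2014', 'Value': '88'},
    {'Year': '2015', 'Value': '88'},
    {'Year': '2000', 'Value': '11'},
]

def value_key(d):
    return int(d['Value'])

# Hypothetical key: compare by value first, then by year, so that among
# records with equal values max() picks the latest year.
def value_then_year(d):
    return int(d['Value']), int(d['Year'])

latest_max = max(data, key=value_then_year)

# Alternatively, keep every record that shares the highest value.
top = value_key(max(data, key=value_key))
all_max_years = [d['Year'] for d in data if value_key(d) == top]

print('max', latest_max['Value'], latest_max['Year'])
print('all years with max', all_max_years)
```

Tuples compare element by element, so the year only matters when two values are equal.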
