
Most Pythonic way to read CSV values into dict of lists

I have a CSV file with headers at the top of columns of data, like this:

a,b,c
1,2,3
4,5,6
7,8,9

and I need to read it into a dict of lists:

desired_result = {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}

When reading this with DictReader I am using a nested loop to append the items to the lists:

import csv

f = 'path_to_some_csv_file.csv'
dr = csv.DictReader(open(f))
dict_of_lists = next(dr)  # first data row, as a dict
for k in dict_of_lists.keys():
    dict_of_lists[k] = [dict_of_lists[k]]
for line in dr:
    for k in dict_of_lists.keys():
        dict_of_lists[k].append(line[k])

The first loop wraps each value from the first row in a one-element list. The next one loops over every remaining line read in from the CSV file, from which DictReader creates a dict of key-values. The inner loop appends the value to the list matching the corresponding key, so I wind up with the desired dict of lists. I end up having to write this fairly often.

My question is, is there a more Pythonic way of doing this using built-in functions without the nested loop, or a better idiom, or an alternative way to store this data structure such that I can return an indexable list by querying with a key? If so, is there also a way to format the data being ingested by column upfront?

Depending on what type of data you're storing and if you're ok with using numpy, a good way to do this can be with numpy.genfromtxt:

import numpy as np
data = np.genfromtxt('data.csv', delimiter=',', names=True)

What this will do is create a numpy Structured Array, which provides a nice interface for querying the data by header name (make sure to use names=True if you have a header row).

Example, given data.csv containing:

a,b,c
1,2,3
4,5,6
7,8,9

You can then access elements with:

>>> data['a']        # Column with header 'a'
array([ 1.,  4.,  7.])
>>> data[0]          # First row
(1.0, 2.0, 3.0)
>>> data['c'][2]     # Specific element
9.0
>>> data[['a', 'c']] # Two columns
array([(1.0, 3.0), (4.0, 6.0), (7.0, 9.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])

genfromtxt also provides a way, as you requested, to "format the data being ingested by column up front."

converters : variable, optional

The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing data: converters = {3: lambda s: float(s or 0)}.

If you're willing to use a third-party library, then the merge_with function from Toolz makes this whole operation a one-liner:

from toolz import merge_with
dict_of_lists = merge_with(list, *csv.DictReader(open(f)))

Using only the stdlib, a defaultdict makes the code less repetitive:

from collections import defaultdict
import csv

f = 'test.csv'

dict_of_lists = defaultdict(list)
for record in csv.DictReader(open(f)):
    for key, val in record.items():    # or iteritems in Python 2
        dict_of_lists[key].append(val)

If you need to do this often, factor it out into a function, e.g. transpose_csv.
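One possible shape for such a helper is sketched below. The name transpose_csv comes from the sentence above, but the signature (taking an already-open file object) is an assumption made for illustration, and the in-memory sample stands in for a real file:

```python
import csv
import io
from collections import defaultdict


def transpose_csv(fileobj):
    """Read CSV rows from an open file object into a dict of column lists."""
    columns = defaultdict(list)
    for record in csv.DictReader(fileobj):
        for key, val in record.items():
            columns[key].append(val)
    return dict(columns)  # plain dict, so unknown keys raise KeyError


# In-memory stand-in for a real file; csv returns every value as a string.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")
dict_of_lists = transpose_csv(sample)
print(dict_of_lists["a"])  # ['1', '4', '7']
```

Accepting a file object rather than a path keeps the helper easy to test and leaves opening and closing the file to the caller.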

Nothing wrong with ford's answer, I'll just add mine here (which makes use of the csv library):

import csv

with open(f, 'r', encoding='latin1') as csvf:
    dialect = csv.Sniffer().sniff(csvf.readline())  # finds the delimiters automatically
    csvf.seek(0)
    # read file with dialect
    rdlistcsv = csv.reader(csvf, dialect)
    # save to list of rows (filter drops empty fields)
    rowslist = [list(filter(None, line)) for line in rdlistcsv]
    header = rowslist[0]
    data = {}
    for i, key in enumerate(header):
        # skip the header row so each list holds only data values
        ilist = [row[i] for row in rowslist[1:]]
        data.update({key: ilist})
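The column-by-column indexing above can also be written as a transposition with zip. This is only a sketch, not part of the original answer: it assumes every row has as many fields as the header (zip silently truncates to the shortest row otherwise) and uses an in-memory sample in place of the opened file:

```python
import csv
import io

# In-memory stand-in for the opened CSV file.
csvf = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")

rows = csv.reader(csvf)
header = next(rows)        # first row holds the column names
columns = zip(*rows)       # transpose the remaining rows into columns
data = {key: list(col) for key, col in zip(header, columns)}
print(data)
# {'a': ['1', '4', '7'], 'b': ['2', '5', '8'], 'c': ['3', '6', '9']}
```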

EDIT: actually, if you do not mind using pandas, things get way easier with it:

  1. import pandas

    import pandas as pd
  2. import the file and save it as a pandas dataframe

    df = pd.read_csv(inputfile)
  3. turn df into a dictionary

    mydict = df.to_dict(orient='list')

This way you use the csv header to define the keys, and for each key you have a list of elements (something like an Excel column turned into a list).

You can use dict and set comprehensions to make your intent more obvious:

dr = csv.DictReader(f)                                 # f is an open file object
data = {k: [v] for k, v in next(dr).items()}           # create the initial dict of lists
for line_dict in dr:
    {data[k].append(v) for k, v in line_dict.items()}  # append to each

You can use Alex Martelli's method to flatten a list of lists in Python to flatten an iterator of iterators, which further reduces the first form to:

dr = csv.DictReader(f)
data = {k: [v] for k, v in next(dr).items()}
{data[k].append(v) for line_dict in dr for k, v in line_dict.items()}

On Python 2.X, consider using {}.iteritems() rather than {}.items() if your csv file is sizable.


Further example:

Assume this csv file:

Header 1,Header 2,Header 3
1,2,3
4,5,6
7,8,9

Now suppose you want a dict of lists with each value converted to a float or int. You can do:

def convert(s, converter):
    try:
        return converter(s)
    except Exception:
        return s    

dr = csv.DictReader(f)
data = {k: [convert(v, float)] for k, v in next(dr).items()}
{data[k].append(convert(v, float)) for line_dict in dr for k, v in line_dict.items()}

print(data)
# {'Header 1': [1.0, 4.0, 7.0], 'Header 2': [2.0, 5.0, 8.0], 'Header 3': [3.0, 6.0, 9.0]}
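If different columns need different types, the same convert helper can be driven by a per-column mapping. The sketch below is an illustration only: the converters dict and its column choices are hypothetical, the sample data is made up to show the fallback, and it assumes no row has more fields than the header:

```python
import csv
import io

# Hypothetical per-column converters; columns not listed stay as strings.
converters = {"Header 1": int, "Header 2": float}


def convert(s, converter):
    try:
        return converter(s)
    except (TypeError, ValueError):
        return s  # leave the value unchanged if it cannot be converted


# In-memory stand-in for the csv file, with one unparseable value.
f = io.StringIO("Header 1,Header 2,Header 3\n1,2,3\n4,5,6\n7,n/a,9\n")
dr = csv.DictReader(f)
data = {name: [] for name in dr.fieldnames}  # one empty list per column
for line_dict in dr:
    for k, v in line_dict.items():
        data[k].append(convert(v, converters.get(k, str)))

print(data)
# {'Header 1': [1, 4, 7], 'Header 2': [2.0, 5.0, 'n/a'], 'Header 3': ['3', '6', '9']}
```

Starting from dr.fieldnames instead of the first data row means a file with only a header still yields a dict with an empty list per column.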


 