
Pythonic way to loop over dictionary

I am practicing pandas and have the following task:

Create a list whose elements are the number of columns of each .csv file.

The .csv files are stored in the dictionary directory, keyed by year.

I use a dictionary comprehension, dataframes (again keyed by year), to store the .csv files as pandas dataframes.

directory = {2009: 'path_to_file/data_2009.csv', ... , 2018: 'path_to_file/data_2018.csv'}

dataframes = {year: pandas.read_csv(file) for year, file in directory.items()}

# My Approach 1 
columns = [df.shape[1] for year, df in dataframes.items()]

# My Approach 2
columns = [dataframes[year].shape[1] for year in dataframes]

Which way is more "Pythonic"? Or is there a better way to approach this?

Your method will get it done, but I don't like reading in the entire file and creating a dataframe just to count the columns. You could do the same thing by reading only the first line of each file and counting the commas. Notice that I add 1 because there is always one fewer comma than there are columns.

columns = [open(f).readline().count(',') + 1 for _, f in directory.items()]
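Counting commas assumes that no header field contains a quoted comma. As a minimal sketch, the header line can instead be parsed with the standard csv module. The helper name count_columns is hypothetical, and the files are assumed to be comma-delimited with the header on the first line:

```python
import csv

def count_columns(path):
    """Return the number of fields in the first row of a CSV file."""
    # newline='' is what the csv module expects for files it reads
    with open(path, newline='') as f:
        # Parse only the first row; an empty file yields an empty header
        header = next(csv.reader(f), [])
    return len(header)
```

With the question's dictionary this would be used as columns = [count_columns(f) for f in directory.values()]. The with block also closes each file once its header has been read.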

Your Approach 2:

columns = [dataframes[year].shape[1] for year in dataframes]

is more Pythonic and concise, and it fits well with later use of dataframes for merging, plotting, manipulating, etc., since the keys are implied in the comprehension and shape gives the number of columns.

You could use:

columns = [len(dataframe.columns) for dataframe in dataframes.values()]

As @piRSquared mentioned, if your only objective is to get the number of columns in the dataframe, you shouldn't read the entire csv file. Instead, use the nrows keyword argument of the read_csv function.
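A minimal sketch of the nrows idea, assuming each file has a header row. The helper name count_columns_pandas is hypothetical. Passing nrows=0 asks read_csv to parse the header and no data rows, so the resulting dataframe is empty but still has one column per header field:

```python
import pandas as pd

def count_columns_pandas(path):
    """Count the columns of a CSV file without loading its data rows."""
    # nrows=0 parses only the header, giving an empty dataframe
    # whose columns come from the header line
    header_only = pd.read_csv(path, nrows=0)
    return header_only.shape[1]
```

With the question's dictionary this would be used as columns = [count_columns_pandas(f) for f in directory.values()].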

import os
# List the files under a directory; filter the result if other files are present
target_files = os.listdir('path_to_file/')
columns = list()
for filename in target_files:
    # In your scenario @piRSquared's answer would be more efficient.
    path = os.path.join('path_to_file/', filename)
    with open(path) as f:
        columns.append(f.readline().count(',') + 1)

If you want the columns keyed by the year taken from the filename, you can strip the non-digit characters from the filename and use the result as the dictionary key, like this:

import re
year = re.sub(r'[^0-9]', '', filename)
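As a rough sketch of how a year-keyed dictionary might be built from a list of filenames: the helper names year_from_filename and key_by_year are hypothetical, and the code assumes the year is the first run of four digits in each filename, as in data_2009.csv.

```python
import re

def year_from_filename(filename):
    """Return the first run of four digits in filename as an int, or None."""
    match = re.search(r'\d{4}', filename)
    return int(match.group()) if match else None

def key_by_year(filenames):
    """Map each year found in a filename to that filename."""
    result = {}
    for name in filenames:
        year = year_from_filename(name)
        # Skip files whose names carry no four-digit year
        if year is not None:
            result[year] = name
    return result
```

Calling key_by_year(os.listdir('path_to_file/')) would give a dictionary shaped like directory in the question, apart from the folder prefix on each path.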
