
Fastest / most efficient way to combine heterogeneous csv files in Python

I have 1000 CSV files ranging in size from 8 MB to 17 MB. Each file contains a subset of 6 metrics. Examples are below:

File1 (columns): key, metric1, metric2, metric3
File1 (values):  k1, m1, m2, m3
File2 (columns): key, metric4, metric5, metric6
File2 (values):  k1, m4, m5, m6
File3 (columns): key, metric2, metric4, metric5, metric6
File3 (values):  k2, m2, m4, m5, m6

All the methods I have tried so far combine the files into the output below:

Output (columns): key, metric1, metric2, metric3, metric4, metric5, metric6
Output (values):  k1, m1,   m2,   m3,   null, null, null
                  k1, null, null, null, m4,   m5,   m6
                  k2, null, m2,   null, m4,   m5,   m6

What I really need is to also consolidate rows by the key column:

Output (columns): key, metric1, metric2, metric3, metric4, metric5, metric6
Output (values):  k1, m1,   m2, m3,   m4, m5, m6
                  k2, null, m2, null, m4, m5, m6

I know pandas could do it; however, it may take forever to finish 1000 files.

It's a bit unclear what your format is, but I think this will work:

 import pandas as pd

 # expected_metrics: the full list of metric column names, known ahead of time
 df = pd.DataFrame(columns=expected_metrics)
 for filename in filelist:
     current_data = pd.read_csv(filename, index_col='key')
     current_columns = current_data.columns
     current_row = current_data.index[0]
     # Write this file's metrics into the row for its key
     df.loc[current_row, current_columns] = current_data.iloc[0].values

Notes:

- This requires that you know ahead of time which metrics will be present, so you can initialize expected_metrics. You could instead replace the last line with:

 for column in current_columns:
     df.loc[current_row, column] = current_data[column].iloc[0]

This would probably take longer.

- If a particular (key, metric) combination shows up more than once, only the last one will be recorded.

- The result will have key as the index. If you want it as a data column, you'll have to do df['key'] = df.index (or df.reset_index()).

I wouldn't expect this to take "forever"; a thousand files should take at worst a few minutes unless you have a very large number of metrics.
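As a concrete check, the loop above can be run end to end on the three example files from the question. This is a sketch, not your exact setup: StringIO stands in for the real files, and expected_metrics and the file contents are taken from the example.

```python
import pandas as pd
from io import StringIO

# Stand-ins for the three example files from the question
files = [
    'key,metric1,metric2,metric3\nk1,m1,m2,m3\n',
    'key,metric4,metric5,metric6\nk1,m4,m5,m6\n',
    'key,metric2,metric4,metric5,metric6\nk2,m2,m4,m5,m6\n',
]
expected_metrics = ['metric1', 'metric2', 'metric3',
                    'metric4', 'metric5', 'metric6']

df = pd.DataFrame(columns=expected_metrics)
for text in files:
    current_data = pd.read_csv(StringIO(text), index_col='key')
    # Write this file's metrics into the row for its key
    df.loc[current_data.index[0], current_data.columns] = current_data.iloc[0].values

print(df)
```

With the placeholder strings as values, the resulting frame matches the desired output: one row per key, with null wherever a metric never appeared for that key.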

You could also do:

 data = {}
 for filename in filelist:
     current_data = pd.read_csv(filename, index_col='key')
     current_row = current_data.index[0]
     # Merge this file's metrics into the dict for its key, so files
     # that share a key accumulate rather than overwrite each other
     data.setdefault(current_row, {}).update(
         {column: current_data[column].iloc[0] for column in current_data.columns})

This will give a dictionary where each key is a key from your data, and the value is a dictionary representing the row for that key.
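Once the loop finishes, that dict of dicts converts to the desired frame in a single call. A small sketch with hypothetical data in the same shape:

```python
import pandas as pd

# Hypothetical result of the loop above: one inner dict per key
data = {
    'k1': {'metric1': 'm1', 'metric2': 'm2', 'metric3': 'm3'},
    'k2': {'metric2': 'm2', 'metric4': 'm4'},
}

# Each inner dict becomes a row; metrics missing for a key become NaN
df = pd.DataFrame.from_dict(data, orient='index')
df.index.name = 'key'
```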

EDIT: A third option would be to take the output you already have and do df.groupby(by='key').max(). This will create a dataframe where each entry is the maximum metric value among all the rows with the same key; since max() skips nulls, the single non-null value in each column survives. So, again, if you have only one value for each (key, metric) combination, this should give you what you want. If you have more than one value, all but the largest will be ignored.
