Python - Sorting CSV data from multiple files based on columns

Question

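I have a folder containing multiple files, and the files have different numbers of columns. I want to walk through the directory, open each file, loop over every line, and write each line to a new CSV file according to the number of columns in that line. In the end I want one big CSV with all the rows that have 14 columns, another big CSV with all the rows that have 18 columns, and a last CSV with all the remaining rows.

This is what I have so far.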

import pandas as pd
import glob
import os
import csv


path = r'C:\Users\Vladimir\Documents\projects\ETLassig\W3SVC2'
all_files = glob.glob(os.path.join(path, "*.log")) 

for file in all_files:
    for line in file:
        if len(line.split()) == 14:
            with open('c14.csv', 'wb') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=' ')
                csvwriter.writerow([line])
        elif len(line.split()) == 18:
            with open('c14.csv', 'wb') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=' ')
                csvwriter.writerow([line])          
            #open 18.csv
        else:
            with open('misc.csv', 'wb') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=' ')
                csvwriter.writerow([line])
print(c14.csv)

Can anyone give me some feedback on how to approach this?

Answer 1

You can collect every line, split into its columns, as a list of lists:

l = []
for file in all_files:  # the list of paths built with glob in the question
    with open(file, 'r') as f:
        for line in f:
            # The last field keeps the trailing newline of the line
            l.append(line.split(" "))

Now you have a list of lists, so just sort it by the length of the sublists and write it to a new file:

l.sort(key=len)

with open(outputfile, 'w') as out:
    # Write the lines here as you want, e.g. re-joined with single spaces
    for fields in l:
        out.write(" ".join(fields))
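Sorting puts rows of equal length next to each other, but the question asks for three separate files, so the rows still have to be grouped. The following is a minimal sketch of one way to do that, not part of the original answer. The helper names `bucket_name`, `group_rows` and `write_groups` are hypothetical, `c18.csv` is an assumed name for the 18-column file, and the rows are assumed to come from `line.split(" ")`, so the last field of each row still carries its newline.

```python
from collections import defaultdict


def bucket_name(fields):
    """Return the output file name for one split line (hypothetical naming)."""
    n = len(fields)
    if n == 14:
        return "c14.csv"
    if n == 18:
        return "c18.csv"
    return "misc.csv"


def group_rows(rows):
    """Group split lines by the file they belong in, keeping input order."""
    groups = defaultdict(list)
    for fields in rows:
        groups[bucket_name(fields)].append(fields)
    return dict(groups)


def write_groups(groups):
    """Write each group to its own file, re-joining the fields with spaces."""
    for name, rows in groups.items():
        with open(name, "w") as out:
            for fields in rows:
                # The last field already ends with "\n" if it came from split(" ")
                out.write(" ".join(fields))
```

Calling `write_groups(group_rows(l))` would then replace the single `outputfile`. Because grouping does not depend on order, the `l.sort(key=len)` step becomes optional with this variant.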

Answer 2

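First of all, note that you can copy the lines as they are from the input files to the output; there is no need for the CSV machinery.

That said, I suggest using a dictionary of file objects and the dictionary's get method, which lets you specify a default value.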

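# Open the outputs in text mode ('w'): the lines read below are str, not bytes
files = {14: open('14.csv', 'w'),
         18: open('18.csv', 'w')}
other = open('other.csv', 'w')

for file in all_files:
    with open(file) as f:
        for line in f:
            llen = len(line.split())
            # get() returns `other` when llen is neither 14 nor 18
            target = files.get(llen, other)
            target.write(line)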
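The loop above never closes its output files. As a hedged sketch of the same dictionary-plus-default idea with explicit cleanup, the files can be registered with `contextlib.ExitStack`. The function name `split_by_field_count` and the `out_dir` parameter are assumptions introduced for illustration; they do not appear in the original answer.

```python
import os
from contextlib import ExitStack


def split_by_field_count(paths, out_dir="."):
    """Copy each line to 14.csv, 18.csv or other.csv by its field count."""
    with ExitStack() as stack:
        # enter_context registers each file so it is closed when the block exits
        files = {
            n: stack.enter_context(open(os.path.join(out_dir, f"{n}.csv"), "w"))
            for n in (14, 18)
        }
        other = stack.enter_context(open(os.path.join(out_dir, "other.csv"), "w"))
        for path in paths:
            with open(path) as src:
                for line in src:
                    # dict.get falls back to `other` for any other field count
                    files.get(len(line.split()), other).write(line)
```

With the `all_files` list from the question, this would be called as `split_by_field_count(all_files)`.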

If you have to process a few million records, take note, because

In [20]: a = 'a '*20                                                                      

In [21]: %timeit len(a.split())                                                           
599 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [22]: %timeit a.count(' ')+1                                                           
328 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

you should replace the for loop above with

for file in all_files:
    for line in open(file):
        fields_count = line.count(' ')+1
        target = files.get(fields_count, other)
        target.write(line)

You should, because even though we are talking about nanoseconds, file system access is in the same ballpark

In [23]: f = open('dele000', 'w')                                                         

In [24]: %timeit f.write(a)                                                               
508 ns ± 154 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

as the split/count.
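One caveat about the faster variant: `line.count(' ') + 1` matches `len(line.split())` only when fields are separated by exactly one space and the line has no leading or trailing spaces. The small comparison below illustrates this; the helper names `fields_by_split` and `fields_by_count` are hypothetical and the sample lines are made up.

```python
def fields_by_split(line):
    """Field count as in the first loop: split on any run of whitespace."""
    return len(line.split())


def fields_by_count(line):
    """Field count as in the faster loop: number of spaces plus one."""
    return line.count(' ') + 1


samples = [
    "a b c\n",    # single spaces only
    "a  b c\n",   # a doubled space
    "a b c \n",   # a trailing space
    "\n",         # an empty line
]

for line in samples:
    print(repr(line), fields_by_split(line), fields_by_count(line))
```

Only the first sample gives the same count with both functions. The benchmark string `'a '*20` also ends in a space, so the two timed expressions return 20 and 21 respectively. If the log files can contain such lines, the `split()` version is the safer choice.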

2 answers

Answer 1
5 2018-04-10 11:31:29

Answer 2
0 2018-12-09 08:59:16
