
Concatenating csv files nicely with python

My program first clusters a big dataset into 100 clusters, then runs a model on each cluster using multiprocessing. My goal is to concatenate all the output values into one big csv file, i.e. the concatenation of the output data from the 100 fitted models.

For now, I am just creating 100 csv files, then looping over the folder containing these files and copying them one by one, line by line, into a big file.

My question: is there a smarter way to get this big output file without exporting 100 files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.
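To make the setup concrete, here is a minimal, hypothetical sketch of the workflow described above, assuming each worker fits its model on one cluster and returns a pandas DataFrame; the names fit_cluster, cluster_frames and big_output.csv are placeholders, not code from the question. The main process collects the 100 results and writes a single csv, with no intermediate files.

from multiprocessing import Pool

import pandas as pd


def fit_cluster(cluster_df):
    # Placeholder: fit a model on one cluster and return its output as a
    # DataFrame with the same columns for every cluster.
    return cluster_df.assign(prediction=0.0)


if __name__ == '__main__':
    # Stand-in for the 100 per-cluster DataFrames produced by the clustering step.
    cluster_frames = [pd.DataFrame({'x': [i]}) for i in range(100)]
    with Pool() as pool:
        results = pool.map(fit_cluster, cluster_frames)
    # One concatenation, one write: no 100 intermediate csv files on disk.
    pd.concat(results, ignore_index=True).to_csv('big_output.csv', index=False)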

If all of your partial csv files have no headers and share the same number and order of columns, you can concatenate them like this:

with open("unified.csv", "w") as unified_csv_file:
    for partial_csv_name in partial_csv_names:
        with open(partial_csv_name) as partial_csv_file:
            unified_csv_file.write(partial_csv_file.read())
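
If the partial files do carry a header row, a small variation (my own sketch, not part of the original answer) keeps the header from the first file only and skips it in the rest:

with open("unified.csv", "w") as unified_csv_file:
    for i, partial_csv_name in enumerate(partial_csv_names):
        with open(partial_csv_name) as partial_csv_file:
            if i > 0:
                next(partial_csv_file)  # skip the repeated header line
            unified_csv_file.write(partial_csv_file.read())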

Have your worker processes return the dataset to the main process rather than writing the csv files themselves; then, as they hand the data back, have the main process write it all into one continuous csv.

from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example.  I realize what it's doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

m = Manager()
d_results = m.dict()

worker_count = 100

jobs = [Process(target=worker_func, args=(proc_id, d_results))
        for proc_id in range(worker_count)]

for j in jobs:
    j.start()

for j in jobs:
    j.join()

with open('somecsv.csv', 'w') as f:
    for d in d_results.values():
        # if the actual conversion function benefits from multiprocessing,
        # you can do that there too instead of here
        for r in convert_dataset_to_csv(d):
            f.write(r + '\n')
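
One caveat with the final loop above (my note, not part of the original answer): the Manager dict is filled in whatever order the workers happen to finish, so d_results.values() is not guaranteed to follow cluster order. If the row order of the unified file matters, iterating the keys in sorted order is a minimal variation:

with open('somecsv.csv', 'w') as f:
    # keys are the integer proc_ids, so sorting restores cluster order
    for proc_id in sorted(d_results.keys()):
        for r in convert_dataset_to_csv(d_results[proc_id]):
            f.write(r + '\n')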

Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm, it's a gem.

#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob

file_list = glob('/home/rolf/*.csv')
print("There are {x} files to be concatenated".format(x=len(file_list)))

with open('concatenated.csv', 'w') as concat_file:
    for n, file_name in enumerate(file_list, start=1):
        with open(file_name) as partial_file:
            concat_file.write(partial_file.read())
        print("files added {n}".format(n=n))
