CSV parsing in Python

I want to parse a CSV file that is in the following format:

Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3

Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3

Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3

Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3

and would like to turn it into a tab-separated format like the following:

TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3

TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3


TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3

TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3

The number of TestAttributes varies from test to test: some tests have only 3 values, others 7, and so on. Also, as in the TestName4 example, some tests are executed more than once, so each execution has its own TestAttributeValue line. (In the example, TestName4 is executed 3 times, hence the 3 value lines.)

I am new to Python and do not know much yet, but I would like to parse the CSV file with Python. I looked at Python's 'csv' library and could not tell whether it will be enough for this, or whether I should write my own string parser. Could you please help me?

Best

The following does what you want and reads only one section at a time, which saves memory on a large file. Replace in_path and out_path with the input and output file paths respectively:

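import csv

in_path = 'input.csv'    # placeholder: replace with your input file path
out_path = 'output.tsv'  # placeholder: replace with your output file path

def print_section(section, f_out):
    if len(section) > 0:
        # find the length of the longest row in this section
        max_len = max(len(row) for row in section)
        # transpose: output line i holds element i of every row,
        # padded with '' where a row is too short
        for i in range(max_len):
            f_out.write('\t'.join(row[i] if len(row) > i else '' for row in section) + '\n')
        f_out.write('\n')

# csv.reader is not a context manager, so open the file first and wrap it
with open(in_path, 'r', newline='') as f_src, open(out_path, 'w') as f_out:
    f_in = csv.reader(f_src)
    # skip the leading info line
    next(f_in, None)
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        elif line:
            # add non-blank line to section (blank lines parse as [])
            section.append(line)
    # print the last section
    print_section(section, f_out)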

Note that you'll want to change 'Test' in the line[0] == 'Test' comparison to whatever word actually marks the header line in your file.
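One way to make that marker easy to change is to pull the header test into a small function. This is only a sketch: the name is_section_header and its marker parameter are made up for illustration and are not part of the code above.

```python
def is_section_header(row, marker='Test'):
    """Return True if a parsed CSV row looks like 'marker,Name,' (trailing empty field)."""
    return len(row) == 3 and row[0] == marker and row[2] == ''

# A header row from the question's format, and an ordinary attribute row
print(is_section_header(['Test', 'TestName1', '']))
print(is_section_header(['TestAttribute1-1', 'TestAttribute1-2', 'TestAttribute1-3']))
# Same check with a different (hypothetical) marker word
print(is_section_header(['Case', 'TestName1', ''], marker='Case'))
```

With this in place, the marker word only has to be changed in one spot.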

The basic idea here is that we read each section into a list of lists, then write that list of lists back out using a list comprehension to transpose it (adding blank elements where the rows are uneven).
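The transpose-with-padding step can also be tried in isolation. The sketch below swaps the hand-written comprehension for itertools.zip_longest, which pads short rows for you. The helper name transpose_padded is made up, and the last row is deliberately shortened to show the padding.

```python
from itertools import zip_longest

def transpose_padded(rows, fill=''):
    """Transpose a list of rows, padding short rows with `fill`."""
    return [list(col) for col in zip_longest(*rows, fillvalue=fill)]

section = [
    ['TestAttribute4-1', 'TestAttribute4-2', 'TestAttribute4-3'],
    ['TestAttributeValue4-1-1', 'TestAttributeValue4-1-2', 'TestAttributeValue4-1-3'],
    ['TestAttributeValue4-2-1', 'TestAttributeValue4-2-2'],  # deliberately short row
]

# Each output line is one attribute followed by its values, tab-separated
for row in transpose_padded(section):
    print('\t'.join(row))
```

The final output line ends with an empty field, because the shortened row has no third value.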

I'd use a solution based on the itertools.groupby function and the csv module. Take a close look at the itertools documentation; it is useful more often than you might think!

I've used the blank lines to separate the datasets. This approach is lazily evaluated, so only one dataset is held in memory at a time:

import csv
from itertools import groupby

# In Python 3, open csv files in text mode with newline='' (not 'wb')
with open('my_data.csv', newline='') as ifile, \
        open('my_out_data.csv', 'w', newline='') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')
    # Skip info line
    next(reader)
    # Group rows by whether they are non-empty, then keep only the
    # non-empty groups (blank lines act as dataset separators)
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header (the test name is the second field of the first row)
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])

Output, given that the supplied data is stored in my_data.csv:

TestName1
TestAttribute1-1    TestAttributeValue1-1
TestAttribute1-2    TestAttributeValue1-2
TestAttribute1-3    TestAttributeValue1-3

TestName2
TestAttribute2-1    TestAttributeValue2-1
TestAttribute2-2    TestAttributeValue2-2
TestAttribute2-3    TestAttributeValue2-3

TestName3
TestAttribute3-1    TestAttributeValue3-1
TestAttribute3-2    TestAttributeValue3-2
TestAttribute3-3    TestAttributeValue3-3

TestName4
TestAttribute4-1    TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2    TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3    TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
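To see how groupby splits the rows on blank lines, here is a small in-memory sketch. The sample text and the helper name split_sections are made up for illustration. It relies on csv.reader returning an empty list for a blank line, so bool(row) is False exactly on the separators.

```python
import csv
import io
from itertools import groupby

# Made-up sample in the same layout as the question's file (without the info line)
sample = (
    "Test,TestName1,\n"
    "A1,A2\n"
    "V1,V2\n"
    "\n"
    "Test,TestName2,\n"
    "B1,B2\n"
    "V3,V4\n"
)

def split_sections(rows):
    """Yield lists of consecutive non-empty rows; empty rows act as separators."""
    for non_empty, group in groupby(rows, key=bool):
        if non_empty:
            yield list(group)

sections = list(split_sections(csv.reader(io.StringIO(sample))))
for section in sections:
    # section[0] is the 'Test,Name,' header row; the rest are attribute/value rows
    print(section[0][1], len(section) - 1)
```

Because groupby only groups consecutive items, this works only if the datasets really are separated by blank lines, as they are in the question's file.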
