Python-将结构化文本解析为Excel

Question

I need to convert a huge number of files in structured text format into excel (csv would work) to be able to merge them with some other data I have. 我需要将大量结构化文本格式的文件转换为excel（csv可以使用），以便将它们与我拥有的其他数据合并。 Here is a sample of the text: 这是文本示例：

   FILER:

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         NORTHQUEST CAPITAL FUND  INC
        CENTRAL INDEX KEY:          0001142728
        IRS NUMBER:             223772454
        STATE OF INCORPORATION:         NJ
        FISCAL YEAR END:            1231

    FILING VALUES:
        FORM TYPE:      NSAR-A
        SEC ACT:        1940 Act
        SEC FILE NUMBER:    811-10419
        FILM NUMBER:        03805344

    BUSINESS ADDRESS:   
        STREET 1:       16 RIMWOOD LANE
        CITY:           COLTS NECK
        STATE:          NJ
        ZIP:            07722
        BUSINESS PHONE:     7328423504

    FORMER COMPANY: 
        FORMER CONFORMED NAME:  NORTHPOINT CAPITAL FUND INC
        DATE OF NAME CHANGE:    20010615
</SEC-HEADER>
<DOCUMENT>
<TYPE>NSAR-A
<SEQUENCE>1
<FILENAME>answer.fil
<DESCRIPTION>ANSWER.FIL
<TEXT>
<PAGE>      PAGE  1
000 A000000 06/30/2003
000 C000000 0001142728
000 D000000 N
000 E000000 NF
000 F000000 Y
000 G000000 N
000 H000000 N
000 I000000 6.1
000 J000000 A
001 A000000 NORTHQUEST CAPITAL FUND, INC.
001 B000000 811-10493
001 C000000 7328921057
002 A000000 16 RIMWOOD LANE
002 B000000 COLTS NECK
002 C000000 NJ
002 D010000 07722
003  000000 N
004  000000 N
005  000000 N
006  000000 N
007 A000000 N
007 B000000  0
007 C010100  1
007 C010200  2
007 C010300  3
007 C010400  4
007 C010500  5
007 C010600  6
007 C010700  7
007 C010800  8
007 C010900  9
007 C011000 10
008 A000001 EMERALD RESEARCH CORP.
008 B000001 A
008 C000001 801-60455
008 D010001 BRICK
008 D020001 NJ
008 D030001 08724
013 A000001 SANVILLE & COMPANY
013 B010001 ABINGTON
013 B020001 PA
013 B030001 19001
015 A000001 FLEET BANK
015 B000001 C
015 C010001 POINT PLEASANT BEACH
015 C020001 NJ
015 C030001 08742
015 E030001 X
018  000000 Y
019 A000000 N
019 B000000    0
<PAGE>      PAGE  2
020 A000001 SCHWAB
020 B000001 94-1737782
020 C000001      0
020 A000002 BESTVEST BROOKERAGE
020 B000002 23-1452837
020 C000002      0

and it continues to page 8 of the same structure. 并继续到相同结构的第8页。 The information about the company's name should go into relative columns and the rest should be like the first two values are the column names and the third value would be the value of the row. 有关公司名称的信息应放在相对的列中，其余的应该像前两个值是列名，第三个值是行的值。

I was trying to work it out with pyparsing but haven't been able to successfully do so. 我试图用pyparsing解决问题，但未能成功完成。 Any comment on the approach would be helpful. 关于该方法的任何评论都将有所帮助。

Answer 1

The way you describe it, these are like key:value pairs for each file. 按照您的描述方式，它们就像每个文件的key：value对。 I would handle the parsing part like this: 我将像这样处理解析部分：

import sys
import re
import csv

colonseperated = re.compile(' *(.+) *: *(.+) *')
fixedfields = re.compile('(\d{3} \w{7}) +(.*)')

matchers = [colonseperated, fixedfields]

outfile = csv.writer(open('out.csv', 'w'))

outfile.writerow(['Filename', 'Key', 'Value'])
for filename in sys.argv[1:]:   
    for line in open(filename):
        line = line.strip()
        for matcher in matchers:
            match = matcher.match(line)
            if match:
                outfile.writerow([filename] + list(match.groups()))

You can call this something like parser.py and call it with python parser.py *.infile or whatever your filename convention is. 您可以将其称为parser.py类的parser.py ，并使用python parser.py *.infile或您的任何文件名约定进行调用。 It will create a csv file with three columns: a filename, a key and a value. 它将创建一个包含三列的csv文件：文件名，键和值。 You can open this in excel and then use a pivot table to get the values into the correct format. 您可以在excel中打开它，然后使用数据透视表将值转换为正确的格式。

Alternatively you can use this: 或者，您可以使用以下方法：

import csv

headers = []
rows = {}
filenames = []

outfile = csv.writer(open('flat.csv', 'w'))
infile = csv.reader(open('out.csv'))
infile.next()

for filename, key, value in infile:
    if not filename in rows:
        rows[filename] = {}
        filenames.append(filename)
    if key not in headers:
        headers.append(key)
    rows[filename][key] = value

outfile.writerow(headers)
for filename in filenames:
    outfile.writerow([rows[filename].get(header, '') for header in headers])

Python-将结构化文本解析为Excel

问题描述

1 个解决方案

解决方案1
2 已采纳 2013-09-21 09:38:05

Python-将结构化文本解析为Excel

问题描述

1 个解决方案

解决方案1 2 已采纳 2013-09-21 09:38:05

解决方案1
2 已采纳 2013-09-21 09:38:05