繁体   English   中英

Python-将结构化文本解析为Excel

[英]Python - Parsing structured text to excel

我需要将大量结构化文本格式的文件转换为excel(csv可以使用),以便将它们与我拥有的其他数据合并。 这是文本示例:

   FILER:

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         NORTHQUEST CAPITAL FUND  INC
        CENTRAL INDEX KEY:          0001142728
        IRS NUMBER:             223772454
        STATE OF INCORPORATION:         NJ
        FISCAL YEAR END:            1231

    FILING VALUES:
        FORM TYPE:      NSAR-A
        SEC ACT:        1940 Act
        SEC FILE NUMBER:    811-10419
        FILM NUMBER:        03805344

    BUSINESS ADDRESS:   
        STREET 1:       16 RIMWOOD LANE
        CITY:           COLTS NECK
        STATE:          NJ
        ZIP:            07722
        BUSINESS PHONE:     7328423504

    FORMER COMPANY: 
        FORMER CONFORMED NAME:  NORTHPOINT CAPITAL FUND INC
        DATE OF NAME CHANGE:    20010615
</SEC-HEADER>
<DOCUMENT>
<TYPE>NSAR-A
<SEQUENCE>1
<FILENAME>answer.fil
<DESCRIPTION>ANSWER.FIL
<TEXT>
<PAGE>      PAGE  1
000 A000000 06/30/2003
000 C000000 0001142728
000 D000000 N
000 E000000 NF
000 F000000 Y
000 G000000 N
000 H000000 N
000 I000000 6.1
000 J000000 A
001 A000000 NORTHQUEST CAPITAL FUND, INC.
001 B000000 811-10493
001 C000000 7328921057
002 A000000 16 RIMWOOD LANE
002 B000000 COLTS NECK
002 C000000 NJ
002 D010000 07722
003  000000 N
004  000000 N
005  000000 N
006  000000 N
007 A000000 N
007 B000000  0
007 C010100  1
007 C010200  2
007 C010300  3
007 C010400  4
007 C010500  5
007 C010600  6
007 C010700  7
007 C010800  8
007 C010900  9
007 C011000 10
008 A000001 EMERALD RESEARCH CORP.
008 B000001 A
008 C000001 801-60455
008 D010001 BRICK
008 D020001 NJ
008 D030001 08724
013 A000001 SANVILLE & COMPANY
013 B010001 ABINGTON
013 B020001 PA
013 B030001 19001
015 A000001 FLEET BANK
015 B000001 C
015 C010001 POINT PLEASANT BEACH
015 C020001 NJ
015 C030001 08742
015 E030001 X
018  000000 Y
019 A000000 N
019 B000000    0
<PAGE>      PAGE  2
020 A000001 SCHWAB
020 B000001 94-1737782
020 C000001      0
020 A000002 BESTVEST BROOKERAGE
020 B000002 23-1452837
020 C000002      0

并继续到相同结构的第8页。 有关公司名称的信息应放在相对的列中,其余的应该像前两个值是列名,第三个值是行的值。

我试图用pyparsing解决问题,但未能成功完成。 关于该方法的任何评论都将有所帮助。

按照您的描述方式,它们就像每个文件的key:value对。 我将像这样处理解析部分:

import sys
import re
import csv

colonseperated = re.compile(' *(.+) *: *(.+) *')
fixedfields = re.compile('(\d{3} \w{7}) +(.*)')

matchers = [colonseperated, fixedfields]

outfile = csv.writer(open('out.csv', 'w'))

outfile.writerow(['Filename', 'Key', 'Value'])
for filename in sys.argv[1:]:   
    for line in open(filename):
        line = line.strip()
        for matcher in matchers:
            match = matcher.match(line)
            if match:
                outfile.writerow([filename] + list(match.groups()))

您可以将其称为parser.py类的parser.py ,并使用python parser.py *.infile或您的任何文件名约定进行调用。 它将创建一个包含三列的csv文件:文件名,键和值。 您可以在excel中打开它,然后使用数据透视表将值转换为正确的格式。

或者,您可以使用以下方法:

import csv

headers = []
rows = {}
filenames = []

outfile = csv.writer(open('flat.csv', 'w'))
infile = csv.reader(open('out.csv'))
infile.next()

for filename, key, value in infile:
    if not filename in rows:
        rows[filename] = {}
        filenames.append(filename)
    if key not in headers:
        headers.append(key)
    rows[filename][key] = value

outfile.writerow(headers)
for filename in filenames:
    outfile.writerow([rows[filename].get(header, '') for header in headers])

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM