[英]Python - Parsing structured text to excel
我需要将大量结构化文本格式的文件转换为excel(csv可以使用),以便将它们与我拥有的其他数据合并。 这是文本示例:
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC
CENTRAL INDEX KEY: 0001142728
IRS NUMBER: 223772454
STATE OF INCORPORATION: NJ
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: NSAR-A
SEC ACT: 1940 Act
SEC FILE NUMBER: 811-10419
FILM NUMBER: 03805344
BUSINESS ADDRESS:
STREET 1: 16 RIMWOOD LANE
CITY: COLTS NECK
STATE: NJ
ZIP: 07722
BUSINESS PHONE: 7328423504
FORMER COMPANY:
FORMER CONFORMED NAME: NORTHPOINT CAPITAL FUND INC
DATE OF NAME CHANGE: 20010615
</SEC-HEADER>
<DOCUMENT>
<TYPE>NSAR-A
<SEQUENCE>1
<FILENAME>answer.fil
<DESCRIPTION>ANSWER.FIL
<TEXT>
<PAGE> PAGE 1
000 A000000 06/30/2003
000 C000000 0001142728
000 D000000 N
000 E000000 NF
000 F000000 Y
000 G000000 N
000 H000000 N
000 I000000 6.1
000 J000000 A
001 A000000 NORTHQUEST CAPITAL FUND, INC.
001 B000000 811-10493
001 C000000 7328921057
002 A000000 16 RIMWOOD LANE
002 B000000 COLTS NECK
002 C000000 NJ
002 D010000 07722
003 000000 N
004 000000 N
005 000000 N
006 000000 N
007 A000000 N
007 B000000 0
007 C010100 1
007 C010200 2
007 C010300 3
007 C010400 4
007 C010500 5
007 C010600 6
007 C010700 7
007 C010800 8
007 C010900 9
007 C011000 10
008 A000001 EMERALD RESEARCH CORP.
008 B000001 A
008 C000001 801-60455
008 D010001 BRICK
008 D020001 NJ
008 D030001 08724
013 A000001 SANVILLE & COMPANY
013 B010001 ABINGTON
013 B020001 PA
013 B030001 19001
015 A000001 FLEET BANK
015 B000001 C
015 C010001 POINT PLEASANT BEACH
015 C020001 NJ
015 C030001 08742
015 E030001 X
018 000000 Y
019 A000000 N
019 B000000 0
<PAGE> PAGE 2
020 A000001 SCHWAB
020 B000001 94-1737782
020 C000001 0
020 A000002 BESTVEST BROOKERAGE
020 B000002 23-1452837
020 C000002 0
并继续到相同结构的第8页。 有关公司名称的信息应放在相对的列中,其余的应该像前两个值是列名,第三个值是行的值。
我试图用pyparsing解决问题,但未能成功完成。 关于该方法的任何评论都将有所帮助。
按照您的描述方式,它们就像每个文件的key:value对。 我将像这样处理解析部分:
import sys
import re
import csv
colonseperated = re.compile(' *(.+) *: *(.+) *')
fixedfields = re.compile('(\d{3} \w{7}) +(.*)')
matchers = [colonseperated, fixedfields]
outfile = csv.writer(open('out.csv', 'w'))
outfile.writerow(['Filename', 'Key', 'Value'])
for filename in sys.argv[1:]:
for line in open(filename):
line = line.strip()
for matcher in matchers:
match = matcher.match(line)
if match:
outfile.writerow([filename] + list(match.groups()))
您可以将其称为parser.py
类的parser.py
,并使用python parser.py *.infile
或您的任何文件名约定进行调用。 它将创建一个包含三列的csv文件:文件名,键和值。 您可以在excel中打开它,然后使用数据透视表将值转换为正确的格式。
或者,您可以使用以下方法:
import csv
headers = []
rows = {}
filenames = []
outfile = csv.writer(open('flat.csv', 'w'))
infile = csv.reader(open('out.csv'))
infile.next()
for filename, key, value in infile:
if not filename in rows:
rows[filename] = {}
filenames.append(filename)
if key not in headers:
headers.append(key)
rows[filename][key] = value
outfile.writerow(headers)
for filename in filenames:
outfile.writerow([rows[filename].get(header, '') for header in headers])
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.