Parsing data from a text file

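I built a contact form that sends me an email every time a user signs up. My question is about parsing some text data into CSV format. I have received several users' details in my mailbox and copied them into a single text file. The data looks like this.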

Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes

Name: testuser4
Email: tuser4@yahoo.com
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2

Name: Test User
Email: testuser@yahoo.co.in
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2

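As you can see, the data has some fields that are common to every record and some that are not always present. I am looking for a solution or suggestions on how to parse this data so that the names end up in a column under a "Name" heading, and likewise for the other fields. For the "Members Participating" data, I want to pick up the number and put it in the Excel sheet under the same heading; if that information is missing for a user, the cell can simply be left blank.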

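The following program may meet your requirements. The overall strategy:

  • First, read all of the email files and parse the data "by hand", then
  • Second, write the data to a CSV file using csv.DictWriter.writerows().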

import sys
import csv

# Usage:
# python cfg2csv.py input1.cfg input2.cfg ...
# The data is combined and written to 'output.csv'

def parse_file(data):
    total_result = []
    single_result = []
    for line in data:
        line = line.strip()
        if line:
            single_result.append([item.strip() for item in line.split(':', 1)])
        else:
            if single_result:
                total_result.append(dict(single_result))
            single_result = []
    if single_result:
        total_result.append(dict(single_result))
    return total_result

def read_file(filename):
    with open(filename) as fp:
        return parse_file(fp)

# First parse the data:
data = sum((read_file(filename) for filename in sys.argv[1:]), [])
keys = set().union(*data)

# Next write the data to a CSV file
# newline='' lets the csv module control line endings (Python 3)
with open('output.csv', 'w', newline='') as fp:
    writer = csv.DictWriter(fp, sorted(keys))
    writer.writeheader()
    writer.writerows(data)
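
Because the parsing step accepts any iterable of lines, it can be tried without touching the file system. The following is a minimal sketch, assuming a condensed variant of parse_file (it uses str.partition instead of split) and a made-up two-record sample; it also shows that sorted(keys) yields alphabetical column order rather than the order in the input.

```python
def parse_file(lines):
    """Group 'Key: value' lines into one dict per blank-line-separated record."""
    records, current = [], []
    for line in lines:
        line = line.strip()
        if line:
            key, _, value = line.partition(':')
            current.append((key.strip(), value.strip()))
        elif current:
            # A blank line closes the record collected so far
            records.append(dict(current))
            current = []
    if current:
        # The last record may not be followed by a blank line
        records.append(dict(current))
    return records

# Hypothetical sample, shortened from the data in the question
sample = [
    "Name: testuser2\n",
    "Coming: Yes\n",
    "\n",
    "Name: tuser5\n",
    "Coming: Yes\n",
    "Members Participating: 2\n",
]

records = parse_file(sample)
keys = sorted(set().union(*records))  # union of all keys, alphabetical
print(records)
print(keys)
```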

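You can use the blank line between records to mark the end of a record. Then process the input file line by line and build a list of dictionaries. Finally, write the dictionaries out to a CSV file.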

from csv import DictWriter
from collections import OrderedDict

with open('input') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:    # handle EOF
            registrations.append(d)


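fieldnames = list(fields)    # order in which each key was first seen
# Uncomment to force a specific column order instead:
# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']

# newline='' lets the csv module control line endings (Python 3)
with open('registrations.csv', 'w', newline='') as outfile:
    writer = DictWriter(outfile, fieldnames=fieldnames)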
    writer.writeheader()
    writer.writerows(registrations)

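This code tries to collect the field names automatically and uses the order in which each unique key is first seen in the input. If you need a specific field order in the output, you can pin it by uncommenting the corresponding line.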

Running this code on the sample input produces the following result:

Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o  b,12346971239,Yes,
testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes,
testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes,
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2
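
When the column order is fixed by hand, a record may contain a key that is not in the list, which makes DictWriter raise ValueError by default. The sketch below is one way to handle that, assuming it is acceptable to drop such keys; the 'Notes' key and the in-memory buffer are illustrative only and not part of the original data.

```python
import csv
import io

FIELDNAMES = ['Name', 'Email', 'Cluster Name', 'Contact No.',
              'Coming', 'Members Participating']

# Shortened records; 'Notes' is a hypothetical unexpected key
registrations = [
    {'Name': 'testuser2', 'Coming': 'Yes'},
    {'Name': 'tuser5', 'Coming': 'Yes',
     'Members Participating': '2', 'Notes': 'unexpected'},
]

buffer = io.StringIO()
# extrasaction='ignore' drops keys missing from FIELDNAMES instead of raising
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, extrasaction='ignore')
writer.writeheader()
writer.writerows(registrations)

csv_lines = buffer.getvalue().splitlines()
for line in csv_lines:
    print(line)
```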

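Let's break the problem down into smaller sub-problems:

  1. Split the large block of text into separate registrations
  2. Convert each registration into a dictionary
  3. Write the list of dictionaries to CSV

First, let's split the block of registration data into separate elements: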

DATA = '''
Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes
'''

def parse_registrations(data):
    data = data.strip()
    return data.split('\n\n')

This function gives us a list with one string per registration:

>>> regs = parse_registrations(DATA)
>>> regs
['Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes', 'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes']
>>> regs[0]
'Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes'
>>> regs[1]
'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes'

Next, we can turn each of these substrings into a list of (key, value) pairs:

>>> [field.split(': ', 1) for field in regs[0].split('\n')]
[['Name', 'testuser2'], ['Email', 'testuser2@gmail.com'], ['Cluster Name', 'o  b'], ['Contact No.', '12346971239'], ['Coming', 'Yes']]

The dict() function can convert a list of (key, value) pairs into a dictionary:

>>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
{'Coming': 'Yes', 'Cluster Name': 'o  b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': 'testuser2@gmail.com'}

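We can pass these dictionaries to csv.DictWriter to write the records in CSV format; any missing value is filled in with a default (an empty string unless restval is set).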

>>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])
>>> w.writeheader()
>>> w.writerow({'Name': 'Steve'})
12
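
The default used for missing values can be changed through the restval parameter of csv.DictWriter. A small sketch, assuming a placeholder of "N/A" is wanted instead of a blank cell (the question itself asks for blanks, so this is only an illustration), written to an in-memory buffer and read back with csv.DictReader:

```python
import csv
import io

COLUMNS = ["Name", "Email", "Members Participating"]

buffer = io.StringIO()
# restval is written for every column that a row dict does not contain
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="N/A")
writer.writeheader()
writer.writerow({"Name": "Steve"})

# Read the CSV text back to inspect what was stored
rows = [dict(row) for row in csv.DictReader(io.StringIO(buffer.getvalue()))]
print(rows)
```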

Now, let's put it all together!

import csv

DATA = '''
Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2
'''

COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]

def parse_registration(reg):
    return dict(field.split(': ', 1) for field in reg.split('\n'))

def parse_registrations(data):
    data = data.strip()
    regs = data.split('\n\n')
    return [parse_registration(r) for r in regs]

def write_csv(data, filename):
    regs = parse_registrations(data)
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(regs)

if __name__ == '__main__':
    write_csv(DATA, "/tmp/test.csv")

Output:

$ python3 write_csv.py

$ cat /tmp/test.csv
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o  b,12346971239,Yes,
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
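
Splitting on '\n\n' assumes Unix line endings and exactly one empty line between registrations. Text copied out of an email client may not be that tidy, so here is a hedged sketch of a more tolerant parse_registrations, assuming Windows line endings, repeated blank lines and stray lines without a colon should all be accepted; the messy sample string is invented for the demonstration.

```python
import re

def parse_registrations(data):
    """Split text into registrations, tolerating CRLF and extra blank lines."""
    text = data.replace('\r\n', '\n').strip()
    records = []
    # One or more blank (or whitespace-only) lines separate registrations
    for block in re.split(r'\n\s*\n', text):
        record = {}
        for line in block.split('\n'):
            key, sep, value = line.partition(':')
            if sep:  # ignore lines that contain no colon
                record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records

# Hypothetical messy input: CRLF endings and several blank lines in a row
messy = ("Name: a\r\nComing: Yes\r\n\r\n\r\n  \r\n"
         "Name: b\r\nMembers Participating: 2\r\n")
parsed = parse_registrations(messy)
print(parsed)
```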

The following converts your input text file to a CSV file automatically. The header row is generated from the entry with the most fields.

import csv
import re

# Text mode with newline="" is what the csv module expects in Python 3
with open("input.txt", "r") as f_input, open("output.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    # strip() keeps a trailing newline from ending up inside the last entry
    entries = re.findall(r"^(Name: .*?)(?:\n\n|\Z)", f_input.read().strip(), re.M | re.S)

    # Determine the entry with the most fields for the CSV headers
    headings = []
    for entry in entries:
        headings = max(headings, [line.split(":", 1)[0] for line in entry.split("\n")], key=len)
    csv_output.writerow(headings)

    # Write the entries (split on the first colon only, so values may contain ':')
    for entry in entries:
        csv_output.writerow([line.split(":", 1)[1].strip() for line in entry.split("\n")])

This produces a CSV text file that can be opened in Excel, as follows:

Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o  b,12346971239,Yes
testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes
testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2
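
In the output above the shorter rows simply stop early, and values are matched to headings by position, which only works when every entry lists its fields in the same order. The following sketch is one possible variation, assuming values should instead be matched to headings by name and missing ones written as empty cells; the sample text (with the second entry's fields deliberately reordered) and the in-memory buffer are illustrative.

```python
import csv
import io
import re

# Hypothetical sample; the second entry lists its fields in a different order
text = ("Name: testuser2\nComing: Yes\n\n"
        "Name: tuser5\nMembers Participating: 2\nComing: Yes\n")

entries = re.findall(r"^(Name: .*?)(?:\n\n|\Z)", text.strip(), re.M | re.S)

headings = []  # every field name, in order of first appearance
records = []   # one dict per entry
for entry in entries:
    record = {}
    for line in entry.split("\n"):
        key, _, value = line.partition(":")
        key = key.strip()
        record[key] = value.strip()
        if key not in headings:
            headings.append(key)
    records.append(record)

buffer = io.StringIO()
csv_output = csv.writer(buffer)
csv_output.writerow(headings)
for record in records:
    # Look each value up by heading so every row has the same width
    csv_output.writerow([record.get(heading, "") for heading in headings])

csv_lines = buffer.getvalue().splitlines()
for line in csv_lines:
    print(line)
```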
