Reading in data from file using regex in Python

I have a data file with tons of data like:

{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}

I want to read in the data and save it in a list. I am having trouble getting exactly the right code to extract the data between the { }. I don't want the quotes or the ` after the numbers. Also, the data is not separated by line, so how do I tell re.search where to begin looking for the next set of data?

At first glance, you can break this data into chunks by splitting it on the string },{ :

chunks = data.split('},{')
chunks[0] = chunks[0][1:]      # first chunk started with '{'
chunks[-1] = chunks[-1][:-1]   # last chunk ended with '}'

Now you have chunks like

"Passenger Quarters",27.`,"Cardassian","not injured"

and you can apply a regular expression to them.
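
As a minimal sketch of that last step, assuming the chunks come from the split shown above, one pattern can pull out either the text inside quotes or the leading digits of a number. The pattern and the name `field_pattern` are illustrative choices, not part of the original answer.

```python
import re

data = ('{"Passenger Quarters",27.`,"Cardassian","not injured"},'
        '{"Bridge","Unknown","Romulan","not injured"}')

# Split into chunks as described above.
chunks = data.split('},{')
chunks[0] = chunks[0][1:]      # first chunk started with '{'
chunks[-1] = chunks[-1][:-1]   # last chunk ended with '}'

# Hypothetical pattern: quoted text (group 1) or a run of digits (group 2).
field_pattern = re.compile(r'"([^"]*)"|(\d+)')

records = []
for chunk in chunks:
    # findall returns one (quoted, number) tuple per field; one of the two is ''.
    record = [quoted if number == '' else number
              for quoted, number in field_pattern.findall(chunk)]
    records.append(record)

print(records)
```

Because the quoted alternative is tried first at each position, digits that appear inside a quoted string stay part of that string instead of being split off as a separate field.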

The following will produce a list of lists, where each list is an individual record.

data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'

# remove characters we don't want and split into individual fields
# (Python 3: str.translate takes a table built by str.maketrans)
badchars = '{}`."'
newdata = data.translate(str.maketrans('', '', badchars))
fields = newdata.split(',')

# Assemble groups of 4 fields into separate lists and append
# to the parent list. Obvious weakness here is if there are
# records that contain something other than 4 fields
records = []
myrecord = []
for field in fields:
    myrecord.append(field)
    if len(myrecord) == 4:
        records.append(myrecord)
        myrecord = []

for record in records:
    print(record)

Output:

['Passenger Quarters', '27', 'Cardassian', 'not injured']
['Passenger Quarters', '9', 'Cardassian', 'injured']
['Passenger Quarters', '32', 'Romulan', 'not injured']
['Bridge', 'Unknown', 'Romulan', 'not injured']
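
The grouping loop assumes exactly four fields per record. One way around that weakness, sketched here under the assumption that no field contains a brace or a comma, is to split the records apart first and then split each record into fields. The helper name `parse_records` is hypothetical.

```python
def parse_records(data):
    """Split brace-delimited records, then comma-separated fields."""
    records = []
    # Strip the outer braces, then cut between records.
    for chunk in data.strip('{}').split('},{'):
        # Remove surrounding quotes and the ".`" that trails the numbers.
        fields = [field.strip('"`.') for field in chunk.split(',')]
        records.append(fields)
    return records


sample = '{"Bridge","Unknown","Romulan"},{"Passenger Quarters",9.`,"Cardassian","injured"}'
records = parse_records(sample)
print(records)
```

Each record keeps however many fields it actually has, so a three-field record no longer shifts the fields of the records after it. Note that `strip` also removes a period at the very start or end of a text field, which is harmless for the sample data but is an assumption about the real file.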

You should do this in two passes: one to get the list of items and one to get the contents of each item:

import re
from pprint import pprint

data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'

# This splits up the data into items where each item is the
# contents inside a pair of braces
item_pattern = re.compile(r'\{([^}]+)\}')

# This splits up each item into its parts. Either matching a string
# inside quotation marks or a number followed by some garbage
contents_pattern = re.compile(r'(?:"([^"]+)"|([0-9]+)[^,]+),?')

rows = []
for item in item_pattern.findall(data):
    row = []
    for content in contents_pattern.findall(item):
        if content[1]: # Number matched, treat it as one
            row.append(int(content[1]))
        else: # Number not matched, use the string (even if empty)
            row.append(content[0])
    rows.append(row)

pprint(rows)
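
The question also asks how to keep searching when records are not separated by lines. `re.finditer` scans the whole string and yields each non-overlapping match in turn, so no manual position tracking is needed. The sketch below assumes the file is small enough to read into memory; the helper name `read_items` is hypothetical, and a temporary file stands in for the real data file.

```python
import os
import re
import tempfile

item_pattern = re.compile(r'\{([^}]+)\}')


def read_items(path):
    """Return the text inside each pair of braces in the file."""
    with open(path, encoding='utf-8') as handle:
        text = handle.read()
    # finditer resumes after the end of the previous match automatically.
    return [match.group(1) for match in item_pattern.finditer(text)]


# Write a small sample file so the example is self-contained.
sample = ('{"Bridge","Unknown","Romulan","not injured"},'
          '{"Passenger Quarters",9.`,"Cardassian","injured"}')
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)

items = read_items(tmp.name)
os.remove(tmp.name)
print(items)
```

Each element of `items` can then be passed to a per-item pattern such as `contents_pattern` above.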
