Reading in data from file using regex in Python
I have a data file containing a lot of data, for example:
{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}
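I want to read the data in and save it in a list. I am having trouble getting the code right to correctly pick out the data between the { }. I also don't want the quotation marks or the .` that follows the numbers. In addition, the data is not separated by lines, so how do I tell re.search where to start looking for the next set of data?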
At first glance, you can break the data into chunks by splitting it on the string },{:
chunks = data.split('},{')
chunks[0] = chunks[0][1:] # first chunk started with '{'
chunks[-1] = chunks[-1][:-1] # last chunk ended with '}'
Now you have chunks like:
"Passenger Quarters",27.`,"Cardassian","not injured"
You can then apply regular expressions to each of them.
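As a rough sketch of that last step, the snippet below applies one regular expression to each chunk. The pattern and the helper name parse_chunk are assumptions made for illustration and are not part of the original answer; a field is treated as either a quoted string or a run of digits followed by junk up to the next comma.

```python
import re

data = ('{"Passenger Quarters",27.`,"Cardassian","not injured"},'
        '{"Bridge","Unknown","Romulan","not injured"}')

# Split into chunks on the '},{' separator and trim the outer braces
chunks = data.split('},{')
chunks[0] = chunks[0][1:]      # first chunk started with '{'
chunks[-1] = chunks[-1][:-1]   # last chunk ended with '}'

# Hypothetical pattern: a quoted string, or digits followed by
# anything that is not a comma (the trailing ".`" in this data)
field_pattern = re.compile(r'"([^"]*)"|(\d+)[^,]*')

def parse_chunk(chunk):
    """Return the fields of one chunk, converting numeric fields to int."""
    fields = []
    for text, number in field_pattern.findall(chunk):
        fields.append(int(number) if number else text)
    return fields

records = [parse_chunk(chunk) for chunk in chunks]
print(records)
```

Each chunk yields one list, so records with a different number of fields still end up in their own list.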
The following produces a list of lists, where each inner list is a single record.
data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Pssenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'

# remove characters we don't want and split into individual fields
badchars = ['{', '}', '`', '.', '"']
# Python 3: str.translate takes a table; maketrans builds one that deletes badchars
newdata = data.translate(str.maketrans('', '', ''.join(badchars)))
fields = newdata.split(',')

# Assemble groups of 4 fields into separate lists and append
# to the parent list. Obvious weakness here is if there are
# records that contain something other than 4 fields
records = []
myrecord = []
recordcount = 1
for field in fields:
    myrecord.append(field)
    recordcount = recordcount + 1
    if recordcount > 4:
        records.append(myrecord)
        myrecord = []
        recordcount = 1

for record in records:
    print(record)
Output:
['Passenger Quarters', '27', 'Cardassian', 'not injured']
['Passenger Quarters', '9', 'Cardassian', 'injured']
['Pssenger Quarters', '32', 'Romulan', 'not injured']
['Bridge', 'Unknown', 'Romulan', 'not injured']
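The grouping loop can also be written with slicing. This is only a minimal sketch for Python 3, resting on the same assumption that every record has exactly four fields; the constant FIELDS_PER_RECORD and the length check are additions for illustration, not part of the answer above.

```python
data = ('{"Passenger Quarters",27.`,"Cardassian","not injured"},'
        '{"Bridge","Unknown","Romulan","not injured"}')

# Same cleanup idea: delete braces, backticks, dots and quotes
cleaned = data.translate(str.maketrans('', '', '{}`."'))
fields = cleaned.split(',')

FIELDS_PER_RECORD = 4  # assumption: every record has exactly four fields

# Fail loudly rather than silently mis-grouping a malformed file
if len(fields) % FIELDS_PER_RECORD != 0:
    raise ValueError("field count is not a multiple of %d" % FIELDS_PER_RECORD)

# Take consecutive slices of four fields each
records = [fields[i:i + FIELDS_PER_RECORD]
           for i in range(0, len(fields), FIELDS_PER_RECORD)]
print(records)
```

The check only catches a wrong total count. A record with a missing field followed by one with an extra field would still be grouped incorrectly, which is the weakness the comment in the answer points out.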
You should do it in two passes: one to get the list of items, and another to get the contents of each item:
import re
from pprint import pprint

data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'

# This splits up the data into items where each item is the
# contents inside a pair of braces
item_pattern = re.compile("{([^}]+)}")

# This splits up each item into its parts, matching either a string
# inside quotation marks or a number followed by some garbage
contents_pattern = re.compile('(?:"([^"]+)"|([0-9]+)[^,]+),?')

rows = []
for item in item_pattern.findall(data):
    row = []
    for content in contents_pattern.findall(item):
        if content[1]:  # Number matched, treat it as one
            row.append(int(content[1]))
        else:  # Number not matched, use the string (even if empty)
            row.append(content[0])
    rows.append(row)

pprint(rows)
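As for telling re.search where to continue: the search method of a compiled pattern accepts a start position, so a loop can resume where the previous match ended, and finditer does the same bookkeeping for you. A small sketch, with shortened sample data and variable names chosen for illustration:

```python
import re

data = '{"Bridge","Unknown"},{"Passenger Quarters",9.`}'
item_pattern = re.compile(r'\{([^}]+)\}')

# Manual version: resume each search where the previous match ended
items = []
pos = 0
while True:
    match = item_pattern.search(data, pos)
    if match is None:
        break
    items.append(match.group(1))
    pos = match.end()

# Equivalent: finditer walks over successive non-overlapping matches
items_via_finditer = [m.group(1) for m in item_pattern.finditer(data)]

print(items)
```

In practice findall or finditer is usually the simpler choice; the explicit position is mainly useful when the loop needs to do other work between matches.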