I read a text file line by line with a Python script. What I get is a list of strings, one string per line. I now need to parse each string into more manageable data (i.e. strings, integers).
The strings look similar to this:
Now what I want is a list of split/parsed strings for each line from the text file that I can process further, for instance:
I guess regular expressions are the best fit to accomplish this, and I already managed to come up with this:
(.+)\s+((\d+)\)
which matches, for instance, [ "door", "0" ] for "door (0)". However, some items have more data to parse:
(.+)\s((\d+)+\|\)
which matches only [ "window", "1" ] for "window (1|22|4)". How can I repeat the pattern matching for the part (\d+)+\| (i.e. "1|") up to the closing parenthesis, for an undefined number of repetitions of this pattern? The last item to match would be an integer, which could be caught separately with (\d+)\).
Also is there a way to match either the simple or the extended case with a single regular expression?
Thanks! And have a nice weekend, everybody!
Here's the regex: \w+ \((\d+\|)*\d+\). But imo you should do a mix of regex and str.split:
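import re
data = []
with open("f.txt") as f:
    for line in f:
        # Capture the name and everything between the parentheses.
        match = re.search(r"(\w+) \(([^)]+)\)", line)
        if match is None:
            continue  # skip blank or malformed lines
        word, numbers = match.groups()
        data.append((word, *numbers.split("|")))  # split "1|22|4" into parts
print(data)  # [('door', '0'), ('window', '1', '22', '4')]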
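The reason for splitting rather than relying on the regex alone: in Python's re module, a repeated capturing group keeps only its last repetition. A minimal sketch of that behaviour, using one of the sample strings from the question (the extra capture groups around the name and the last number are added here for illustration):

```python
import re

# The regex from above, with capture groups added around the name
# and the final number (an addition for this demonstration).
pattern = re.compile(r"(\w+) \((\d+\|)*(\d+)\)")

match = pattern.fullmatch("window (1|22|4)")
# The repeated group (\d+\|)* is overwritten on every repetition,
# so only the last one ("22|") survives and "1|" is lost.
repeated_groups = match.groups()
print(repeated_groups)  # ('window', '22|', '4')

# Capturing the whole parenthesised part and splitting keeps every number.
word, numbers = re.fullmatch(r"(\w+) \(([^)]+)\)", "window (1|22|4)").groups()
all_numbers = numbers.split("|")
print(word, all_numbers)  # window ['1', '22', '4']
```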
import re

a = [r'door (0)',
     r'window (1|22|4)',
     r'toilet (2|6|5|10)'
     ]
for i in a:
    # \w+ picks up the name and every number, skipping "(", "|" and ")".
    print(re.findall(r'(\w+)', i))
Result:
['door', '0']
['window', '1', '22', '4']
['toilet', '2', '6', '5', '10']
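Note that \w+ on its own does not check that a line has the expected "name (n|n|...)" shape, and the numbers come back as strings. If validation and integers are wanted, something along these lines might work. This is only a sketch: the names LINE_RE and parse_line are made up for illustration, and it assumes the name consists of word characters only. The same pattern covers both the simple and the extended case.

```python
import re

# Hypothetical helper, assuming lines look like "name (1|22|4)".
# (?:\|\d+)* allows zero or more further "|number" parts, so the
# single-number case "door (0)" matches as well.
LINE_RE = re.compile(r"(\w+)\s+\((\d+(?:\|\d+)*)\)")

def parse_line(line):
    """Return (name, [ints]), or None if the line does not match."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        return None
    name, numbers = match.groups()
    return name, [int(n) for n in numbers.split("|")]

lines = ["door (0)", "window (1|22|4)", "toilet (2|6|5|10)", "broken line"]
parsed = [parse_line(line) for line in lines]
print(parsed)
# [('door', [0]), ('window', [1, 22, 4]), ('toilet', [2, 6, 5, 10]), None]
```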
Not a raw regex, but another way to extract and process this data is to use a TTP template:
from ttp import ttp
template = """
<macro>
def process_matches(data):
    data["numbers"] = data["numbers"].split("|")
    return data
</macro>
<group name="{{ thing }}" macro="process_matches">
{{ thing }} ({{ numbers }})
</group>
"""
data = """
door (0)
window (1|22|4)
toilet (2|6|5|10)
"""
parser = ttp(data, template)
parser.parse()
print(parser.result(format="pprint")[0])
The above code would produce:
[ { 'door': {'numbers': ['0']},
    'toilet': {'numbers': ['2', '6', '5', '10']},
    'window': {'numbers': ['1', '22', '4']}}]