How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?

Question

Given a string like this:

ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",

With regex, how do I get a tuple that looks like the following:

('ORTH', ['cali.ber,kl','calf','done'])

I've been doing it as such:

txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1

But i'm getting none for re.search().group(1) . How should it be done to get the desired output?

Answer 1

The reason you don't get a match is that your regex doesn't match:

r"<([A-Za-z0-9_]+)>" is missing comma, quotation marks and the space character, which all can occur inside the < > according to your sample.

This one would match:

re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)

What also may trip you up is the fact that the list of names is delimited by comma, which itself can be part of the values , unescaped.

That means you can't just split that string by ',' , but instead need to consider the two different quotation characters( ' and " ) in order to separate the fields.

So I'd use this approach:

Use re.match to split the string into PREFIX < NAMES > parts, and discard the rest.
Use re.findall() to split the names into fields according to quotation marks

Edit:

1) According to your first comment, your data can also contain a preamble before the prefix that contains newlines . The default behavior for . is to match everything except newlines .

From the Python re docs:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

So you need to construct that regex with the re.DOTALL flag. You do this by compiling it first and passing the OR ed flags:

re.compile(pattern, flags=re.DOTALL)

2) If you include the space character before PREFIX in the regex, it will only match for data that actually contains that space - but not anymore for your first piece of example data. So I use .*?([AZ\\.]*)... to cover both cases. The ? is for non-greedy matching, so it matches the shortest possible match instead of the longest.

3) To cover PREFIX.FOO just extend the pattern for the prefix to ([AZ\\.]*) by including the . character and escaping it.

Updated example covering all the cases you mentioned:

import re

TEST_VALUES = [
    """ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",""",
    """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""
]

EXPECTED = ('ORTH.FOO', ['cali.ber,kl','calf','done'])


pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)


for value in TEST_VALUES:
    prefix, names_str = pattern.match(value).groups()
    names = re.findall('[\'"](.*?)["\']', names_str)

    result = prefix, names
    assert(result == EXPECTED)

print result

How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?

Question

1 answers

solution1
2 ACCPTED 2013-08-06 17:46:16

Edit:

How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?

Question

1 answers

solution1 2 ACCPTED 2013-08-06 17:46:16

Edit:

solution1
2 ACCPTED 2013-08-06 17:46:16