How do I extract a list of elements encased in quotation marks, bounded by <> and delimited by commas - python, regex?

Given a string like this:

ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel", 

With regex, how do I get a tuple that looks like the following:

('ORTH', ['cali.ber,kl','calf','done'])

I've been doing it like this:

txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1

But I'm getting None for re.search().group(1). How should it be done to get the desired output?

The reason you don't get a match is that your regex doesn't match:

r"<([A-Za-z0-9_]+)>" is missing the comma, the period, the quotation marks and the space character, all of which can occur inside the < > according to your sample.

This one would match:

re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)

What may also trip you up is that the list of names is delimited by commas, and a comma can itself be part of a value, unescaped.

That means you can't just split that string by ',' but instead need to consider the two different quotation characters (' and ") in order to separate the fields.
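To make the difference concrete, here is a small sketch that compares a naive split on ',' with a quote-aware re.findall() on the names from the sample. The variable names are only illustrative.

```python
import re

# The part of the sample string that sits between '< ' and ' >'.
names_str = '''"cali.ber,kl", 'calf' , "done"'''

# Naive split: the first value is cut in two at its embedded comma.
naive = [part.strip() for part in names_str.split(",")]

# Quote-aware extraction: capture whatever sits between a pair of quotes.
quoted = re.findall(r'''["'](.*?)["']''', names_str)

print(naive)   # four fragments, quotes still attached
print(quoted)  # three clean values
```

The split yields four fragments because it treats the comma inside "cali.ber,kl" as a delimiter, while the findall() version returns the three values intact.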

So I'd use this approach:

  • Use re.match to split the string into PREFIX < NAMES > parts, and discard the rest.
  • Use re.findall() to split the names into fields according to quotation marks.

Edit:

1) According to your first comment, your data can also contain a preamble before the prefix, and that preamble contains newlines. The default behavior for . is to match everything except newlines.

From the Python re docs:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

So you need to construct that regex with the re.DOTALL flag. You do this by compiling it first and passing the flags (OR them together with | if you need more than one):

re.compile(pattern, flags=re.DOTALL)
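As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without re.DOTALL on a shortened, made-up input that has a newline before the prefix. The sample text and variable names are assumptions for the demo.

```python
import re

# Shortened, hypothetical input: a preamble line, then the entry itself.
text = 'calf_n1 := n_-_c_le &\n [ ORTH < "a" >,'

pattern_src = r".*?([A-Z.]*) < (.*) >"

# Without DOTALL, '.' stops at the newline, so match() never reaches ORTH.
without_flag = re.match(pattern_src, text)

# With DOTALL, the lazy '.*?' can skip over the newline in the preamble.
with_flag = re.compile(pattern_src, flags=re.DOTALL).match(text)

print(without_flag)
print(with_flag.groups())
```

The first call returns None because re.match() is anchored at the start of the string and cannot get past the first line; the second returns the prefix and the quoted part.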

2) If you include the space character before PREFIX in the regex, it will only match data that actually contains that space, so it no longer matches your first piece of example data. So I use .*?([A-Z\.]*)... to cover both cases. The ? is for non-greedy matching, so it matches the shortest possible match instead of the longest.

3) To cover PREFIX.FOO, just extend the pattern for the prefix to ([A-Z\.]*) by including the . character and escaping it (inside a character class the escape is optional, but harmless).
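The non-greedy matching mentioned in point 2 is easiest to see side by side. This sketch uses a made-up string with two bracketed entries (FORM is a hypothetical second key, not from the question) and compares a greedy (.*) with a lazy (.*?).

```python
import re

# Hypothetical input with two '< ... >' groups on one line.
text = 'ORTH < "a" >, FORM < "b" >'

# Greedy: '.*' runs to the LAST ' >' it can find.
greedy = re.search(r"< (.*) >", text)

# Non-greedy: '.*?' stops at the FIRST ' >'.
lazy = re.search(r"< (.*?) >", text)

print(greedy.group(1))
print(lazy.group(1))
```

The greedy version swallows everything up to the final ' >', which is worth keeping in mind if an input can ever hold more than one bracketed list.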

Updated example covering all the cases you mentioned:

import re

TEST_VALUES = [
    """ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",""",
    """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""
]

EXPECTED = ('ORTH.FOO', ['cali.ber,kl', 'calf', 'done'])

# DOTALL lets '.' match newlines, so the lazy '.*?' can skip a multi-line preamble.
pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)

for value in TEST_VALUES:
    # Step 1: split into the prefix and the text between '< ' and ' >'.
    prefix, names_str = pattern.match(value).groups()
    # Step 2: pull out each quoted field, whichever quote character it uses.
    names = re.findall('[\'"](.*?)["\']', names_str)

    result = prefix, names
    assert result == EXPECTED
    print(result)
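If you want to reuse this, one possible way to package the two steps is a small helper that returns None instead of raising when the input has no PREFIX < ... > part, which is the situation behind the original None problem. The function name extract_prefix_and_names and the constant names are hypothetical; the patterns are the ones from the example above.

```python
import re

# Same two patterns as in the example, compiled once.
ENTRY_RE = re.compile(r".*?([A-Z.]*) < (.*) >.*", flags=re.DOTALL)
QUOTED_RE = re.compile(r'''["'](.*?)["']''')


def extract_prefix_and_names(text):
    """Return (prefix, [names]) or None when no 'PREFIX < ... >' is found."""
    match = ENTRY_RE.match(text)
    if match is None:
        # Guard against calling .groups() on None.
        return None
    prefix, names_str = match.groups()
    return prefix, QUOTED_RE.findall(names_str)


print(extract_prefix_and_names('''ORTH < "cali.ber,kl", 'calf' , "done" >,'''))
print(extract_prefix_and_names("no angle brackets here"))
```

Checking for None before calling .groups() avoids the AttributeError you would otherwise get on input that does not match.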
