
Python regex split comma or space separated string

Below is an example of the input strings and the output I need. The numbers and strings can appear in any order and can take any value (e.g. 'THIS' could be some other string).

I need the following output

["'5'", "'THIS'", "'/,'", "'4.2560'", "'0.34000E-01'"]

for all of the following input strings:

""" 5,'THISMORETHAN4','/,',4.2560,0.34000E-01 """
""" 5,'THIS','/,',4.2560,0.34000E-01 """
"""5 , 'THIS' , '/,' , 4.2560 , 0.34000E-01 """
""" '5'  'THIS' '/,' '4.2560' '0.34000E-01' """
""" 5,'THIS','this','/,',4.2560,0.34000E-01 """
""" 5,'THIS','/,',4.2560,0.34000E-01 """

This is a continuation of a previous question.

  1. The strings can be comma separated or space separated. There may or may not be spaces before or after a separating comma.
  2. Substrings in single quotes may contain special characters (e.g. '/,' as shown above).

As an improved version of Padraic Cunningham's solution from your previous question, the regex (["']).*?\1(?<!\\["'])|[^\r\n\t\f ,]+ will capture all your fields.

The first part ( (["']).*?\1(?<!\\["']) ) now also works with fields like 'asdf"' because the surrounding quote characters have to be the same. It also works with escaped quotes because (?<!\\["']) asserts that the closing quote is not preceded by a backslash.

If the first part doesn't match (i.e. there is no string surrounded by quotes), the second part ( [^\r\n\t\f ,]+ ) matches any run of characters that are neither whitespace nor a comma. So it skips your delimiters but matches everything else.

import re

rows = [""" 5,'THISMORE"THAN4','/,',4.2560,0.34000E-01 """,
        #              ^ added quote character here
        """ 5,'TH\\'IS','/,',4.2560,0.34000E-01 """,
        #          ^ added escaped quote here
        """5 , 'THIS' , '/,' , 4.2560 , 0.34000E-01 """,
        """ '5'  'THIS' '/,' '4.2560' '0.34000E-01' """,
        """ 5,'THIS','this','/,',4.2560,0.34000E-01 """,
        """ 5,'THIS','/,',4.2560,0.34000E-01 """]

pattern = re.compile(r'(["\']).*?\1(?<!\\["\'])|[^\r\n\t\f ,]+')

result = [[m.group(0).strip('"\'') for m in pattern.finditer(row)]
          for row in rows]

import pprint
pprint.pprint(result)

Prints:

[['5', 'THISMORE"THAN4', '/,', '4.2560', '0.34000E-01'],
 ['5', "TH\\'IS", '/,', '4.2560', '0.34000E-01'],
 ['5', 'THIS', '/,', '4.2560', '0.34000E-01'],
 ['5', 'THIS', '/,', '4.2560', '0.34000E-01'],
 ['5', 'THIS', 'this', '/,', '4.2560', '0.34000E-01'],
 ['5', 'THIS', '/,', '4.2560', '0.34000E-01']]
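
To see which of the two alternatives produced each field, one can check whether capture group 1 (the opening quote) took part in the match. The following is only a minimal sketch; the helper name `classify` is made up for illustration and is not part of the answer.

```python
import re

# Same pattern as in the answer above.
pattern = re.compile(r'(["\']).*?\1(?<!\\["\'])|[^\r\n\t\f ,]+')

def classify(row):
    """Return (raw_match, branch) pairs; `classify` is a hypothetical helper."""
    fields = []
    for m in pattern.finditer(row):
        # Group 1 holds the opening quote only when the quoted branch matched.
        branch = "quoted" if m.group(1) else "bare"
        fields.append((m.group(0), branch))
    return fields

print(classify("""5 , 'THIS' , '/,' , 4.2560"""))
# [('5', 'bare'), ("'THIS'", 'quoted'), ("'/,'", 'quoted'), ('4.2560', 'bare')]
```

The raw match is kept here without stripping the quotes, so the quoted fields stay visibly different from the bare ones.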

What will still be problematic are unquoted fields that contain spaces, inside a line that is otherwise comma separated. For example,

'hello there, "I actually", have, 5, fields'

Will result in: 将导致:

['hello','there','I actually','have','5','fields']

Do you have that in your data?
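
The limitation can be reproduced directly. If a line is known to be purely comma separated, one possible alternative (not the approach of this answer) is the standard `csv` module with `skipinitialspace=True`. This is a sketch under that assumption: it would not help with the space separated rows, and for single-quoted data the `quotechar` argument would presumably have to be set to `"'"`.

```python
import csv
import re

pattern = re.compile(r'(["\']).*?\1(?<!\\["\'])|[^\r\n\t\f ,]+')
line = 'hello there, "I actually", have, 5, fields'

# The regex treats the space inside the unquoted field as a delimiter.
regex_fields = [m.group(0).strip('"\'') for m in pattern.finditer(line)]
print(regex_fields)
# ['hello', 'there', 'I actually', 'have', '5', 'fields']

# Assumption: the line is comma separated only and uses double quotes.
csv_fields = next(csv.reader([line], skipinitialspace=True))
print(csv_fields)
# ['hello there', 'I actually', 'have', '5', 'fields']
```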

This regex works for all the test cases:

(\d)\W*\'([A-Z]{0,4})\w*\'.*(\/)\W*(\d*\.\d*)\W*(\d*\.\d*E-\d*)
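
A minimal sketch of how this pattern might be applied with `re.search`, using one of the rows from the question. As far as I can tell from the pattern itself, it is tied to this exact five-field layout (one digit, an uppercase quoted word cut to four letters, a slash, a decimal, and a number with a negative exponent), and the third group captures only `/` rather than `/,`.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(
    r"(\d)\W*\'([A-Z]{0,4})\w*\'.*(\/)\W*(\d*\.\d*)\W*(\d*\.\d*E-\d*)"
)

row = """ 5,'THISMORETHAN4','/,',4.2560,0.34000E-01 """
match = pattern.search(row)

# groups() returns the five captured fields; an empty list means no match.
fields = list(match.groups()) if match else []
print(fields)
# ['5', 'THIS', '/', '4.2560', '0.34000E-01']
```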
