
regex split: ignore delimiter if followed by short substring

I have a CSV file in which pipes serve as delimiters. But sometimes a short substring follows the 3rd pipe: at most 2 alphanumeric characters. In that case the 3rd pipe should not be interpreted as a delimiter.

Example: split on each pipe:

x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"

=> split after XXL because it is followed by more than 2 characters

Examples: split on all pipes except the 3rd if there are fewer than 3 characters between pipes 3 and 4:

x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"

x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"

=> keep "1g|z4" and "a1|2" together.

My regex attempts only suffice for a substring replacement like this one, which replaces the first pipe found between 2 digits with a hyphen: 3|4 => 3-4.

x = re.sub(r'(?<=\d)\|(?=\d)', repl='-', string=x1, count=1)

My question is: if the third pipe is followed by a short alphanumeric substring of only 1 or 2 characters (like Bx, 2, 42, z or 3b), then re.split should ignore the 3rd pipe and continue with the 4th pipe. All pipes other than #3 are unconditional delimiters.

You can use re.sub to add a quotechar around the short columns. Then use Python's built-in csv module to parse the text (regex101 of the expression used):

import re
import csv
from io import StringIO

txt = """\
as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY
as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf
as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"""


pat = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))", flags=re.M)
txt = pat.sub(r'\1"\2"', txt)

reader = csv.reader(StringIO(txt), delimiter="|", quotechar='"')
for line in reader:
    print(line)

Prints:

['as234-HJ123-HG', 'dfdf KHT werg', 'XXL', 's45dtgIKU', '2017-SS0', '123.45', 'asUJY']
['as234-H344423-dfX', 'dfer XXYUyu werg', '1g|z4', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
['as234-H3wer23-dZ', 'df3r Xa12yu wg', 'a1|2', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
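
For comparison, the same rule can probably also be expressed without a regex: split on every pipe, then glue fields 3 and 4 back together when field 4 is short. This is only a minimal sketch, not the approach used above. The function name split_keep_short_fourth and the max_len parameter are hypothetical, and the criterion (1 to 2 alphanumeric characters, followed by at least one more field) is an assumption taken from the question's wording.

```python
def split_keep_short_fourth(line, max_len=2):
    """Split on '|', but merge fields 3 and 4 when field 4 is short."""
    parts = line.split("|")
    # Assumed criterion: field 4 exists, is followed by another field,
    # and consists of 1..max_len alphanumeric characters.
    if len(parts) > 4 and len(parts[3]) <= max_len and parts[3].isalnum():
        # Rejoin fields 3 and 4 with the pipe that separated them.
        parts[2:4] = [parts[2] + "|" + parts[3]]
    return parts


x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"
x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"

print(split_keep_short_fourth(x1))  # 3rd pipe treated as a delimiter
print(split_keep_short_fourth(x2))  # '1g|z4' kept as one field
```

Unlike the regex above, which accepts any non-pipe characters (including an empty field) after the 3rd pipe, this sketch insists on alphanumeric content, so the two can differ on edge cases.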

I adapted Andrej's solution as follows. Assume that the dataframe has already been imported from the CSV without parsing.

To split the dataframe's single column 0, apply a function that checks whether the 3rd pipe is a qualified delimiter.

pat1 is Andrej's pattern for identifying whether substring4 after the 3rd pipe is longer than 2 characters. If it is short, then substring3, pipe3 and substring4 are enclosed in double quotes in the text x (in a dataframe, this result type differs from the list shown by the print loop). This part could be replaced by a different regex if your own criterion for "delimiter to ignore" differs from the example.
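
To make the intermediate text x concrete, here is a small sketch that applies pat1 to two of the sample lines outside of any dataframe. The variable names short_line, long_line, quoted and unchanged are only illustrative.

```python
import re

# Same pattern as pat1: two leading fields, then substring3, the 3rd pipe
# and at most 2 non-pipe characters that are followed by another pipe.
pat1 = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))", flags=re.M)

short_line = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"
long_line = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"

quoted = pat1.sub(r'\1"\2"', short_line)    # short substring4: quotes are added
unchanged = pat1.sub(r'\1"\2"', long_line)  # long substring4: no match, text unchanged

print(quoted)
print(unchanged)
```

Only the line with the short substring4 gains double quotes around "1g|z4"; the other line passes through untouched because the pattern does not match.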

Next I replace the disqualified pipe(s), those between double quotes, with a hyphen: pat2 in re.sub. The function returns the resulting text y, which goes into the new dataframe column "out".

We can then get rid of the double quotes in the entire column. They were only needed for the replacement.

Finally, we split column "out" into multiple columns at all remaining pipe delimiters with str.split.

I suppose my 3 steps could be combined into fewer steps (first enclose the 3rd pipe in double quotes if it matches a pattern that disqualifies it as a delimiter, then replace the disqualified pipe with a hyphen, then split the text/column). But I'm happy enough that this 3-step solution works.

import re

# Assumes df already holds the raw, unparsed lines in its single column 0.

# pat1: substring3, the 3rd pipe and a short substring4 (at most 2 characters)
pat1 = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))", flags=re.M)
# pat2: a pipe enclosed between double quotes
pat2 = re.compile(r'"(.+)\|(.+)"')


# identify if 3rd pipe is a valid delimiter:
def cond_replace_3rd_pipe(row):
    # put substring3, 3rd pipe and short substring4 between double quotes
    x = pat1.sub(r'\1"\2"', row[0])
    # replace the pipe between double quotes with a hyphen
    y = pat2.sub(r'"\1-\2"', x)
    return y


df["out"] = df.apply(cond_replace_3rd_pipe, axis=1, result_type="expand")
df["out"] = df["out"].str.replace('"', "")    # double quotes no longer needed
df["out"].str.split('|', expand=True)   # split out into separate columns at all remaining pipes
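
As for combining the steps: one possible shortcut is to let a single substitution replace the disqualified 3rd pipe with a hyphen directly, skipping the double quotes altogether. The following is only a sketch under that assumption. It reuses the criterion of pat1 (at most 2 non-pipe characters followed by another pipe), and the names pat and split_line are hypothetical. It works on plain strings rather than on a dataframe.

```python
import re

# Group 1: the first two fields plus substring3.
# Group 2: a short substring4 (at most 2 non-pipe characters, as in pat1).
# The pipe between the groups is the one to be disqualified.
pat = re.compile(r"^((?:[^|]+\|){2}[^|]+)\|([^|]{,2})(?=\|)", flags=re.M)


def split_line(line):
    # Replace the disqualified 3rd pipe with a hyphen, then split on the rest.
    merged = pat.sub(r"\1-\2", line)
    return merged.split("|")


x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"
x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"

print(split_line(x1))  # 7 fields, 3rd pipe used as delimiter
print(split_line(x3))  # 7 fields, 'a1-2' kept together
```

In a dataframe, the same pattern could presumably be passed to str.replace with regex=True before the final str.split, though that variant is not shown or tested here.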


 