
Split with regex but with first character of delimiter

I have a regex like this: "[a-z|A-Z|0-9]: ", which matches one alphanumeric character, a colon, and a space. I want to split the string while keeping the alphanumeric character in the first element of the result. I cannot change the regex because in some cases the string has a special character before the colon and space.

Example:

import re

line = re.split("[a-z|A-Z|0-9]: ", "A: ") # Desired result: ['A', '']
line = re.split("[a-z|A-Z|0-9]: ", ":: )5: ") # Desired result: [':: )5', '']
line = re.split("[a-z|A-Z|0-9]: ", "Delicious :): I want to eat this again") # Desired result: ['Delicious :)', 'I want to eat this again']

Update: My actual problem is parsing a review file. Every line in the file follows the pattern [title]: [review]. I want to get the title and the review, but some titles have a special character before a colon and space, and I don't want to match those. However, the character before the colon and space that I do want to match appears to always be alphanumeric.

Solution

First of all, as your examples show, you need to match characters other than a-zA-Z0-9, so we should just use the . matcher, which matches any character (except a newline, by default).

So I think the expression you're looking for might be this one:

(.*?):(?!.*:) (.*)

You can use it like so:

import re

pattern = r"(.*?):(?!.*:) (.*)"
matcher = re.compile(pattern)

txt1 = "A: "
txt2 = ":: )5: "
txt3 = "Delicious :): I want to eat this again"

result1 = matcher.search(txt1).groups() # ('A', '')
result2 = matcher.search(txt2).groups() # (':: )5', '')
result3 = matcher.search(txt3).groups() # ('Delicious :)', 'I want to eat this again')

Explanation

We use capture groups (the parentheses) to put the different parts of the string into separate groups; search finds the match, and groups() returns the captured parts as a tuple.

The (?!.*:) part is called a "negative lookahead". It only lets the colon match when no other colon appears later on the line, so the split always happens at the last : we find.
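To see what the lookahead contributes, here is a minimal sketch that runs the same pattern with and without it. The input string and the variable names are made up for illustration; only the first pattern comes from the answer above.

```python
import re

# Pattern from the answer: the colon must be the last one on the line.
with_lookahead = re.compile(r"(.*?):(?!.*:) (.*)")
# Hypothetical variant without the lookahead, for comparison only.
without_lookahead = re.compile(r"(.*?): (.*)")

# Made-up input with an extra "colon + space" early in the title.
text = "Note: Delicious :): I want to eat this again"

last_colon = with_lookahead.search(text).groups()
first_colon = without_lookahead.search(text).groups()

print(last_colon)   # ('Note: Delicious :)', 'I want to eat this again')
print(first_colon)  # ('Note', 'Delicious :): I want to eat this again')
```

Without the lookahead, the lazy first group stops at the first colon that is followed by a space; with it, every colon that still has another colon to its right is rejected.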

Edit

BTW, if, as you mentioned, you have many lines each containing a review, you can use this snippet to get all of the reviews separated into title and body at once:

import re

pattern = r"(.*?):(?!.*:) (.*)\n?"
matcher = re.compile(pattern)

reviews = """ 
A: 
:: )5: 
Delicious :): I want to eat this again
"""

parsed_reviews = matcher.findall(reviews) # [('A', ''), (':: )5', ''), ('Delicious :)', 'I want to eat this again')]
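If the reviews come from a file rather than a string literal, a line-by-line version might look like the sketch below. The helper name parse_reviews is hypothetical, and io.StringIO stands in for a real file handle, since the actual file is not shown in the question.

```python
import io
import re

MATCHER = re.compile(r"(.*?):(?!.*:) (.*)")

def parse_reviews(lines):
    """Yield (title, review) for each line that matches; skip lines that do not."""
    for line in lines:
        # Strip only the newline so a trailing space after the colon survives.
        match = MATCHER.search(line.rstrip("\n"))
        if match:
            yield match.groups()

# Stand-in for open("reviews.txt"); the blank line is skipped.
fake_file = io.StringIO("A: \n:: )5: \n\nDelicious :): I want to eat this again\n")
parsed = list(parse_reviews(fake_file))
print(parsed)
# [('A', ''), (':: )5', ''), ('Delicious :)', 'I want to eat this again')]
```

Iterating over lines avoids loading the whole file into one string, and it makes it easy to log or count lines that do not match.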

You could split using a negative lookbehind with a single colon, or put a character class such as [:)] in the lookbehind to specify which characters must not occur directly to the left.

(?<!:):[ ]

In parts

  • (?<!:) Negative lookbehind: assert that what is directly to the left is not a colon
  • :[ ] Match a colon followed by a space (the square brackets are added only for clarity)

Regex demo | Python demo

For example

import re
pattern = r"(?<!:): "
line = re.split(pattern, "A: ") # Result: ['A', '']
print(line)
line = re.split(pattern, ":: )5: ") # Result: [':: )5', '']
print(line)
line = re.split(pattern, "Delicious :): I want to eat this again") # Result: ['Delicious :)', 'I want to eat this again']
print(line)

Output

['A', '']
[':: )5', '']
['Delicious :)', 'I want to eat this again']
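When the review body itself can contain a colon followed by a space, the plain split returns more than two parts. The sketch below compares the single-colon lookbehind, an assumed character-class variant (?<![:)]), and the maxsplit argument of split. The input line is made up for illustration.

```python
import re

single_colon = re.compile(r"(?<!:): ")   # pattern from the answer above
char_class = re.compile(r"(?<![:)]): ")  # assumed variant: ":" or ")" may not precede the colon

# Made-up line whose review body contains an emoticon followed by ": ".
line = "Title: I love it :): really"

parts_single = single_colon.split(line)
parts_class = char_class.split(line)
parts_limited = single_colon.split(line, maxsplit=1)

print(parts_single)   # ['Title', 'I love it :)', 'really']
print(parts_class)    # ['Title', 'I love it :): really']
print(parts_limited)  # ['Title', 'I love it :): really']
```

Note that the character-class variant would leave "Delicious :): I want to eat this again" unsplit, so it only fits data where a title never ends in ")". maxsplit=1 keeps the original pattern and simply stops after the first separator.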
