
Using regex to find a substring

I am facing a problem with regex usage. I am using the following regex:

\\S*the[^o\\s]*(?<!theo)\\b

The sentence that I am using is:

If the world says that theo is not oreo cookies then thetatheoder theotatheder thetatheder is extratheaterly good.

What I want from the output is the following matches: the, then, thetatheder, extratheaterly.

So in short, I am okay with 'the' (or 'The') as a complete string, or as a substring of a string that does not contain 'theo'.

How can I modify my regex to achieve this? I was thinking of applying the pipe operator or a question mark, but neither seems feasible.

Generic

If you want to design a general expression, you could start with something similar to:

\S*the[^o\s]*\b

depending on what you'd like to match and not match.
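To see what this generic expression does and does not match, a quick check can help. The following is a minimal sketch, assuming Python's re module and the sentence from the question; the variable names are only illustrative.

```python
import re

# Generic expression from above: a run of non-whitespace, "the",
# then any characters other than "o" or whitespace, up to a word boundary.
generic = re.compile(r"\S*the[^o\s]*\b")

sentence = (
    "If the world says that theo is not oreo cookies then "
    "thetatheoder theotatheder thetatheder is extratheaterly good."
)

generic_matches = generic.findall(sentence)
print(generic_matches)
# ['the', 'then', 'theotatheder', 'thetatheder', 'extratheaterly']
```

In this run, theo and thetatheoder are skipped because the o blocks the match from reaching the end of the word, while theotatheder still matches: \S* is free to consume its leading theo.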


Non-Generic

Word boundaries (\b) may be enough to solve your problem, with a simple expression similar to:

\b[Tt]he\b|\b[Tt]hen\b|\bextratheaterly\b

Or,

\b(?:[Tt]hen?|[Ee]xtratheaterly)\b

Java Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class RegularExpression{

    public static void main(String[] args){

        final String regex = "\\b(?:[Tt]hen?|[Ee]xtratheaterly)\\b";
        final String string = "If the world says that theo is not oreo cookies then thetatheoder is extratheaterly good.\n\n"
             + "If The world says that theo is not oreo cookies Then thetatheoder is Extratheaterly good.\n\n"
             + "If notthe world says that theo is not oreo cookies notthen thetatheoder is notextratheaterly good.\n\n\n";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }


    }
}

Output

Full match: the
Full match: then
Full match: extratheaterly
Full match: The
Full match: Then
Full match: Extratheaterly

Python Test

import re
string = '''
If the world says that theo is not oreo cookies then thetatheoder is extratheaterly good.

If The world says that theo is not oreo cookies Then thetatheoder is Extratheaterly good.

If notthe world says that theo is not oreo cookies notthen thetatheoder is notextratheaterly good.
'''

expression = r'\b(?:[Tt]hen?|[Ee]xtratheaterly)\b'

print(re.findall(expression, string))
print([m.group(0) for m in re.finditer(expression, string)])

Output

['the', 'then', 'extratheaterly', 'The', 'Then', 'Extratheaterly']
['the', 'then', 'extratheaterly', 'The', 'Then', 'Extratheaterly']

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it matches against some sample inputs.


RegEx Circuit

jex.im visualizes regular expressions.

\b[A-Za-z]*he([a-z](?<!theo))*\b

matches the, then, extratheaterly

\b word boundary

[A-Za-z] matches any letter

[a-z] matches any lowercase letter

* matches 0 or more

([a-z](?<!theo))*

This is the tricky part. It says: match any letter, but make sure the text does not spell theo (looking behind) after adding that letter.
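The lookbehind is only evaluated for the letters consumed after he, which is easy to check. A minimal sketch, assuming Python's re module and the sentence from the question (the variable names are illustrative):

```python
import re

pattern = re.compile(r"\b[A-Za-z]*he([a-z](?<!theo))*\b")

sentence = (
    "If the world says that theo is not oreo cookies then "
    "thetatheoder theotatheder thetatheder is extratheaterly good."
)

# finditer + group(0) is used because the pattern has a capturing group,
# so findall would return the group's content instead of the whole match.
words = [m.group(0) for m in pattern.finditer(sentence)]
print(words)
# ['the', 'then', 'theotatheder', 'thetatheder', 'extratheaterly']

# The literal part is "he", not "the", so other words qualify as well.
when_match = pattern.search("when")
print(when_match.group(0))
# when
```

In this run theotatheder is still matched, because its theo is consumed by [A-Za-z]* before the lookbehind ever runs, and a word such as when is matched because the literal part of the pattern is he rather than the.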

Look at negative lookbehinds and negative lookaheads.

You might use \S in a negative lookbehind as a start boundary, and a negative lookahead to make sure the word does not contain theo.

To match The or the, you could make the pattern case insensitive.

(?<!\S)(?!\S*theo\S*)\S*the\S*

In parts

  • (?<!\S) Negative lookbehind, assert that what is on the left is not a non-whitespace char
  • (?!\S*theo\S*) Negative lookahead, assert that what is on the right does not contain theo
  • \S*the\S* Match the, surrounded by 0+ non-whitespace chars on each side
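The three parts above can be tried out together. This is a minimal sketch, assuming Python's re module with the IGNORECASE flag for the case-insensitive variant; the second input string is a made-up example, not from the question.

```python
import re

# Whitespace-delimited variant; IGNORECASE accepts "The" as well as "the".
PATTERN = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*", re.IGNORECASE)

sentence = (
    "If the world says that theo is not oreo cookies then "
    "thetatheoder theotatheder thetatheder is extratheaterly good."
)

matches = PATTERN.findall(sentence)
print(matches)
# ['the', 'then', 'thetatheder', 'extratheaterly']

# Hypothetical extra input: "Theorem" contains "theo" case-insensitively.
capitalized = PATTERN.findall("The Theorem holds")
print(capitalized)
# ['The']
```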

If you are only using word characters, you could also make use of word boundaries \b:

\b(?!\w*theo\w*)\w*the\w*\b
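The practical difference between the \S variant and the \w variant shows up when punctuation is attached to a word. A small sketch, assuming Python's re module; the input string is a made-up example chosen to include punctuation.

```python
import re

whitespace_version = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*")
word_version = re.compile(r"\b(?!\w*theo\w*)\w*the\w*\b")

text = "the, then; theo."  # hypothetical input with trailing punctuation

with_punctuation = whitespace_version.findall(text)
words_only = word_version.findall(text)

print(with_punctuation)
# ['the,', 'then;']
print(words_only)
# ['the', 'then']
```

The \S variant treats everything between whitespace as one token, so punctuation ends up in the match; the \w variant stops at the word boundary.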

Or you might assert that a part of the word is the, and match it using an assertion that whenever you match a t it is not followed by heo:

\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b
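A minimal sketch of how this last pattern behaves on the sentence from the question, assuming Python's re module (the variable names are illustrative):

```python
import re

# Each "t" may only be consumed when it does not start "theo";
# the leading lookahead requires "the" somewhere in the rest of the token.
tempered = re.compile(r"\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b")

sentence = (
    "If the world says that theo is not oreo cookies then "
    "thetatheoder theotatheder thetatheder is extratheaterly good."
)

tempered_matches = tempered.findall(sentence)
print(tempered_matches)
# ['the', 'then', 'thetatheder', 'extratheaterly']
```

Because every t is checked as it is consumed, a theo anywhere in the word stops the match from reaching the closing word boundary.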
