
Finding the last occurrence of a word

I have the following string:

<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind

I want to find the last "SEM" start tag before the "PARTITION" tag: not the SEM end tag, but the start tag. The result should be:

<SEM>is<I>love</I>, <PARTITION />

I have tried this regular expression:

<SEM>[^<]*<PARTITION[ ]/>

but it only works if there is no other tag between the final "SEM" tag and the "PARTITION" tag. Any ideas?

Use String.IndexOf to find PARTITION and String.LastIndexOf to find SEM?

// Find the PARTITION tag, then search backward from it for the last <SEM> start tag.
int partitionIndex = text.IndexOf("<PARTITION");
int semIndex = text.LastIndexOf("<SEM>", partitionIndex);
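
A minimal sketch of how those two calls might be combined to pull out the substring the question asks for. The names `TagFinder` and `LastSemBeforePartition` are made up for illustration, and the null checks and `StringComparison.Ordinal` are my additions, not part of the answer above.

```csharp
using System;

public static class TagFinder
{
    // Hypothetical helper: returns the text from the last "<SEM>" start tag
    // before "<PARTITION" through the end of the PARTITION tag, or null if
    // either tag is missing.
    public static string LastSemBeforePartition(string text)
    {
        int partitionIndex = text.IndexOf("<PARTITION", StringComparison.Ordinal);
        if (partitionIndex < 0) return null;

        // LastIndexOf searches backward, starting at partitionIndex.
        int semIndex = text.LastIndexOf("<SEM>", partitionIndex, StringComparison.Ordinal);
        if (semIndex < 0) return null;

        // Include everything up to the '>' that closes the PARTITION tag.
        int partitionEnd = text.IndexOf('>', partitionIndex);
        if (partitionEnd < 0) return null;

        return text.Substring(semIndex, partitionEnd - semIndex + 1);
    }
}
```

For the string in the question, `TagFinder.LastSemBeforePartition(text)` returns `<SEM>is<I>love</I>, <PARTITION />`.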

And here's your goofy regex!

(?=[\s\S]*?\<PARTITION)(?![\s\S]+?\<SEM\>)\<SEM\>

What that says is "While ahead somewhere is a PARTITION tag... but while ahead is NOT another SEM tag... match a SEM tag."

Enjoy!

Here's that regex broken down:

(?=[\s\S]*?\<PARTITION) means "While ahead somewhere is a PARTITION tag"
(?![\s\S]+?\<SEM\>) means "While no further SEM tag is anywhere ahead"
\<SEM\> means "Match a SEM tag"
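
The breakdown above can be tried out with a small C# sketch. The class and method names here are hypothetical, and the pattern drops the backslashes before `<` and `>` because .NET does not need them. One caveat worth knowing: the negative lookahead scans to the end of the input, so a `<SEM>` that appears after the PARTITION tag would prevent any match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Same idea as the pattern above, without the unneeded backslashes.
    private static readonly Regex LastSem = new Regex(
        @"(?=[\s\S]*?<PARTITION)(?![\s\S]+?<SEM>)<SEM>");

    // Returns the index of the matched <SEM> start tag, or -1 if nothing matches.
    public static int FindLastSem(string text)
    {
        Match m = LastSem.Match(text);
        return m.Success ? m.Index : -1;
    }
}
```

The match is only the `<SEM>` tag itself, so getting the text up to the PARTITION tag still needs a second step, such as a Substring call from the returned index.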

If you want to use a regular expression to find the last occurrence of something, you may also want to use the right-to-left parsing regex option:

new Regex("...", RegexOptions.RightToLeft);
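
As a rough illustration of that option, here is one pattern that could fill in the "..." for this question. The pattern, `RightToLeftDemo`, and `Find` are my own assumptions rather than anything given in the answer. If the input holds several PARTITION tags, this sketch anchors on the last one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // With RightToLeft the engine scans from the end of the input: it first
    // locates the PARTITION tag, then the lazy [\s\S]*? grows leftward only
    // until it reaches the nearest <SEM> start tag.
    private static readonly Regex SemToPartition = new Regex(
        @"<SEM>[\s\S]*?<PARTITION\s*/>", RegexOptions.RightToLeft);

    // Returns the matched text, or null if there is no match.
    public static string Find(string text)
    {
        Match m = SemToPartition.Match(text);
        return m.Success ? m.Value : null;
    }
}
```

The same pattern without RegexOptions.RightToLeft would start at the first `<SEM>` in the string, which is why the option matters here.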

The solution is this; I tested it at http://regexlib.com/RETester.aspx:

<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/> 

Since you want the last one, the way to identify it is to match a start tag only when the text after it doesn't contain </SEM>.

I have included "\s*" in case there are some spaces inside <SEM> or <PARTITION/>.

Basically, what we do is exclude the closing tag </SEM> with:

(?!.*</SEM>.*)

Have you tried this:

<SEM>.*<PARTITION\s*/>

Your regular expression was matching anything but "<" after the "SEM" tag. Therefore it would stop matching when it hit the next tag, such as the closing "SEM" tag.

Bit quick-and-dirty, but try this:

(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)

and take a look at what's in the C#/.NET equivalent of $2.

The secret lies in the lazy-matching construct (.*?); I assume/hope C# supports this.
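
For what it's worth, .NET does support lazy quantifiers, and the counterpart of $2 is `Groups[2]` on the match. A small sketch, with `GroupDemo` and `SecondGroup` as made-up names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Group 1 consumes complete <SEM>...</SEM> pairs; group 2 captures the
    // final <SEM> start tag through "<PARTITION".
    private static readonly Regex Pattern = new Regex(
        @"(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)");

    // Returns the text captured by group 2, or null if there is no match.
    public static string SecondGroup(string text)
    {
        Match m = Pattern.Match(text);
        // Groups[2].Value is the .NET counterpart of Perl's $2.
        return m.Success ? m.Groups[2].Value : null;
    }
}
```

Note that group 2 ends at `<PARTITION`, so the captured text does not include the tag's closing ` />`.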

Clearly, Jon Skeet's solution will perform better, but you may want to use a regex (to simplify breaking up the bits that interest you, for example).

(Disclaimer: I'm a Perl/Python/Ruby person myself...)

