
Regex expression: number of characters containing a pattern

I have a dataframe with the following structure:

Desc_ORF                              ORF
beta-glucosidase                      tb512
succinate-semialdehyde dehydrogenase  tb111
probable epoxide hydrolase            tb045

I am using this function to filter the dataframe:

df.set_index('Desc_ORF').filter(regex=pattern, axis=0)

It works perfectly well with the other patterns I have tried, but I cannot find a regex pattern that keeps only the rows where Desc_ORF contains hydro inside a word of exactly 13 characters.

For example, my code should keep the row succinate-semialdehyde dehydrogenase because it contains the word dehydrogenase, which has 13 characters and contains the pattern hydro. On the other hand, the filter must discard probable epoxide hydrolase because, although it contains hydro, the word hydrolase does not have 13 characters.

Desc_ORF                              ORF
succinate-semialdehyde dehydrogenase  tb111

I have tried different patterns, and my latest attempt was: ^(?={13}$)(\b\S hydro\S \b). With this pattern I only filter by words that contain hydro; I cannot restrict the match to words that contain hydro and are 13 characters long.

One option to match the word in the second line could be:

(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*
  • (?<!\S) Assert a whitespace boundary on the left
  • (?=\S{13}(?!\S)) Assert 13 non-whitespace chars followed by a whitespace boundary
  • \S*hydro\S* Match hydro between optional non-whitespace chars
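
To see how the three parts combine, here is a minimal sketch that runs the pattern with Python's `re` module against the three descriptions from the question. The helper name `find_word` and the list `descriptions` are made up for illustration; they are not part of the original question or answer.

```python
import re

# Pattern from the answer: a whitespace-delimited word of exactly
# 13 characters that contains "hydro".
PATTERN = re.compile(r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*")

# Sample descriptions taken from the question's dataframe.
descriptions = [
    "beta-glucosidase",
    "succinate-semialdehyde dehydrogenase",
    "probable epoxide hydrolase",
]


def find_word(text):
    """Return the matching 13-character word, or None (hypothetical helper)."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


for text in descriptions:
    print(f"{text!r}: {find_word(text)}")
```

Note that \S also matches the hyphen, so succinate-semialdehyde counts as one 22-character word and is skipped; only dehydrogenase satisfies the 13-character lookahead.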


pattern=r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*"
df = df.set_index('Desc_ORF').filter(regex=pattern, axis=0)
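
If moving Desc_ORF into the index is not wanted, a boolean mask might be an alternative. This is only a sketch: it swaps in `Series.str.contains` in place of `DataFrame.filter`, and the dataframe below is rebuilt by hand from the sample rows in the question, so the real data may differ.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about the real frame).
df = pd.DataFrame(
    {
        "Desc_ORF": [
            "beta-glucosidase",
            "succinate-semialdehyde dehydrogenase",
            "probable epoxide hydrolase",
        ],
        "ORF": ["tb512", "tb111", "tb045"],
    }
)

pattern = r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*"

# True for rows whose description contains a 13-character word with "hydro".
mask = df["Desc_ORF"].str.contains(pattern, regex=True)

# Desc_ORF stays a regular column instead of becoming the index.
filtered = df[mask]
print(filtered)
```

With this approach the original column layout is preserved, which may be convenient if later steps expect Desc_ORF as a column.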
