Regex expression: number of characters containing a pattern
I have a dataframe with the following structure:
Desc_ORF | ORF |
---|---|
beta-glucosidase | tb512 |
succinate-semialdehyde dehydrogenase | tb111 |
probable epoxide hydrolase | tb045 |
I am using this function to filter the dataframe:
df.set_index('Desc_ORF').filter(regex=pattern, axis=0)
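For reference, here is a minimal reproduction of this setup. The DataFrame construction and the plain pattern `hydro` are illustrative assumptions, not part of the original post. With `regex=`, `DataFrame.filter` keeps the index labels for which `re.search` finds a match, so a plain `hydro` keeps both rows that contain it anywhere:

```python
import pandas as pd

# Rebuild the example table by hand (assumed construction).
df = pd.DataFrame({
    "Desc_ORF": [
        "beta-glucosidase",
        "succinate-semialdehyde dehydrogenase",
        "probable epoxide hydrolase",
    ],
    "ORF": ["tb512", "tb111", "tb045"],
})

# Illustrative pattern: matches any label containing "hydro".
simple_pattern = r"hydro"

# axis=0 applies the regex to the index labels (here, Desc_ORF).
filtered = df.set_index("Desc_ORF").filter(regex=simple_pattern, axis=0)
print(filtered)
```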
It works fine with the other patterns I have tried, but I cannot find a regex pattern that keeps only the rows where Desc_ORF contains hydro inside a word of exactly 13 characters.
For example: my code should keep the row succinate-semialdehyde dehydrogenase because it contains the word dehydrogenase, which has 13 characters and contains the pattern hydro. On the other hand, the filter must discard probable epoxide hydrolase because, although it contains hydro, the word hydrolase does not have 13 characters.
Desc_ORF | ORF |
---|---|
succinate-semialdehyde dehydrogenase | tb111 |
I have tried different patterns and my last try has been: ^(?={13}$)(\b\S*hydro\S*\b). With this pattern I am only filtering by words that contain hydro, but I cannot restrict the match to words that contain hydro and are 13 characters long.
One option to match the word in the second line could be:
(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*
(?<!\S) Assert a whitespace boundary on the left
(?=\S{13}(?!\S)) Assert 13 non-whitespace chars followed by a whitespace boundary
\S*hydro\S* Match hydro between optional non-whitespace chars
pattern = r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*"
df = df.set_index('Desc_ORF').filter(regex=pattern, axis=0)
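As a quick check, the pattern can be run end to end against the three example rows. This is a sketch under assumptions: the DataFrame is rebuilt by hand, and the names `result`, `kept`, and `match` are illustrative. The `str.contains` variant is an alternative not mentioned in the post, shown in case Desc_ORF should stay a regular column instead of becoming the index.

```python
import re

import pandas as pd

# Rebuild the example table by hand (assumed construction).
df = pd.DataFrame({
    "Desc_ORF": [
        "beta-glucosidase",
        "succinate-semialdehyde dehydrogenase",
        "probable epoxide hydrolase",
    ],
    "ORF": ["tb512", "tb111", "tb045"],
})

# A whitespace-delimited word of exactly 13 characters that contains "hydro".
pattern = r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*"

# Same approach as in the post: filter on the index labels.
result = df.set_index("Desc_ORF").filter(regex=pattern, axis=0)
print(result)

# Alternative (assumption): boolean mask, keeps Desc_ORF as a column.
kept = df[df["Desc_ORF"].str.contains(pattern, regex=True)]
print(kept)

# The word that satisfies the pattern in the surviving row.
match = re.search(pattern, "succinate-semialdehyde dehydrogenase")
print(match.group(0), len(match.group(0)))
```

Only succinate-semialdehyde dehydrogenase survives: hydrolase has 9 characters, and the 13-character lookahead fails at every word start in the other two rows.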