Regex: number of characters in a word containing a pattern

Question

I have a dataframe with the following structure:

Desc_ORF	ORF
beta-glucosidase	tb512
succinate-semialdehyde dehydrogenase	tb111
probable epoxide hydrolase	tb045

I am using this function to filter the dataframe:

df.set_index('Desc_ORF').filter(regex=pattern, axis=0)

It works perfectly with the other patterns I am trying, but I cannot come up with a regex pattern that keeps only the rows where Desc_ORF contains hydro inside a word of 13 characters.

For example: my code should keep the row succinate-semialdehyde dehydrogenase because it contains dehydrogenase, which has 13 characters and contains the pattern hydro. On the other hand, the filter must discard probable epoxide hydrolase because, although it contains hydro, the word hydrolase does not have 13 characters.

Desc_ORF	ORF
succinate-semialdehyde dehydrogenase	tb111

I have tried different patterns, and my last attempt was: ^(?={13}$)(\b\S*hydro\S*\b). With this pattern I am only filtering by words that contain hydro, but I cannot obtain words that contain hydro with a length of 13 characters.

Answer 1

One option to match the word in the second line could be:

(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*

(?<!\S) Assert a whitespace boundary on the left
(?=\S{13}(?!\S)) Assert 13 non-whitespace chars followed by a whitespace boundary
\S*hydro\S* Match hydro between optional non-whitespace chars
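
To see how the three parts work together, here is a minimal sketch that applies the pattern to the three descriptions from the question using only Python's `re` module. The names `PATTERN`, `descriptions`, `find_word` and `matches` are made up for this illustration and do not come from the question.

```python
import re

# The pattern from this answer, compiled once.
PATTERN = re.compile(r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*")

# Sample descriptions taken from the question's dataframe.
descriptions = [
    "beta-glucosidase",
    "succinate-semialdehyde dehydrogenase",
    "probable epoxide hydrolase",
]


def find_word(text):
    """Return the first 13-character word containing 'hydro', or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


# Map each description to the word that matched (None if nothing matched).
matches = {text: find_word(text) for text in descriptions}

for text, word in matches.items():
    print(f"{text!r} -> {word!r}")
```

Note that a "word" here means a run of non-whitespace characters, so a hyphenated token such as succinate-semialdehyde is measured as a single word.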

Regex demo

pattern=r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*"
df = df.set_index('Desc_ORF').filter(regex=pattern, axis=0)
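
For completeness, a self-contained sketch of the same filter on the sample data from the question might look like the following. It assumes pandas is installed; the dataframe is rebuilt by hand from the question's table, and the variable names `filtered` and `filtered_keep_column` are only illustrative. The `str.contains` variant at the end is an alternative not mentioned in the original answer, shown in case you prefer to keep Desc_ORF as a regular column.

```python
import pandas as pd

# Rebuild the sample dataframe from the question.
df = pd.DataFrame(
    {
        "Desc_ORF": [
            "beta-glucosidase",
            "succinate-semialdehyde dehydrogenase",
            "probable epoxide hydrolase",
        ],
        "ORF": ["tb512", "tb111", "tb045"],
    }
)

pattern = r"(?<!\S)(?=\S{13}(?!\S))\S*hydro\S*"

# Same approach as in the question: move Desc_ORF into the index and
# keep the rows whose index label matches the pattern.
filtered = df.set_index("Desc_ORF").filter(regex=pattern, axis=0)
print(filtered)

# Alternative (an assumption, not part of the original answer):
# boolean indexing keeps Desc_ORF as an ordinary column.
filtered_keep_column = df[df["Desc_ORF"].str.contains(pattern, regex=True)]
print(filtered_keep_column)
```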
