
How to use regex in python with varying amounts of white spaces

I'm trying to reformat my data from this:

gi|492845765|ref|WP_005999719.1| DNA methyltransferase [[Eubacterium] infirmum]

into

[[Eubacterium]infirmum]gi|492845765|

That is, I just want to keep the gi number and the organism name (with the organism name in front of the gi number), and get rid of the "extra" information (in this case, the ref number and "DNA methyltransferase").

I would do re.sub(r"(\w+ |\w + |) \w+|\w_\w|\s\w+\s\w\s ([.]), \2\1, line)

(or something remotely like that)

However, some other lines of my data have more than two words in the "extra" information. Example:

gi|548229945|ref|WP_022448665.1| dNA (Cytosine-5-)-methyltransferase [Roseburia sp. CAG:303]

How would I write a regular expression to rename all of my data so that the organism name is in front, the gi number comes next, and everything else is deleted?

This would probably do what you're asking:

(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])

Use \2\3\1 as the replacement pattern. $2$3$1 seems to work the same in some regex testers, but Python's re.sub only recognizes the backslash form (\2 or \g<2>).

re.sub(r'(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])', r'\2\3\1', line)

Example: http://regex101.com/r/aP6lB9
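
To see what each group picks up, here is a minimal runnable sketch of the same pattern written with re.VERBOSE and applied to the two sample lines from the question. The helper name reformat and the lines list are made up for illustration; the pattern and the \2\3\1 replacement are the ones given above.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be commented.
PATTERN = re.compile(r"""
    (\w+\|\d+\|)   # group 1: the gi field, e.g. gi|492845765|
    (?:.*\s)       # skipped: ref field and description, up to the whitespace before the bracket
    (\[\S*)        # group 2: opening bracket plus the first word of the organism name
    (?:\s)         # skipped: the single whitespace after that first word
    (.+\])         # group 3: the rest of the organism name, through the last closing bracket
""", re.VERBOSE)


def reformat(line):
    """Hypothetical helper: put the organism name first, then the gi field."""
    # The replacement must be a (raw) string; a bare \2\3\1 is a syntax error.
    return PATTERN.sub(r"\2\3\1", line)


# The two sample lines from the question.
lines = [
    "gi|492845765|ref|WP_005999719.1| DNA methyltransferase [[Eubacterium] infirmum]",
    "gi|548229945|ref|WP_022448665.1| dNA (Cytosine-5-)-methyltransferase [Roseburia sp. CAG:303]",
]

for line in lines:
    print(reformat(line))
```

Because the whitespace after the first word of the name sits in a non-capturing group, it is dropped: the first line becomes [[Eubacterium]infirmum]gi|492845765| (the output asked for in the question) and the second becomes [Roseburiasp. CAG:303]gi|548229945|. The greedy .* is what absorbs a description of any length, so the number of words or spaces in the "extra" part does not matter.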
