
How to use regex in Python with varying amounts of whitespace

I'm trying to reformat my data from this:

gi|492845765|ref|WP_005999719.1| DNA methyltransferase [[Eubacterium] infirmum]

into

[[Eubacterium]infirmum]gi|492845765|

That is, I just want to keep the gi number and the organism name (with the organism name in front of the gi number), and get rid of the "extra" information (in this case, the ref number and "DNA methyltransferase").

I would do re.sub(r"(\\w+ |\\w + |) \\w+|\\w_\\w|\\s\\w+\\s\\w\\s ([.]), \\2\\1, line)

(or something remotely like that)

However, some other lines of my data have more than two words in the "extra" information. For example:

gi|548229945|ref|WP_022448665.1| dNA (Cytosine-5-)-methyltransferase [Roseburia sp. CAG:303]

How would I write a regular expression to rename all of my data so that the organism name is in front, the gi number comes next, and everything else is deleted?

This would probably do what you're asking:

(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])

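In Python, use \\2\\3\\1 as the replacement pattern in a normal string, or \2\3\1 in a raw string. On regex101, $2$3$1 seems to work the same, but Python's re module does not accept the $ form.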

re.sub(r'(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])', r'\2\3\1', line)

example: http://regex101.com/r/aP6lB9
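
To check the pattern against both sample lines from the question, here is a minimal sketch. The helper name reformat_header and the samples list are only for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: group 1 is the leading gi field, groups 2 and 3
# are the bracketed organism name, split at its first whitespace character.
PATTERN = re.compile(r'(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])')


def reformat_header(line):
    """Hypothetical helper: put the organism name in front of the gi field."""
    # Raw-string replacement so \2, \3 and \1 are read as group references.
    return PATTERN.sub(r'\2\3\1', line)


samples = [
    "gi|492845765|ref|WP_005999719.1| DNA methyltransferase [[Eubacterium] infirmum]",
    "gi|548229945|ref|WP_022448665.1| dNA (Cytosine-5-)-methyltransferase [Roseburia sp. CAG:303]",
]

for sample in samples:
    print(reformat_header(sample))
# [[Eubacterium]infirmum]gi|492845765|
# [Roseburiasp. CAG:303]gi|548229945|
```

The pattern drops the first space inside the brackets, which matches the desired output shown in the question. It also expects at least one whitespace character inside the organism name, so a name with no space between the brackets would likely fail to match, and the line would come back unchanged.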
