How to use regex in Python with varying amounts of whitespace

Question

I'm trying to reformat my data from this:

gi|492845765|ref|WP_005999719.1| DNA methyltransferase [[Eubacterium] infirmum]

into

[[Eubacterium]infirmum]gi|492845765|

That is, I just want to keep the gi number and the organism name (with the organism name in front of the gi number), and get rid of the "extra" information (in this case, the ref number and "DNA methyltransferase").

I would do re.sub(r"(\w+ |\w + |) \w+|\w_\w|\s\w+\s\w\s ([.])", r"\2\1", line)

(or something remotely like that)

However, some other lines of my data have more than two words in the "extra" information. For example:

gi|548229945|ref|WP_022448665.1| dNA (Cytosine-5-)-methyltransferase [Roseburia sp. CAG:303]

How would I write a regex that rewrites every line of my data so that the organism name comes first, the gi number comes next, and everything else is deleted?

Answer 1

This would probably do what you're asking:

(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])

Use \2\3\1 as the replacement pattern in Python; on regex101, $2$3$1 seems to work the same.

re.sub(r'(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])', r'\2\3\1', line)

example: http://regex101.com/r/aP6lB9
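
For reference, here is a minimal sketch of running that pattern over the two sample lines from the question. The helper name reformat and the list sample_lines are made up for illustration; the pattern and the replacement are the ones given above.

```python
import re

# Pattern from the answer:
#   group 1 = the leading "gi|<digits>|" field
#   group 2 = the organism name up to its first whitespace
#   group 3 = the rest of the organism name, through the closing "]"
# The two non-capturing groups swallow the text that should be dropped.
PATTERN = re.compile(r'(\w+\|\d+\|)(?:.*\s)(\[\S*)(?:\s)(.+\])')


def reformat(line):
    """Hypothetical helper: put the organism name first, then the gi field."""
    return PATTERN.sub(r'\2\3\1', line)


# Sample lines copied from the question.
sample_lines = [
    "gi|492845765|ref|WP_005999719.1| DNA methyltransferase [[Eubacterium] infirmum]",
    "gi|548229945|ref|WP_022448665.1| dNA (Cytosine-5-)-methyltransferase [Roseburia sp. CAG:303]",
]

for line in sample_lines:
    print(reformat(line))
# [[Eubacterium]infirmum]gi|492845765|
# [Roseburiasp. CAG:303]gi|548229945|
```

Note that only the first space inside the organism name is removed, which matches the output asked for in the question. A line whose bracketed name contains no whitespace at all would not match and would be returned unchanged, so it may be worth checking for such lines in the real data.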

solution1
2 ACCEPTED 2014-02-06 22:10:59