
Keeping only certain characters in a string using Python?

In my program I have a string like this:

ag ct oso gcota

Using Python, my goal is to get rid of the white space and keep only the a, t, c, and g characters. I understand how to get rid of the white space (I'm just using line = line.replace(" ", "")). But how can I get rid of the characters that I don't need when they could be any other letter in the alphabet?

A very elegant and fast way is to use regular expressions:

import re

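line = 'ag ct oso gcota'
# Remove every character that is not a, t, c, or g
# (named line rather than str to avoid shadowing the built-in str type)
line = re.sub('[^atcg]', '', line)

# line is now 'agctgcta'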

I might do something like:

chars_i_want = set('atcg')
final_string = ''.join(c for c in start_string if c in chars_i_want)

This is probably the easiest way to do this.


Another option would be to use str.translate to do the work:

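import string
# Python 2 only: str.translate(table, deletechars); a table of None means "delete only".
# Note that string.printable covers printable ASCII characters only.
chars_to_remove = string.printable.translate(None,'acgt')
final_string = start_string.translate(None,chars_to_remove)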

I'm not sure which would perform better. It'd need to be timed via timeit to know definitively.
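The str.translate(None, ...) call above uses the Python 2 signature. As a minimal sketch of a Python 3 equivalent, assuming the input contains only printable ASCII, the three-argument form of str.maketrans can build the deletion table. The value of start_string here is just sample input:

```python
import string

# Sample input (an assumption for illustration).
start_string = 'ag ct oso gcota'

# Every printable ASCII character except the ones we want to keep.
chars_to_remove = ''.join(c for c in string.printable if c not in 'acgt')

# The third argument of str.maketrans lists characters to map to None (i.e. delete).
delete_table = str.maketrans('', '', chars_to_remove)

final_string = start_string.translate(delete_table)
print(final_string)  # agctgcta
```

Characters outside string.printable (for example accented letters) are not in the table and would be kept, so this only suits ASCII input.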


Update: Timings!

import re
import string

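def test_re(s,regex=re.compile('[^atgc]')):
    # Bug, kept as posted: the arguments are reversed, so this always returns ''.
    # The correct call is regex.sub('', s).
    return regex.sub(s,'')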

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s,chars_to_remove = string.printable.translate(None,'acgt')):
    return s.translate(None,chars_to_remove)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func))

Sadly (for me), regex wins on my machine:

test_re 0.901512145996
test_join1 6.00346088409
test_join2 3.66561293602
translate 1.0741918087

Did people test mgilson's test_re() function before upvoting? The arguments to re.sub() are reversed, so it was doing the substitution in an empty string and always returned an empty string.
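To see the effect concretely, here is a small sketch (the names buggy and fixed are only for illustration). A compiled pattern's sub() takes the replacement first and the string to search second:

```python
import re

regex = re.compile('[^atgc]')
s = 'ag ct oso gcota'

# Reversed arguments: s is used as the replacement and '' as the text to search,
# so there is nothing to match and the result is always ''.
buggy = regex.sub(s, '')

# Correct order: replacement first, then the string to process.
fixed = regex.sub('', s)

print(repr(buggy), repr(fixed))  # '' 'agctgcta'
```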

I work in Python 3.4, where str.translate() takes only one argument, a mapping such as a dict. Because there is overhead in building this dict, I moved it out of the function. To be fair, I also moved the regex compilation out of the function (this didn't make a noticeable difference).

import re
import string

regex=re.compile('[^atgc]')

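# Printable ASCII characters other than a, c, g, and t
chars_to_remove = string.printable.translate({ ord('a'): None, ord('c'): None, ord('g'): None, ord('t'): None })
# Map each unwanted code point to None so that str.translate() deletes it
cmap = {}
for c in chars_to_remove:
    cmap[ord(c)] = None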

def test_re(s):
    return regex.sub('',s)

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s):
    return s.translate(cmap)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print(func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func)))

Here are the timings:

test_re 3.3141989699797705
test_join1 2.4452173250028864
test_join2 2.081048655003542
translate 1.9390292020107154

It's too bad str.translate() doesn't have an option to control what happens to characters that aren't in the map. The current behavior is to keep them, but an option to remove them instead would help in cases where the characters we want to keep are far fewer than the ones we want to remove (oh hello, Unicode).
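One possible workaround, offered as a sketch rather than a built-in option: str.translate() looks each character up by indexing the table with its code point, so a dict subclass whose __missing__ returns None causes every unmapped character to be deleted. The class name KeepOnly is hypothetical, and the approach relies on dict subclasses honoring __missing__ during item lookup:

```python
class KeepOnly(dict):
    """Translation table that keeps the given characters and deletes all others."""

    def __init__(self, chars):
        # Map each character we want to keep to itself.
        super().__init__((ord(c), c) for c in chars)

    def __missing__(self, key):
        # Any code point not in the table is mapped to None, i.e. deleted.
        return None


keep_acgt = KeepOnly('acgt')
print('ag ct oso gcota'.translate(keep_acgt))  # agctgcta
```

Because __missing__ does not store anything, the table stays at four entries no matter how many distinct characters it sees, which is the point when the input may contain arbitrary Unicode.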
