Keep only certain characters in a string using Python?

Question

In my program I have a string like this:

ag ct oso gcota

Using Python, my goal is to get rid of the whitespace and keep only the a, t, c, and g characters. I understand how to get rid of the whitespace (I'm just using line = line.replace(" ", "")). But how can I get rid of the characters that I don't need when they could be any other letter in the alphabet?

Answer 1

A very elegant and fast way is to use regular expressions:

import re

# Named "line" rather than "str" to avoid shadowing the built-in type
line = 'ag ct oso gcota'
line = re.sub('[^atcg]', '', line)

# line is now 'agctgcta'

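As a small, self-contained sketch of the same idea (not part of the original answer), the pattern can be compiled once and wrapped in a function. The names NON_ATCG and keep_atcg are made up for illustration, and the sketch assumes only lowercase a, t, c, and g should survive.

```python
import re

# Matches any single character that is NOT one of a, t, c, g.
NON_ATCG = re.compile(r'[^atcg]')

def keep_atcg(text):
    """Return text with every character other than a, t, c, g removed."""
    return NON_ATCG.sub('', text)

print(keep_atcg('ag ct oso gcota'))  # agctgcta
```

Note that the character class is case-sensitive, so uppercase letters would be dropped as well unless they are added to the class.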
Answer 2

I might do something like:

chars_i_want = set('atcg')
final_string = ''.join(c for c in start_string if c in chars_i_want)

This is probably the easiest way to do this.
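
The snippet above assumes start_string already exists. A runnable version might look like the sketch below, where the function name filter_chars and its allowed parameter are hypothetical names chosen for this example.

```python
def filter_chars(text, allowed='atcg'):
    """Keep only the characters of text that appear in allowed."""
    allowed_set = set(allowed)  # set membership tests are O(1) on average
    return ''.join(c for c in text if c in allowed_set)

start_string = 'ag ct oso gcota'
final_string = filter_chars(start_string)
print(final_string)  # agctgcta
```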

Another option would be to use str.translate to do the work:

import string
# Python 2 only: str.translate(table, deletechars) accepts None as the table
chars_to_remove = string.printable.translate(None,'acgt')
final_string = start_string.translate(None,chars_to_remove)

I'm not sure which would perform better. It'd need to be timed via timeit to know definitively.

Update: Timings!

import re
import string

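def test_re(s,regex=re.compile('[^atgc]')):
    # Note: the arguments are reversed (it should be regex.sub('', s)),
    # so this always returns ''; see Answer 3.
    return regex.sub(s,'')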

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s,chars_to_remove = string.printable.translate(None,'acgt')):
    return s.translate(None,chars_to_remove)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func))

Sadly (for me), regex wins on my machine:

test_re 0.901512145996
test_join1 6.00346088409
test_join2 3.66561293602
translate 1.0741918087

Answer 3

Did people test mgilson's test_re() function before upvoting? The arguments to re.sub() are reversed, so it was doing substitution in an empty string and always returned an empty string.

I work in Python 3.4; str.translate() only takes one argument, a dict. Because there is overhead in building this dict, I moved it out of the function. To be fair, I also moved the regex compilation out of the function (this didn't make a noticeable difference).

import re
import string

regex=re.compile('[^atgc]')

chars_to_remove = string.printable.translate({ ord('a'): None, ord('c'): None, ord('g'): None, ord('t'): None })
cmap = {}
for c in chars_to_remove:
    cmap[ord(c)] = None

def test_re(s):
    return regex.sub('',s)

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s):
    return s.translate(cmap)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print(func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func)))

Here are the timings:

test_re 3.3141989699797705
test_join1 2.4452173250028864
test_join2 2.081048655003542
translate 1.9390292020107154
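
For reference, Python 3 also offers str.maketrans, whose third argument lists characters to delete, so the table could probably be built without the explicit loop. This is only a sketch and was not part of the timings above; delete_table and translate_keep_acgt are names invented for the example.

```python
import string

# Every printable ASCII character except the four we want to keep.
chars_to_remove = ''.join(c for c in string.printable if c not in 'acgt')

# With three arguments, characters in the third string are mapped to None,
# which str.translate interprets as "delete".
delete_table = str.maketrans('', '', chars_to_remove)

def translate_keep_acgt(s):
    return s.translate(delete_table)

print(translate_keep_acgt('ag ct oso gcota'))  # agctgcta
```

Like the dict-based version, this only removes characters listed in string.printable; anything outside that set (for example accented letters) passes through unchanged.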

It's too bad str.translate() doesn't have an option to control what to do with characters that aren't in the map. The current implementation keeps them, but we could just as well have the option to remove them, for cases where the characters we want to keep are far fewer than the ones we want to remove (oh hello, Unicode).
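
One possible workaround, which is not from the original answer, relies on the fact that str.translate looks characters up through ordinary indexing. A dict subclass that defines __missing__ can therefore answer None (meaning "delete") for every character it does not list. The class name KeepOnly and the helper keep_acgt below are hypothetical, and this sketch has not been timed against the other approaches.

```python
class KeepOnly(dict):
    """Translation table that deletes every character it does not list."""

    def __init__(self, keep):
        # Map each kept character's code point to itself.
        super().__init__((ord(c), ord(c)) for c in keep)

    def __missing__(self, key):
        # str.translate treats None as "delete this character".
        return None

KEEP_ACGT = KeepOnly('acgt')

def keep_acgt(s):
    return s.translate(KEEP_ACGT)

print(keep_acgt('ag ct oso gcota'))  # agctgcta
print(keep_acgt('a\u00f1g\u2192t'))  # agt
```

Because unlisted code points are deleted by default, the table stays at four entries regardless of how many distinct Unicode characters appear in the input.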


Answer 1
15, accepted, 2013-04-02 01:27:17

Answer 2
4, 2013-04-02 01:21:55

Answer 3
2, 2014-10-18 18:30:15
