Keeping only certain characters in a string using Python?

In my program I have a string like this:

ag ct oso gcota

Using Python, my goal is to get rid of the white space and keep only the a, t, c, and g characters. I understand how to get rid of the white space (I'm just using line = line.replace(" ", "")). But how can I get rid of the characters that I don't need when they could be any other letter in the alphabet?

A very elegant and fast way is to use regular expressions:

import re

s = 'ag ct oso gcota'  # named s so the built-in str is not shadowed
# Delete every character that is not a, t, c, or g
s = re.sub('[^atcg]', '', s)

# s is now 'agctgcta'

I might do something like:

start_string = 'ag ct oso gcota'  # example input from the question
chars_i_want = set('atcg')
final_string = ''.join(c for c in start_string if c in chars_i_want)

This is probably the easiest way to do this.


Another option would be to use str.translate to do the work:

import string
# Python 2 only: str.translate(table, deletechars); a table of None deletes without remapping
chars_to_remove = string.printable.translate(None,'acgt')
final_string = start_string.translate(None,chars_to_remove)

I'm not sure which would perform better. It'd need to be timed via timeit to know definitively.


Update: Timings!

import re
import string

def test_re(s,regex=re.compile('[^atgc]')):
    return regex.sub(s,'')

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s,chars_to_remove = string.printable.translate(None,'acgt')):
    return s.translate(None,chars_to_remove)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func))

Sadly (for me), regex wins on my machine:

test_re 0.901512145996
test_join1 6.00346088409
test_join2 3.66561293602
translate 1.0741918087

Did people test mgilson's test_re() function before upvoting? The arguments to re.sub() are reversed, so it was doing the substitution in an empty string, and it always returns an empty string.
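To make the argument-order point concrete, here is a small sketch comparing the two call orders on the string from the question. The variable names reversed_result and correct_result are just illustrative.

```python
import re

regex = re.compile('[^atgc]')
s = 'ag ct oso gcota'

# Reversed: s is treated as the replacement and '' as the input string,
# so there is nothing to match and the result is always empty.
reversed_result = regex.sub(s, '')

# Intended order is sub(repl, string): replace every non-atgc character in s with ''.
correct_result = regex.sub('', s)

print(repr(reversed_result))  # ''
print(repr(correct_result))   # 'agctgcta'
```

This would also explain why the reversed version looked so fast in the earlier timings: it never scanned the real input.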

I work in Python 3.4; str.translate() only takes one argument, a dict. Because there is overhead in building this dict, I moved it out of the function. To be fair, I also moved the regex compilation out of the function (this didn't make a noticeable difference).

import re
import string

regex=re.compile('[^atgc]')

chars_to_remove = string.printable.translate({ ord('a'): None, ord('c'): None, ord('g'): None, ord('t'): None })
cmap = {}
for c in chars_to_remove:
    cmap[ord(c)] = None

def test_re(s):
    return regex.sub('',s)

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s):
    return s.translate(cmap)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print(func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func)))

Here are the timings:

test_re 3.3141989699797705
test_join1 2.4452173250028864
test_join2 2.081048655003542
translate 1.9390292020107154
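As a side note, the deletion table built by the loop above could probably also be produced with the three-argument form of str.maketrans, which maps every character of its third argument to None. This is only a sketch: translate_maketrans and delete_table are hypothetical names, and it was not part of the timings above.

```python
import string

# Every printable ASCII character except the four we want to keep.
chars_to_remove = ''.join(c for c in string.printable if c not in 'acgt')

# Three-argument form: each character in the third argument is mapped to None.
delete_table = str.maketrans('', '', chars_to_remove)

def translate_maketrans(s):
    """Delete all printable ASCII characters other than a, c, g, t."""
    return s.translate(delete_table)

print(translate_maketrans('ag ct oso gcota'))  # agctgcta
```

Like the loop-built table, this only deletes printable ASCII characters; anything outside string.printable passes through unchanged.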

It's too bad str.translate() doesn't have an option to control what to do with characters that aren't in the map. The current implementation is to keep them, but we could just as well have the option to remove them, in cases where the characters we want to keep are far fewer than the ones we want to remove (oh hello, Unicode).
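One possible workaround, sketched here under the assumption that str.translate() looks up each code point by ordinary indexing (so a dict subclass gets its __missing__ hook called), is a table that returns None for anything it has no entry for. The names KeepOnly and keep_only are hypothetical, and this variant has not been timed against the functions above.

```python
class KeepOnly(dict):
    """Translation table that deletes every character it has no entry for."""

    def __init__(self, chars_keep):
        # Map each kept character's code point to itself.
        super().__init__((ord(c), c) for c in chars_keep)

    def __missing__(self, key):
        # Called for any code point not in the table; None means "delete".
        return None


keep_acgt = KeepOnly('acgt')

def keep_only(s, table=keep_acgt):
    """Return s with every character outside the table removed."""
    return s.translate(table)

print(keep_only('ag ct oso gcota'))  # agctgcta
```

Because the table lists only the characters to keep, it stays at four entries no matter how large the set of characters to remove is, including non-ASCII input.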
