Deleting all non-letter characters from a string fast, python

I have the following question. I am trying to remove all non-letter characters from a string: digits (string.digits), punctuation marks (string.punctuation), and non-ASCII characters (such as φ, χ, ψ and so on). This can be done easily with a simple loop like:

import string

data1 = my_string
for i in my_string:
    if i not in string.ascii_letters:
        data1 = data1.replace(i, "")

or by using filter. However, my problem is that my string is about 20,000,000 characters long (several books concatenated together). With 3,000,000 characters, the loop above took about 20 minutes, so I did not dare to try it with 20,000,000 characters. Can you please tell me if there is a really fast way to do this?

Something like this might improve performance, since you don't keep a duplicate copy of your very long string in RAM:

data1 = (c for c in my_string if c in string.ascii_letters)
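As a minimal sketch of how the generator above could be turned back into a string: the helper name `keep_ascii_letters` and the `LETTERS` set are assumptions made for illustration, not part of the original answer. Testing membership against a frozenset instead of the 52-character string `string.ascii_letters` avoids a linear scan for every character.

```python
import string

# Hypothetical constant: a set makes each "is this a letter?" check O(1).
LETTERS = frozenset(string.ascii_letters)

def keep_ascii_letters(text):
    # The generator yields only ASCII letters; "".join builds the result
    # in a single pass instead of calling str.replace once per character.
    return "".join(c for c in text if c in LETTERS)

print(keep_ascii_letters("abc123, φχψ def!"))  # abcdef
```

Because the generator is created inside the function, each call gets a fresh one, which matters when the call is repeated or timed.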

YMMV, but on my system it takes about 6 s to filter a 20 MB file containing random bytes (including the "".join(...) operation required to get back a string):

>>> data1 = (c for c in my_string if chr(ord(c)) in string.ascii_letters)
>>> timeit.timeit('"".join(data1)', setup='from __main__ import data1')
5.96341991424560

RegExp substitution took way more time:

>>> timeit.timeit('re.sub("[^a-zA-Z]","",my_string)', setup='from __main__ import my_string; import re')
... still running after 90+ minutes...
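One caveat when reading the two timings above: timeit.timeit runs the statement 1,000,000 times by default, and a generator is exhausted after its first "".join, so the generator figure covers essentially one real pass while the re.sub call is repeated a million times. A sketch of a like-for-like comparison follows. The synthetic input, the function names and the `time_once` helper are all assumptions for illustration, and the numbers it prints will vary by machine.

```python
import random
import re
import string
import timeit

# Synthetic input (an assumption): 200,000 random characters drawn from
# letters, digits, punctuation, spaces and a few Greek letters.
random.seed(0)
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " φχψ"
my_string = "".join(random.choice(ALPHABET) for _ in range(200_000))

LETTERS = frozenset(string.ascii_letters)

def filter_with_join(text):
    # A fresh generator is built on every call, so repeated timing is valid.
    return "".join(c for c in text if c in LETTERS)

def filter_with_regex(text):
    return re.sub("[^a-zA-Z]", "", text)

def time_once(func, text):
    # number=1 times a single call instead of timeit's default 1,000,000.
    return timeit.timeit(lambda: func(text), number=1)

join_seconds = time_once(filter_with_join, my_string)
regex_seconds = time_once(filter_with_regex, my_string)
print(f"join:  {join_seconds:.4f} s")
print(f"regex: {regex_seconds:.4f} s")
```

Both functions should return the same string for the same input, so any difference in the printed times reflects the method rather than the workload.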

I think regular expressions are made for exactly this kind of thing...

re.sub("[^a-zA-Z]","",my_string)
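A small variation on the call above, offered as a sketch rather than a measured improvement: the pattern is compiled once and given a `+` so that each run of unwanted characters is handled as a single match. The names `NON_LETTERS` and `strip_non_letters` are hypothetical.

```python
import re

# Match runs of one or more characters that are not ASCII letters.
NON_LETTERS = re.compile(r"[^a-zA-Z]+")

def strip_non_letters(text):
    # Delete every matched run; only a-z and A-Z remain.
    return NON_LETTERS.sub("", text)

sample = "Hello, World! 123 φχψ"
print(strip_non_letters(sample))  # HelloWorld
```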

