Deleting all non-letter characters from a string fast, python
I have the following question. I am trying to remove all non-letter characters from a string: digits (string.digits), punctuation marks (string.punctuation), and non-ASCII characters (such as φ, χ, ψ and so on). This can be done easily with a simple loop like:
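import string
data1 = my_string  # my_string holds the input text
for i in my_string:
    if i not in string.ascii_letters:
        # strip this character from the running result, not from the original
        data1 = data1.replace(i, "")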
or by using filter. However, my problem is that my string is about 20,000,000 characters long (several books concatenated together). With 3,000,000 characters, the loop above took about 20 minutes, so I did not dare to try it with 20,000,000 characters. Can you please tell me if there is a really fast way to do this?
Something like this might improve performance, as you don't keep a duplicate copy of your very long string in RAM:
data1 = (c for c in my_string if c in string.ascii_letters)
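The generator above yields characters lazily, so it still has to be joined to get a string back. A minimal sketch of the full round trip follows; the function name `keep_ascii_letters` and the use of a `frozenset` for membership tests are assumptions added here for illustration, not part of the original answer.

```python
import string

# Assumption: a frozenset is used for constant-time membership tests;
# the original answer tests against the string.ascii_letters string directly.
ASCII_LETTERS = frozenset(string.ascii_letters)

def keep_ascii_letters(text):
    """Return text with every character that is not an ASCII letter removed."""
    return "".join(c for c in text if c in ASCII_LETTERS)

sample = "Hello, World! 123 φχψ"
print(keep_ascii_letters(sample))  # HelloWorld
```

The filtering makes a single pass over the input, instead of one `replace` pass per unwanted character.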
YMMV, but on my system, it takes something like 6s to filter a 20MB file containing random bytes (incl. the "".join(...) operation required to get back a string):
>>> data1 = (c for c in my_string if chr(ord(c)) in string.ascii_letters)
>>> timeit.timeit('"".join(data1)', setup='from __main__ import data1')
5.96341991424560
RegExp substitution took far more time:
>>> timeit.timeit('re.sub("[^a-zA-Z]","",my_string)', setup='from __main__ import my_string; import re')
... still running after 90+ minutes...
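One caveat when reading these timings: `timeit.timeit` runs its statement 1,000,000 times by default, and a generator is exhausted after the first `"".join(...)`, so the two measurements may not reflect the same amount of work. A hedged sketch of a like-for-like comparison, running each approach once on the same input, could look like this. The helper names (`filter_join`, `regex_sub`, `time_once`) and the synthetic 200,000-character sample are hypothetical, and timings will vary by machine.

```python
import random
import re
import string
import timeit

ASCII_LETTERS = frozenset(string.ascii_letters)

def filter_join(text):
    # Generator filter followed by join, as in the answer above
    return "".join(c for c in text if c in ASCII_LETTERS)

def regex_sub(text):
    # Regular-expression substitution, as in the answer above
    return re.sub("[^a-zA-Z]", "", text)

def time_once(func, text):
    # number=1 runs the call a single time instead of the default 1,000,000
    return timeit.timeit(lambda: func(text), number=1)

# Hypothetical test data: 200,000 random printable ASCII characters
random.seed(0)
my_string = "".join(random.choice(string.printable) for _ in range(200_000))

# Both approaches should produce the same result before being compared
assert filter_join(my_string) == regex_sub(my_string)

for func in (filter_join, regex_sub):
    print(func.__name__, time_once(func, my_string))
```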
I think regular expressions are made for exactly this kind of thing...
re.sub("[^a-zA-Z]","",my_string)
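As a sketch of how the regular-expression version might be tuned, the pattern can be precompiled and given a `+` quantifier so that each run of non-letters is removed in one substitution. For comparison, the second function uses `bytes.translate`, a technique that does not appear in this thread and is included here only as an assumption about what may be faster on large inputs. The function names are hypothetical.

```python
import re
import string

# Precompiled pattern; "+" matches a whole run of non-letters at once
NON_LETTERS = re.compile(r"[^a-zA-Z]+")

def strip_with_regex(text):
    return NON_LETTERS.sub("", text)

# Every ASCII byte value that is not a letter
_DELETE = bytes(b for b in range(128) if chr(b) not in string.ascii_letters)

def strip_with_translate(text):
    # Encoding with errors="ignore" drops non-ASCII characters;
    # translate(None, _DELETE) then deletes the remaining non-letter bytes.
    return text.encode("ascii", "ignore").translate(None, _DELETE).decode("ascii")

sample = "Hello, World! 123 φχψ"
print(strip_with_regex(sample))      # HelloWorld
print(strip_with_translate(sample))  # HelloWorld
```

Note that the translate variant makes temporary encoded copies of the text, so it trades memory for speed.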