
Best way to strip everything except numbers and decimal points from a string in Python

I'm reading an ASCII data stream with Python 2.7 that includes non-negative numbers with decimal places but also "garbage characters": non-printables, letters, and punctuation. I can strip out the non-printables this way:

import string
rawdata2 = filter(lambda x: x in string.printable, rawdata)

but that leaves a string like this:

Ri-G2015,2,20.23,9.13,273.1- ZW;w;K-;-A;B`R

What's a good way to strip out everything except numbers and decimal points (.) so I'm left with this:

2015,2,20.23,9.13,273.1

string.printable is just a string. You can use your own string in its place, like:

rawdata2 = filter(lambda x: x in ',.0123456789', rawdata)

Note that I included a comma, because your expected output also includes commas.
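For readers on Python 3, a rough equivalent of the filter approach is sketched below. The thread itself targets Python 2.7, where filter() on a str returns a str; in Python 3 it returns a lazy iterator, so the characters have to be joined back together. The helper name keep_only is made up for illustration.

```python
# Python 3 sketch (the thread itself targets Python 2.7).
KEEP = ",.0123456789"  # same whitelist as in the answer above

def keep_only(rawdata, keep=KEEP):
    # filter() yields characters lazily in Python 3; join them back into a str
    return "".join(filter(lambda ch: ch in keep, rawdata))

sample = "Ri-G2015,2,20.23,9.13,273.1- ZW;w;K-;-A;B`R"
print(keep_only(sample))  # 2015,2,20.23,9.13,273.1
```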

A faster approach is to use regular expressions:

import re

rawdata2 = re.sub('[^0-9,.]', '', rawdata)

This simply deletes any character that is not a digit (0-9), a comma, or a period, by replacing it with an empty string. It is over twice as fast as the filter approach on 100 repetitions of your input string, and is more concise.
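The speed comparison can be checked with a small timeit harness. This is a Python 3 sketch (the thread targets Python 2.7), the function names are made up for illustration, and absolute timings will vary with the machine and interpreter, so no expected numbers are shown.

```python
import re
import timeit

sample = "Ri-G2015,2,20.23,9.13,273.1- ZW;w;K-;-A;B`R"
rawdata = sample * 100  # 100 repetitions of the input, as in the answer

NOT_WANTED = re.compile(r"[^0-9,.]")  # compiled once, reused on every call

def with_filter(text):
    return "".join(filter(lambda ch: ch in ",.0123456789", text))

def with_regex(text):
    return NOT_WANTED.sub("", text)

# Both approaches must agree before their timings are worth comparing.
assert with_filter(rawdata) == with_regex(rawdata)

for func in (with_filter, with_regex):
    seconds = timeit.timeit(lambda: func(rawdata), number=200)
    print(func.__name__, round(seconds, 4))
```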


The fastest approach (if you're processing a lot of text) is to use string.translate:

import string

deltable = "".join(chr(c) for c in xrange(256) if chr(c) not in "0123456789,.")

rawdata2 = string.translate(rawdata, None, deltable)

This is over 100x faster than your original filter approach.
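string.translate and xrange no longer exist in Python 3. A possible Python 3 counterpart, assuming the stream is read as bytes, is bytes.translate with a table of None and a delete argument. The names KEEP, DELETE, and strip_bytes are illustrative only.

```python
# Python 3 sketch: bytes.translate(None, delete) drops every byte in `delete`.
KEEP = b"0123456789,."
# Every byte value that is NOT a digit, comma, or period gets deleted.
DELETE = bytes(b for b in range(256) if b not in KEEP)

def strip_bytes(rawdata: bytes) -> bytes:
    return rawdata.translate(None, DELETE)

sample = b"Ri-G2015,2,20.23,9.13,273.1- ZW;w;K-;-A;B`R\x00\x07"
print(strip_bytes(sample))  # b'2015,2,20.23,9.13,273.1'
```

Because the delete table is built once, the per-line work is a single C-level pass, which is consistent with the speed claim above, though actual ratios depend on the input.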

import string
keepchars = string.digits + ",."  # the characters you want to keep
rawdata2 = filter(lambda x: x in keepchars, rawdata)

I'd go with this since it seems like you want to whitelist chars. If instead you decide you want to blacklist chars, string.translate() might be a good place to look.
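The blacklist variant might look like the following in Python 3, where str.maketrans with a third argument builds a table that deletes the listed characters. The BLACKLIST contents here are a hypothetical choice for illustration, not something the thread specifies.

```python
# Python 3 sketch of blacklisting with str.translate.
BLACKLIST = ";-` "  # hypothetical set of characters to drop
TABLE = str.maketrans("", "", BLACKLIST)  # third argument: characters to delete

def drop_blacklisted(text):
    return text.translate(TABLE)

print(drop_blacklisted("273.1- ZW;w;K-;-A;B`R"))  # 273.1ZWwKABR
```

Note that letters survive here because they are not on the blacklist, which is why a whitelist fits the original question better.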

I love regular expressions; they're elegant. But since I don't know re...

In [45]: "".join([i for i in mystring if i=="." or i.isdigit() or i==','])
Out[45]: '2015,2,20.23,9.13,273.1'

Thanks everyone. My program doesn't need to be speedy since it's processing just one line every few minutes, but I appreciate learning about the efficiency of the different approaches. I ended up using the following two lines:

include = set('0' '1' '2' '3' '4' '5' '6' '7' '8' '9' '.' ',')

and then

cleandata1 = ''.join(ch for ch in rawdata if ch in include)

Later I inserted a third line to save the garbage characters for inspection:

garbage = ''.join(ch for ch in rawdata if ch not in include) 
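If both the cleaned data and the garbage are wanted, the two comprehensions can be folded into a single pass over the input. This is only a sketch; split_clean and INCLUDE are illustrative names, and the behavior is meant to match the two separate lines above.

```python
INCLUDE = frozenset("0123456789.,")  # characters to keep

def split_clean(rawdata):
    """Return (kept, garbage) after one pass over rawdata."""
    kept, garbage = [], []
    for ch in rawdata:
        # Route each character to exactly one of the two lists.
        (kept if ch in INCLUDE else garbage).append(ch)
    return "".join(kept), "".join(garbage)

clean, junk = split_clean("Ri-G2015,2,20.23,9.13,273.1- ZW;w;K-;-A;B`R")
print(clean)  # 2015,2,20.23,9.13,273.1
print(junk)   # Ri-G- ZW;w;K-;-A;B`R
```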
