How to use regex or replace to clean up list?

This seems like a very obvious mistake, but I have been trying to solve it for almost an hour now. :(

lst = ['\xa0\xa0+11-9188882266\xa0\xa0+01-9736475634 ','\xa0\xa0+11-9177772266\xa0\xa0+01-9736475234']

I am trying to grab numbers, hyphens and the + sign only. Basically, remove all the \xa0.

I thought that regex would be the right way to go about it. Tried it and failed:

mRegex = (['+0-9-'])
lst = re.match(mRegex,lst)

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python34\lib\re.py", line 160, in match
    return _compile(pattern, flags).match(string)
  File "C:\Python34\lib\re.py", line 282, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'

I gave it a few more tries with regex, then switched to replace:

h.replace(r"\xa0","")

It doesn't do anything to lst. It stays exactly the same.

When I do len(lst[0]) I get 33, which is very odd.

In a loop:

for i in lst[0]:
    print(i)

the output doesn't show \xa0.

I am completely confused here.

First, you cannot apply a replacement or a regex to a list. You have to apply it to each string, and use a list comprehension to rebuild the cleaned-up list.

Second, your replace call uses the raw prefix, which you shouldn't use here: r"\xa0" is taken literally as the four characters \, x, a, 0, not as the single non-breaking space character you want to remove.
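To make the difference concrete, here is a small check (the names raw, escaped and s are only for illustration, they are not from the question):

```python
raw = r"\xa0"      # raw string: four characters (backslash, x, a, 0)
escaped = "\xa0"   # normal string: one character, the non-breaking space U+00A0

s = "\xa0\xa0+11-9188882266"

print(len(raw), len(escaped))        # 4 1
print(s.replace(raw, "") == s)       # True: the 4-character sequence never occurs in s
print(repr(s.replace(escaped, "")))  # '+11-9188882266'
```

This is also why len(lst[0]) is 33 and why printing the characters one by one never shows \xa0: each \xa0 is a single character that is displayed like a space, and only repr() shows it as an escape.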

I'd do:

lst = [x.replace("\xa0","") for x in lst]

which results in:

['+11-9188882266+01-9736475634 ', '+11-9177772266+01-9736475234']

And BTW: mRegex = (['+0-9-']) doesn't work because you're basically defining a list of 1 string (the outer parentheses do nothing), and re.match expects a string pattern and a single string to search, not lists. You probably meant mRegex = '([0-9\-+])'
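As a rough sketch of how a pattern like that could be used (the + quantifier and the names pattern and found are my own additions, not the OP's code): the re functions work on one string at a time, so the list still has to be looped over.

```python
import re

lst = ['\xa0\xa0+11-9188882266\xa0\xa0+01-9736475634 ',
       '\xa0\xa0+11-9177772266\xa0\xa0+01-9736475234']

# One or more digits, plus signs or hyphens in a row
pattern = re.compile(r"[0-9+\-]+")

# findall returns every run of allowed characters found in one string
found = [pattern.findall(s) for s in lst]
print(found)
# [['+11-9188882266', '+01-9736475634'], ['+11-9177772266', '+01-9736475234']]
```

Note that re.match would only look at the start of each string, where the \xa0 characters are, so findall (or search) is the more useful call here.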

A regex solution would be:

lst = [re.sub(r"[^\d+\-]","",x) for x in lst]

(this removes every character that does not match the character class; \d is (roughly) equivalent to 0-9)
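The "roughly" matters only for non-ASCII digits. A quick illustration, assuming Python 3 str patterns (the sample text is made up):

```python
import re

text = "12\u0663"   # "12" followed by ARABIC-INDIC DIGIT THREE

unicode_digits = re.findall(r"\d", text)                # \d matches any Unicode decimal digit
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)  # re.ASCII restricts \d to 0-9
only_0_9 = re.findall(r"[0-9]", text)                   # explicit class, ASCII digits only

print(len(unicode_digits), len(ascii_digits), len(only_0_9))  # 3 2 2
```

For phone numbers like the ones in the question it makes no difference, but [0-9] is the safer choice if the input may contain other scripts.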

After a few years I realize (after reading the OP's comment properly this time) that the expected result is probably the numbers separated in a list, so removing \xa0 isn't a good idea, because it runs the numbers together. Let's just use split on each string:

>>> lst = ['\xa0\xa0+11-9188882266\xa0\xa0+01-9736475634 ','\xa0\xa0+11-9177772266\xa0\xa0+01-9736475234']
>>> [x.split() for x in lst]
[['+11-9188882266', '+01-9736475634'], ['+11-9177772266', '+01-9736475234']]

Using split() works because, when called with no argument, it splits on any whitespace, and \xa0 (the non-breaking space) counts as whitespace for Python 3 strings. It also treats consecutive whitespace characters as a single separator and ignores leading and trailing whitespace, so the result is given straight away without further hassle.
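A short check of that behaviour, plus a flattened variant in case one flat list of numbers is wanted (the names nested and flat are mine, and the flattening is an assumption about what the OP may need):

```python
lst = ['\xa0\xa0+11-9188882266\xa0\xa0+01-9736475634 ',
       '\xa0\xa0+11-9177772266\xa0\xa0+01-9736475234']

print("\xa0".isspace())  # True: the non-breaking space is whitespace

# One sub-list of numbers per original string
nested = [x.split() for x in lst]

# All numbers in a single flat list
flat = [number for x in lst for number in x.split()]
print(flat)
# ['+11-9188882266', '+01-9736475634', '+11-9177772266', '+01-9736475234']
```

Passing an explicit separator, as in x.split(" "), would not do the same thing: it splits only on the ASCII space and leaves the \xa0 characters in place.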
