How do I strip unwanted characters and strings?

I want to strip all unwanted [A-Z] characters (among others), except for certain words. For example, we have the following string:

get 5 and 9

I would like to get rid of all the words that are not 'and' or 'or', so the end result is 5 and 9. I also want to strip out all characters that are not part of '[0-9].+-*()<>\s'.

The current regular expression strips out all the unwanted characters, but it also strips out 'and', which I want to keep. In this example, the result would be '5 9'.

import re

string = 'get 5 and 9'
# Remove every character that is not a digit, an operator, a bracket or whitespace
pattern = re.compile(r'[^0-9.+\-/*()<>\s]')
string = re.sub(pattern, '', string)

I am not an expert on regular expressions and am struggling to find a solution for this. I am kind of lost.

Is this possible or should I look for other solutions?

Revised version

test = "get 6 AND 9 or 3 for 6"
keywords = ['and', 'or']
# Keep a token only if it is a keyword (case-insensitive) or consists of digits
print(' '.join(t for t in test.split() if t.lower() in keywords or t.isdigit()))

$ python test.py
6 AND 9 or 3 6

Because each whitespace-separated token is compared as a whole, this rejects longer words that merely contain 'and' or 'or', such as 'for'.
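The revised version keeps only keywords and pure-digit tokens, while the question also lists operator characters it wants to keep. Here is a minimal sketch of one way to extend the same token-based idea. The function name `keep_tokens` and the `ALLOWED` pattern are my own assumptions, based on the character list in the question, and are not part of the original answer.

```python
import re

KEYWORDS = {"and", "or"}
# Assumption: a token is also kept if it consists only of the characters
# listed in the question: digits . + - / * ( ) < >
ALLOWED = re.compile(r"[0-9.+\-/*()<>]+")


def keep_tokens(text):
    """Keep whole-word keywords and tokens built only from allowed characters."""
    kept = [
        tok for tok in text.split()
        if tok.lower() in KEYWORDS or ALLOWED.fullmatch(tok)
    ]
    return " ".join(kept)


print(keep_tokens("get 6 AND 9 or 3 for 6"))   # 6 AND 9 or 3 6
print(keep_tokens("band (5 + 9) or 2.5 < 3"))  # (5 + 9) or 2.5 < 3
```

One limitation of this sketch: a mixed token such as 'x5' is dropped entirely instead of being reduced to '5', which may or may not be what you want.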

Previous version. I think this is a pretty simple solution, but unfortunately it does not work in general because it also picks up 'and' and 'or' inside longer words.

import re

test = "get 6 AND 9 or 3"
# Raw string so that \d and \s reach the regex engine unchanged
pattern = re.compile(r"(?i)(and|or|\d|\s)")
result = re.findall(pattern, test)
print(''.join(result).strip())

$ python test.py
6 AND 9 or 3

Matching is case-insensitive because of (?i). Spaces are retained by \s, and strip() removes them from the beginning and end before printing. Digits are retained by \d. The parentheses around and|or|\d|\s form a capturing group; findall returns a list of everything the group matched, and those pieces are then joined back together inside the print call.
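The substring problem of the previous version can probably be avoided with word boundaries. The following is a sketch under my own assumptions, not the original author's code: `\b` anchors 'and' and 'or' to whole words, the character class is my reading of the allowed characters in the question, and `strip_unwanted` is a hypothetical helper name.

```python
import re

# \b(?:and|or)\b matches 'and' / 'or' only as whole words;
# the character class keeps digits, the listed operators and whitespace.
PATTERN = re.compile(r"\b(?:and|or)\b|[0-9.+\-/*()<>\s]", re.IGNORECASE)


def strip_unwanted(text):
    """Join every allowed match, then collapse runs of whitespace."""
    kept = "".join(PATTERN.findall(text))
    return " ".join(kept.split())


print(strip_unwanted("get 5 and 9"))             # 5 and 9
print(strip_unwanted("get 6 AND 9 or 3 for 6"))  # 6 AND 9 or 3 6
```

Note that digits count as word characters for `\b`, so in an input like '5and9' the 'and' is not treated as a whole word and is removed.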

An approach without using regular expressions

text = 'get 5 and 9'

accept_list = ['and', 'or']

output = []
for x in text.split():
    try:
        # Keep tokens that parse as integers
        output.append(str(int(x)))
    except ValueError:
        # Otherwise keep only the accepted words
        if x in accept_list:
            output.append(x)

print(' '.join(output))

Output

5 and 9
