How can I remove numbers, and words with length below 2, from a sentence?
I am trying to remove words that are shorter than 2 characters and any word that consists only of digits. For example:
s = " This is a test 1212 test2"
The desired output is:
" This is test test2"
I tried \w{2,}; this removes all the words whose length is below 2. When I added \D+, it removed all the digits, but I didn't want to lose the 2 in test2.
You may use:
s = re.sub(r'\b(?:\d+|\w)\b\s*', '', s)
Pattern Details:
\b: Match a word boundary
(?:\d+|\w): Match 1+ digits or a single word character
\b: Match a word boundary
\s*: Match 0 or more whitespace characters
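To check this pattern end to end, here is a minimal sketch that applies it to the sample sentence from the question. The helper name `remove_short_and_numeric` is hypothetical and introduced only for illustration.

```python
import re

# Whole words that are all digits, or exactly one word character,
# together with any whitespace that follows them.
PATTERN = re.compile(r'\b(?:\d+|\w)\b\s*')

def remove_short_and_numeric(text):
    """Hypothetical helper: drop 1-character words and all-digit words."""
    return PATTERN.sub('', text)

s = " This is a test 1212 test2"
result = remove_short_and_numeric(s)
print(repr(result))  # ' This is test test2'
```

Because the trailing `\s*` consumes the whitespace after each removed word, no doubled spaces are left behind. `test2` survives because there is no word boundary between `t` and `2`.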
You can make use of word boundaries '\b' and remove anything that is 1 character long inside boundaries: number or letter, doesn't matter. Also remove anything between boundaries that is just numbers:
import re
s = " This is a test 1212 test2"
print(re.sub(r"\b([^ ]|\d+)\b", "", s))
Output (the spaces on either side of each removed word remain):
 This is  test  test2
Explanation:
\b( word boundary followed by a group
[^ ] anything that is not a space (1 character)
| or
\d+ one or more digits
)\b end of the group, followed by another boundary
Each match is replaced with "" by re.sub(pattern, replaceBy, source).
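One consequence of using `[^ ]` for the single-character branch is worth seeing in a small sketch: a lone punctuation character between two word characters also sits between two word boundaries, so it is removed as well. The `\w` variant below is an assumption added for comparison and is not part of the answer above.

```python
import re

s = " This is a test 1212 test2"

# The pattern from the answer: doubled spaces stay where words were removed.
plain = re.sub(r"\b([^ ]|\d+)\b", "", s)
print(repr(plain))  # ' This is  test  test2'

# [^ ] also matches the hyphen, which has a word boundary on each side.
loose = re.sub(r"\b([^ ]|\d+)\b", "", "well-known")

# Comparison variant (assumption): \w restricts the branch to word characters.
strict = re.sub(r"\b(\w|\d+)\b", "", "well-known")

print(loose, strict)  # wellknown well-known
```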
Maybe (?i)\b(?:\d+|[a-z])\b[ \t]*
https://regex101.com/r/bnS15k/1
It does some whitespace trimming.
Whitespace trimming is probably more important for this kind of thing.
This modified version trims from both sides.
Just substitute using
(?im)(?:([ \t])+\b(?:\d+|[a-z])\b[ \t]*[ \t]*|^\b(?:\d+|[a-z])\b[ \t]*[ \t]*())
with the replacement \1\2
https://regex101.com/r/gSswPe/1
It strips whitespace from both sides.
(?im)
(?:
( [ \t] )+ # (1)
\b
(?: \d+ | [a-z] )
\b [ \t]* [ \t]*
|
^ \b
(?: \d+ | [a-z] )
\b [ \t]* [ \t]*
( ) # (2)
)
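As a rough sketch of how the expanded pattern could be run in Python (assuming the `re` module is the target flavor, which the answer does not state), the layout can be kept by compiling with `re.VERBOSE`. The name `TRIM` is hypothetical, and the doubled `[ \t]*` is written once here since the two forms match the same text.

```python
import re

# Spaces inside the [ \t] character classes stay literal in verbose mode.
TRIM = re.compile(r"""
    (?:
        ( [ \t] )+                 # (1) last whitespace character before the word
        \b (?: \d+ | [a-z] ) \b
        [ \t]*
      |
        ^ \b (?: \d+ | [a-z] ) \b
        [ \t]*
        ( )                        # (2) always empty
    )
""", re.IGNORECASE | re.MULTILINE | re.VERBOSE)

# A group that did not take part in the match is substituted as an empty string.
middle = TRIM.sub(r"\1\2", " This is a test 1212 test2")
leading = TRIM.sub(r"\1\2", "a test 1212 test2")

print(repr(middle))   # ' This is test test2'
print(repr(leading))  # 'test test2'
```

Group 1 puts back a single whitespace character when the removed word is in the middle of a line, while the `^` branch puts back nothing when the word starts the line.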
You can do it like this:
import re
s = " This is a test 1212 test2"
p = re.compile(r"(\b(\w{0,1})\b)|(\b(\d+)\b)")
result = p.sub('', s)
print(result)
Output:
" This is  test  test2"
I noticed that your desired output does not contain consecutive whitespace. If you want to replace consecutive spaces with a single one, you can do this:
p = re.compile(r" +")
result = p.sub(' ', result)
Output:
" This is test test2"
(\b(\w{0,1})\b)
This group matches words of length up to and including 1.
(\b(\d+)\b)
This group matches words composed of digits only.
|
The pipe means "or", so this expression matches either the first or the second group.
\b
This is the word boundary. Surrounding a regex with \b makes it match whole words only.
\w
This matches characters that can be part of a word.
\d+
This means "one or more digits".
Note that what \b and \w match depends on the regex flavor you are using.
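The flavor note can be made concrete for Python's `re` module, where `\w` and `\b` are Unicode-aware for `str` patterns unless `re.ASCII` is set. The sketch below is specific to Python 3; the sample phrase is an assumption chosen for illustration, and other engines may behave differently.

```python
import re

pattern = r'\b(?:\d+|\w)\b\s*'
phrase = "\u00e0 la carte"  # starts with the single letter 'à'

# Default (Unicode) matching: 'à' is a word character, so it is a 1-letter word.
unicode_result = re.sub(pattern, '', phrase)

# ASCII matching: 'à' is not a word character, so nothing is removed.
ascii_result = re.sub(pattern, '', phrase, flags=re.ASCII)

print(unicode_result)  # la carte
print(ascii_result)    # à la carte
```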
Just to put my two cents in: you could use the built-in string methods:
s = " This is a test 1212 test2"
result = " ".join(word for word in s.split()
if len(word) >= 2 and not word.isdigit())
print(result)
This yields:
This is test test2
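The same idea can be wrapped in a small function to make its behavior easier to compare with the regex answers. The name `drop_short_and_numeric` and the `min_len` parameter are hypothetical, added here only as a sketch.

```python
def drop_short_and_numeric(sentence, min_len=2):
    """Hypothetical helper: keep words of at least min_len characters
    that are not made up of digits only."""
    return " ".join(
        word for word in sentence.split()
        if len(word) >= min_len and not word.isdigit()
    )

print(drop_short_and_numeric(" This is a test 1212 test2"))  # This is test test2

# Punctuation stays attached to its word, so "a," counts as 2 characters.
print(drop_short_and_numeric("a, b 7 c3"))  # a, c3
```

Unlike the regex versions, `split()` discards leading and trailing whitespace and collapses runs of whitespace, so the result never starts with a space. "Word" here means anything between whitespace rather than a run of `\w` characters.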