How to remove numbers and words of length below 2 from a sentence?

Question

I am trying to remove words that have a length below 2 and any word that consists only of digits. For example:

 s = " This is a test 1212 test2"

The desired output is:

" This is test test2"

I tried \w{2,}, which handles all the words whose length is below 2. When I added \D+, it removed all the digits, but I don't want to lose the 2 in test2.

Answer 1

You may use:

s = re.sub(r'\b(?:\d+|\w)\b\s*', '', s)

RegEx Demo

Pattern Details:

\b : Match a word boundary
(?:\d+|\w) : Match 1+ digits or a single word character
\b : Match a word boundary
\s* : Match 0 or more whitespace characters
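
As a quick check of the pattern details above, here is a minimal sketch that runs the same pattern on the sample sentence from the question. The variable names are only illustrative.

```python
import re

s = " This is a test 1212 test2"

# \b(?:\d+|\w)\b matches a whole word that is either all digits or a single
# word character; \s* also consumes the whitespace that follows that word.
pattern = re.compile(r'\b(?:\d+|\w)\b\s*')

result = pattern.sub('', s)
print(repr(result))  # ' This is test test2'
```

"test2" survives because there is no word boundary between "test" and "2", so neither alternative can match inside it. Since only the whitespace after a removed word is consumed, a removed word at the very end of the string leaves the space before it in place.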

Answer 2

You can make use of word boundaries '\b' and remove anything that is 1 character long inside the boundaries: digit or letter, it doesn't matter. Also remove anything between boundaries that consists only of digits:

import re

s = " This is a test 1212 test2"

print(re.sub(r"\b([^ ]|\d+)\b", "", s))

Output:

 This is  test  test2

Explanation:

\b(           word boundary followed by a group
   [^ ]           anything that is not a space (1 character) 
       |              or
        \d+       one or more digits
)             followed by another boundary

Each match is replaced with "" by re.sub(pattern, replaceBy, source).

Answer 3

Maybe (?i)\b(?:\d+|[a-z])\b[ \t]*
https://regex101.com/r/bnS15k/1

It does some whitespace trimming.

Whitespace trimming is probably more important for this kind of thing.
This modified version trims from both sides.

Just substitute using
(?im)(?:([ \t])+\b(?:\d+|[a-z])\b[ \t]*[ \t]*|^\b(?:\d+|[a-z])\b[ \t]*[ \t]*())
with the replacement \1\2

https://regex101.com/r/gSswPe/1
It strips whitespace from both sides.

 (?im)
 (?:
    ( [ \t] )+           # (1)
    \b 
    (?: \d+ | [a-z] )
    \b [ \t]* [ \t]* 
  | 
    ^ \b 
    (?: \d+ | [a-z] )
    \b [ \t]* [ \t]* 
    ( )                  # (2)
 )
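
The answer only links to regex101, so the following is a hedged sketch of how the two-sided pattern might be applied from Python's re module. It assumes Python 3.5 or later, where a group that did not take part in the match is substituted as an empty string; the variable names are illustrative.

```python
import re

s = " This is a test 1212 test2"

# Pattern copied from the answer. Group 1 keeps one whitespace character when
# the removed word is inside the line; group 2 is empty for the line-start case.
pattern = re.compile(
    r'(?im)(?:([ \t])+\b(?:\d+|[a-z])\b[ \t]*[ \t]*'
    r'|^\b(?:\d+|[a-z])\b[ \t]*[ \t]*())'
)

# Exactly one of the two groups takes part in each match, so \1\2 writes back
# either a single space or nothing.
result = pattern.sub(r'\1\2', s)
print(repr(result))  # ' This is test test2'
```

On this input each removed word and the whitespace around it collapse into a single space, which gives the desired output without a second clean-up pass.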

Answer 4

You can do it like this:

import re

s = " This is a test 1212 test2"

p = re.compile(r"(\b(\w{0,1})\b)|(\b(\d+)\b)")

result = p.sub('', s)

print(result)

Output:

" This is  test  test2"

I noticed that your desired output does not contain consecutive whitespace. If you want to replace consecutive spaces with a single one, you can do this:

p = re.compile(r"  +")
result = p.sub(' ', result)

Output:

" This is test test2"

(\b(\w{0,1})\b) this group matches words of length up to and including 1

(\b(\d+)\b) this group matches words composed of digits only

| The pipe means "or", so this expression matches either the first alternative or the second

\b This is the "word boundary". By surrounding a regex with \b, it will match whole words only

\w This matches characters that can be part of a word

\d+ This means "one or more digits"

Note that what \b and \w match depends on the regex flavor you are using.
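
Even within Python's re module the meaning of these classes depends on a flag. The following is a small sketch, not part of the original answer, showing how \w and \d behave on str patterns with and without re.ASCII; the sample text is made up for illustration.

```python
import re

text = "na\u00efve caf\u00e9"       # "naïve café", written with escapes
arabic_digits = "\u0663\u0663"      # two ARABIC-INDIC DIGIT THREE characters

# Default for str patterns in Python 3: \w and \d are Unicode-aware.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.fullmatch(r"\d+", arabic_digits) is not None

# With re.ASCII, \w is [a-zA-Z0-9_] and \d is [0-9] only.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_digits = re.fullmatch(r"\d+", arabic_digits, flags=re.ASCII) is not None

print(unicode_words, unicode_digits)  # ['naïve', 'café'] True
print(ascii_words, ascii_digits)      # ['na', 've', 'caf'] False
```

If the input may contain non-ASCII text, it is worth deciding up front which definition of "word" and "number" the clean-up should use.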

Answer 5

Just to put my two cents in: you could use the built-in string functions:

s = " This is a test 1212 test2"
result = " ".join(word for word in s.split() 
                  if len(word) >= 2 and not word.isdigit())
print(result)

Which would yield:

This is test test2
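
The same idea can be wrapped in a small helper. This is only a sketch: the function name drop_short_and_numeric and the min_len parameter are hypothetical and not part of the original answer.

```python
def drop_short_and_numeric(sentence, min_len=2):
    """Keep whitespace-separated words that have at least min_len
    characters and are not made up of digits only."""
    return " ".join(
        word for word in sentence.split()
        if len(word) >= min_len and not word.isdigit()
    )

print(drop_short_and_numeric(" This is a test 1212 test2"))  # This is test test2
```

Two differences from the regex answers are worth keeping in mind. split() discards leading and trailing whitespace, so the leading space of the desired output is not preserved. Words are also delimited by whitespace only, so a token such as "a," counts as two characters and is kept.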

5 answers

Answer 1: score 3, accepted, 2020-10-14 18:19:18
Answer 2: score 1, 2020-10-14 18:12:19
Answer 3: score 1
Answer 4: score 1, 2020-10-14 18:27:48
Answer 5: score 0, 2020-10-14 20:26:36
