Python 3 regex remove characters before certain point

I have multiple words stored in a list like this:

31547   4.7072% i
25109   3.7466% u
20275   3.0253% you
10992   1.6401% me
9490    1.4160% do
7681    1.1461% like
6293    0.9390% want
6225    0.9288% my
5459    0.8145% have
5141    0.7671% your

Now I need to clean this up so that everything before the word (here `i`) is removed. The word will not always be `i`, but everything before it always has the same format. I have seen similar questions, but their solutions only worked when the word/string was the same every time.

Thanks in advance for any help and advice. I have tried reading up on regex and working through tutorials, but I find it quite hard to get my head around.

For a similar problem, I needed to remove everything inside angle brackets, for which I used:

Cleanse = re.sub('<.*?>', '', line)

but I'm unsure how to adapt this to remove everything before the word. This is my first real encounter with regex.

To process a multiline string, you may use

s = re.sub(r'^\d+[ \t]+\d+\.\d+%[ \t]*', '', s, flags=re.M)
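As a quick check, the multiline call can be run against the first three rows from the question. This is only a sketch: the names `text` and `cleaned` are made up for illustration, and it assumes the rows are joined with `\n`.

```python
import re

# First three rows of the sample data from the question, joined with "\n".
text = (
    "31547   4.7072% i\n"
    "25109   3.7466% u\n"
    "20275   3.0253% you\n"
)

# re.M makes ^ match at the start of every line, not only at the start of the string.
cleaned = re.sub(r'^\d+[ \t]+\d+\.\d+%[ \t]*', '', text, flags=re.M)

print(cleaned)
```

Only the count and percentage columns are removed; the words and the line breaks are left in place.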

If you process line by line, use

r = re.compile(r'^\d+\s+\d+\.\d+%\s*')
...
s = r.sub('', s)
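A minimal sketch of the line-by-line variant, assuming the rows are already available as a list of strings. The names `PREFIX`, `strip_prefix`, and `lines` are hypothetical and not part of the original answer.

```python
import re

# Compile once and reuse the pattern for every line.
PREFIX = re.compile(r'^\d+\s+\d+\.\d+%\s*')


def strip_prefix(line):
    """Return the line without its leading count and percentage columns."""
    return PREFIX.sub('', line)


# A few rows from the question, one string per row.
lines = ["31547   4.7072% i", "10992   1.6401% me", "7681    1.1461% like"]
words = [strip_prefix(line) for line in lines]

print(words)
```

A line that does not start with the expected prefix is returned unchanged, because `sub` only replaces what the pattern matches.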

See the regex demo.

Pattern explanation:

  • ^ - start of a string (or of a line if the re.M flag is passed)
  • \d+ - 1 or more digits
  • \s+ - 1 or more whitespaces
  • \d+\.\d+ - 1+ digits, a literal ., 1+ digits
  • % - a literal % symbol
  • \s* - 0+ whitespaces
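The same breakdown can be written into the pattern itself with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. This is an optional restyling of the line-by-line pattern, not something the original answer used; `PREFIX` and `word` are illustrative names.

```python
import re

# Same pattern as the line-by-line version, annotated piece by piece.
PREFIX = re.compile(r"""
    ^          # start of the string
    \d+        # count column: 1 or more digits
    \s+        # 1 or more whitespace characters
    \d+\.\d+   # percentage value: digits, a literal dot, digits
    %          # a literal percent sign
    \s*        # 0 or more whitespace characters before the word
""", re.VERBOSE)

word = PREFIX.sub('', "6293    0.9390% want")
print(word)
```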

Note that in the "multiline" version, [ \t] is preferable because it only matches horizontal ASCII whitespace, whereas \s also matches line breaks and so can let a match run onto the next line. The same can be done with the more sophisticated [^\S\r\n] pattern, which is Unicode-aware by default in Python 3.x.
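The difference shows up when a row has no word after the percentage. The sketch below uses a hypothetical input (the second row is deliberately left without a word, which does not occur in the question's data) to compare `\s` with the two horizontal-only classes.

```python
import re

# Hypothetical input: the middle row has nothing after the percent sign.
text = "31547   4.7072% i\n25109   3.7466%\n20275   3.0253% you\n"

# \s* also matches "\n", so the empty row's line break is swallowed.
with_s = re.sub(r'^\d+\s+\d+\.\d+%\s*', '', text, flags=re.M)

# [ \t] stops at the line break, so the empty row survives as a blank line.
with_tab_space = re.sub(r'^\d+[ \t]+\d+\.\d+%[ \t]*', '', text, flags=re.M)

# [^\S\r\n] means "whitespace except CR and LF"; it behaves the same way here.
with_negated = re.sub(r'^\d+[^\S\r\n]+\d+\.\d+%[^\S\r\n]*', '', text, flags=re.M)

print(repr(with_s))
print(repr(with_tab_space))
print(repr(with_negated))
```

With `\s`, the output has one line fewer than the input, which matters if the line positions need to stay aligned with the original data.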
