
Python - How to delete all characters in a sub string up to and including a keyword

I have a fairly large corpus (500k news articles) in one column of a dataframe. Most (though not all) articles begin with various throwaway text up to the phrase '(Reuters) - '.

I have tried various permutations of the following regex to adjust the entire column in one shot, with no luck: it either deletes chunks of the articles or does nothing.

r = re.compile(r'\A\b.*[Reuters]\b')
reuters3 = reuters2['story_text'].str.replace(r,'', regex=True)

Any ideas on how best to tackle this from a regex and pandas method perspective? Thank you.

Below is an example showing the general pattern: the text to remove at the beginning (up to and including (Reuters) -), the text to keep in the middle, and the text to get rid of at the end (everything from (Editing by...) onward). The exact language, characters and length vary considerably across articles outside of these key cut-off words.

By Chris Scicluna VALLETTA, Jan 1 (Reuters) - The Mediterranean island of Malta became the smallest member of the euro zone at the stroke of midnight on Tuesday....[various lines of article text]...public information campaign has been a widely acknowledged success. (Editing by Michael Winfrey) ((gavin.jones@reuters.com; +39-06-8522-4232; Reuters Messaging: gavin.jones.reuters.com@reuters.net)) Keywords: ECB EXPANSION/EURO MALTA

If you need to keep the word, you can use

# regex=True is needed in pandas 2.0+, where str.replace defaults to literal replacement
reuters2['story_text'].str.replace(r'(?s)^.*?(?=\(Reuters\)\s*-)', '', regex=True)

If you don't need to keep the word, you can use

reuters2['story_text'].str.replace(r'(?s)^.*?\(Reuters\)\s*-\s*', '', regex=True)

Or, use Series.str.split like this:

import pandas as pd
df = pd.DataFrame({'story_text':['Some rubbish ... (Reuters) - Text']})
df['story_text'].str.split(r'\(Reuters\)\s*-', n=1).str[-1]
# => 0     Text

Details

  • (?s) - DOTALL modifier that makes . match any char, including line breaks
  • ^ - start of the string
  • .*? - any 0 or more chars, as few as possible
  • \(Reuters\) - the literal text (Reuters)
  • (?=\(Reuters\)\s*-) - a positive lookahead that matches a location immediately followed by (Reuters), 0+ whitespace chars and -
  • \s*-\s* - a - enclosed with 0+ whitespace chars.

See the regex demo #1 and regex demo #2.
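To make the difference from the pattern in the question concrete, here is a minimal sketch using only the standard re module. The two sample strings are hypothetical, shaped like the article in the question. It shows why the original pattern "either deletes chunks of the articles or does nothing": [Reuters] is a character class (any one of R, e, u, t, r, s), the greedy .* runs to the last such letter that ends a word, and \A\b cannot match at all when the text starts with a non-word character.

```python
import re

# Hypothetical sample rows, shaped like the article in the question.
with_header = "By Chris Scicluna VALLETTA, Jan 1 (Reuters) - Malta joined the euro zone."
no_byline = " (Reuters) - Malta joined the euro zone."

# Pattern from the question: a character class plus a greedy .*
original = re.compile(r'\A\b.*[Reuters]\b')
print(repr(original.sub('', with_header)))       # '.'  (almost everything deleted)
print(original.sub('', no_byline) == no_byline)  # True (nothing happens)

# Patterns from the answer: escaped literal parentheses plus a lazy .*?
drop_keyword = re.compile(r'(?s)^.*?\(Reuters\)\s*-\s*')
keep_keyword = re.compile(r'(?s)^.*?(?=\(Reuters\)\s*-)')

print(drop_keyword.sub('', with_header))  # Malta joined the euro zone.
print(drop_keyword.sub('', no_byline))    # Malta joined the euro zone.
print(keep_keyword.sub('', with_header))  # (Reuters) - Malta joined the euro zone.

# A row without the keyword is left untouched.
print(drop_keyword.sub('', "No marker here."))  # No marker here.
```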

The split solution uses a much simpler regex, \(Reuters\)\s*-, and splits the string into 2 parts (n=1 sets the maximum number of splits); .str[-1] then gets the last (here, second) item.

Just .split() on it
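The question also asks about dropping everything from (Editing by...) onward, which the patterns above do not cover. The same split idea could be applied from the other end, as in this rough sketch. The helper name strip_reuters_wrapper and the abbreviated sample text are made up for illustration, and it assumes the first "(Editing by" in the story really is the start of the trailer.

```python
import re

# Hypothetical sample, abbreviated from the article in the question.
text = ("By Chris Scicluna VALLETTA, Jan 1 (Reuters) - The Mediterranean island of Malta "
        "joined the euro zone. (Editing by Michael Winfrey) Keywords: ECB EXPANSION/EURO MALTA")

def strip_reuters_wrapper(story):
    """Remove the header up to '(Reuters) -' and the trailer from '(Editing by'."""
    # Keep what follows the first "(Reuters) -"; [-1] falls back to the whole
    # string when the marker is absent.
    body = re.split(r'\(Reuters\)\s*-\s*', story, maxsplit=1)[-1]
    # Keep what precedes the first "(Editing by" (assumed trailer marker);
    # [0] falls back to the whole string when it is absent.
    body = re.split(r'\s*\(Editing by', body, maxsplit=1)[0]
    return body

print(strip_reuters_wrapper(text))
# The Mediterranean island of Malta joined the euro zone.
```

On the dataframe from the question, something like reuters2['story_text'].map(strip_reuters_wrapper) should apply it row by row, assuming every cell is a string.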

parts = starting_string.split("Reuters", 1)  # split at most once
story = parts[-1]  # get the last part

Example

>>> s = "blah blah Reuters bulk of the story"
>>> s.split("Reuters", 1)
['blah blah ', ' bulk of the story']
>>> "missing the newsgroup!".split("Reuters", 1)
['missing the newsgroup!']
>>> ["start", "end"][-1]
'end'
>>> ["bulk without splitword"][-1]
'bulk without splitword'

Adding a space or other chars around your split target may help too

All together:

>>> s = "blah blah Reuters bulk of the story"
>>> s.split(" Reuters ", 1)[-1]
'bulk of the story'

You may want to do some extra validation against the case where your split string is merely mentioned somewhere in an article that does not have it in the header. Perhaps simply check that, if there are two parts, the second is longer than the first and the first is at most N characters.
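One way that validation could look, as a sketch only: the function name strip_header and the default max_header_len=200 (standing in for N) are assumptions, not values from the question, and would need tuning against the real corpus.

```python
def strip_header(text, marker="(Reuters) - ", max_header_len=200):
    """Return the text after marker, but only when marker looks like a header."""
    # partition splits at the first occurrence; sep is '' when marker is absent.
    head, sep, body = text.partition(marker)
    if not sep:
        return text  # marker not found: leave the article alone
    # Accept the split only if the discarded part is short (at most N chars)
    # and shorter than what remains.
    if len(head) <= max_header_len and len(body) > len(head):
        return body
    return text  # marker is probably just mentioned inside the article

story = "By Chris Scicluna VALLETTA, Jan 1 (Reuters) - " + "story text " * 20
late_mention = "word " * 100 + "(Reuters) - tail"

print(strip_header(story) == "story text " * 20)      # True
print(strip_header(late_mention) == late_mention)     # True
```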

