Python - How to delete all characters in a substring up to and including a keyword
I have a fairly large corpus (500k news articles) in one column of a dataframe. The beginning of most (though not all) articles has various throwaway text up to the phrase '(Reuters) - '.
I have tried various permutations of the following regex, trying to adjust the entire column in one shot, with no luck: it either deletes chunks of the articles or does nothing.
r = re.compile(r'\A\b.*[Reuters]\b')
reuters3 = reuters2['story_text'].str.replace(r,'', regex=True)
Any ideas on how best to tackle this from a regex and pandas method perspective? Thank you.
Provided below is an example showing the general pattern: the text to remove at the beginning (up to and including (Reuters) -), the text to keep in the middle, and the text to get rid of at the end (everything following and including (Editing by...). The exact language, characters and length vary considerably across articles outside of these key cut-off words.
By Chris Scicluna VALLETTA, Jan 1 (Reuters) - The Mediterranean island of Malta became the smallest member of the euro zone at the stroke of midnight on Tuesday....[various lines of article text]...public information campaign has been a widely acknowledged success. (Editing by Michael Winfrey) ((gavin.jones@reuters.com; +39-06-8522-4232; Reuters Messaging: gavin.jones.reuters.com@reuters.net)) Keywords: ECB EXPANSION/EURO MALTA
If you need to keep the word, you can use
reuters2['story_text'].str.replace(r'(?s)^.*?(?=\(Reuters\)\s*-)', '', regex=True)
If you don't need to keep the word, you can use
reuters2['story_text'].str.replace(r'(?s)^.*?\(Reuters\)\s*-\s*', '', regex=True)
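As a minimal sketch of the second variant applied to a whole column: the sample rows below are made up for illustration, and `regex=True` is passed explicitly on the assumption that a recent pandas is in use, where `str.replace` treats the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Hypothetical sample data; the second row has no "(Reuters) -" marker.
reuters2 = pd.DataFrame({
    "story_text": [
        "By Chris Scicluna VALLETTA, Jan 1 (Reuters) - The Mediterranean island of Malta joined.",
        "An article with no dateline marker at all.",
        "Line one\nLine two (Reuters) - Body after a newline.",
    ]
})

# Drop everything up to and including the first marker.
# (?s) lets "." cross line breaks; "^" anchors the match at the string start.
pattern = r"(?s)^.*?\(Reuters\)\s*-\s*"
cleaned = reuters2["story_text"].str.replace(pattern, "", regex=True)

print(cleaned.tolist())
```

Rows without the marker do not match the pattern, so they should come back unchanged.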
Or, use Series.str.split like this:
import pandas as pd
df = pd.DataFrame({'story_text':['Some rubbish ... (Reuters) - Text']})
df['story_text'].str.split(r'\(Reuters\)\s*-', n=1).str[-1]
# => 0 Text
Details
(?s) - DOTALL modifier that makes . match any char, including line breaks
^ - start of a string
.*? - any 0 or more chars, as few as possible
\(Reuters\) - a literal (Reuters) text
(?=\(Reuters\)\s*-) - a positive lookahead that matches a location immediately followed with (Reuters), 0+ whitespaces and -
\s*-\s* - a - enclosed with 0+ whitespaces.
See the regex demo #1 and regex demo #2.
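To see why the lazy .*? and the lookahead matter, here is a small sketch using the standard re module on a made-up string that mentions the marker twice. The sample text and variable names are assumptions for illustration only.

```python
import re

# Hypothetical article text that contains the marker twice.
text = "By A. Writer LONDON, Jan 1 (Reuters) - Body that later cites (Reuters) - again."

# Lazy quantifier: stops at the first marker.
lazy = re.sub(r"(?s)^.*?\(Reuters\)\s*-\s*", "", text)

# Greedy quantifier: runs on to the last marker and eats article text.
greedy = re.sub(r"(?s)^.*\(Reuters\)\s*-\s*", "", text)

# Lookahead: removes the prefix but leaves the marker in place.
kept = re.sub(r"(?s)^.*?(?=\(Reuters\)\s*-)", "", text)

print(lazy)
print(greedy)
print(kept)
```

The greedy version is one plausible way to end up "deleting chunks of the articles" when the marker is quoted again further down.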
The split solution uses a much simpler regex, \(Reuters\)\s*-, and splits the string into 2 parts (since n=1 is defined, and n is the number of splits); .str[-1] then gets the last (second here) item.
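A short sketch of the split approach on a column that also contains an article without the marker. The second row and the extra `.str.strip()` are additions for illustration: the strip removes the space left behind after the dash.

```python
import pandas as pd

# Hypothetical sample; the second row lacks the marker.
df = pd.DataFrame({
    "story_text": [
        "Some rubbish ... (Reuters) - Text",
        "No marker in this one",
    ]
})

# Split at most once on the marker, then take the last piece of each list.
parts = df["story_text"].str.split(r"\(Reuters\)\s*-", n=1)
body = parts.str[-1].str.strip()

print(body.tolist())
```

When the marker is absent the split yields a single-item list, so `.str[-1]` returns the original text untouched.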
Just .split() on it:
parts = starting_string.split("Reuters", 1) # split at most once
story = parts[-1] # get the last part
Example:
>>> s = "blah blah Reuters bulk of the story"
>>> s.split("Reuters", 1)
['blah blah ', ' bulk of the story']
>>> "missing the newsgroup!".split("Reuters", 1)
['missing the newsgroup!']
>>> ["start", "end"][-1]
'end'
>>> ["bulk without splitword"][-1]
'bulk without splitword'
Adding a space or other chars around your split target may help too.
All together:
>>> s = "blah blah Reuters bulk of the story"
>>> s.split(" Reuters ", 1)[-1]
'bulk of the story'
You may want to do some extra validation for the case where your splitting string is merely mentioned somewhere in an article that does not have it in the header. Perhaps simply check that, if there were two parts, the second is longer than the first and at most N characters.
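One way such a check might look, as a rough sketch: it assumes the length limit is meant for the header part, and `MAX_HEADER_CHARS`, `MARKER` and `strip_header` are hypothetical names with a threshold that would need tuning on real data.

```python
MARKER = "(Reuters) - "    # split target taken from the question
MAX_HEADER_CHARS = 200     # hypothetical threshold, tune on real data


def strip_header(article: str) -> str:
    """Return the article body, dropping a short leading header if present."""
    parts = article.split(MARKER, 1)  # split at most once
    if len(parts) == 2:
        header, body = parts
        # Only cut when the prefix looks like a dateline rather than article text.
        if len(header) <= MAX_HEADER_CHARS and len(body) > len(header):
            return body
    return article


with_header = (
    "By Chris Scicluna VALLETTA, Jan 1 (Reuters) - The Mediterranean island "
    "of Malta became the smallest member of the euro zone."
)
late_mention = "x" * 300 + " (Reuters) - short tail"

print(strip_header(with_header))
print(strip_header(late_mention) == late_mention)
```

On a dataframe column this could be applied with something like `reuters2['story_text'].map(strip_header)`.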