I have a dataframe of movies, each with a synopsis.
Title   Synopsis
Movie1  Old Macdonald had a farm [Written by ABC rewrite]
Movie2  Wheels on the bus (Source: Melon)
Movie3  Tayo the bus [Produced by Wills Garage]
Movie4  James and Giant Apple (Source: Kismet)
I'd like to remove the trailing words that are not required for NLP, so that I get the dataframe below:
Title   Synopsis
Movie1  Old Macdonald had a farm
Movie2  Wheels on the bus
Movie3  Tayo the bus
Movie4  James and Giant Apple
I've tried the following code, but my Synopsis column ends up containing a string like "0"Iodfosomhgooad,somh...\n1GaBauadFal...". How can I resolve this? I'd appreciate any help, thank you.
removelist = [('[Written by]', '') ,('(Source:)', '')]
for old, new in removelist:
    df['Synopsis'] = re.sub(old, new, str(df['Synopsis']))
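Two things in the snippet above would explain the garbled output. First, str(df['Synopsis']) converts the whole Series into its printed representation (index labels and newlines included), so one long string is processed and assigned to every row. Second, '[Written by]' is a regex character class: it matches any single character out of W, r, i, t, e, n, b, y or a space, not the literal text. The sketch below reproduces the second effect on a plain string using only the standard re module; the names sample, char_class_result and literal_result are illustrative and not part of the original code.

```python
import re

sample = "Old Macdonald had a farm [Written by ABC rewrite]"

# Unescaped brackets form a character class, so every occurrence of
# W, r, i, t, e, n, b, y or a space is deleted one character at a time.
char_class_result = re.sub('[Written by]', '', sample)

# Escaping the opening bracket makes the pattern match the literal
# text "[Written by" instead.
literal_result = re.sub(r'\[Written by', '', sample)

print(char_class_result)
print(literal_result)
```

The first result loses letters throughout the sentence, which is the same kind of damage seen in the column.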
You can use
df['Synopsis'] = df['Synopsis'].str.replace(r'\s*(?:\[[^][]*]|\([^()]*\))\s*$', '', regex=True)
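To check the replacement end to end, the sample frame from the question can be rebuilt and the pattern applied to it. This is a minimal sketch, assuming pandas is installed; the variable name pattern is introduced here for readability. regex=True is passed explicitly because Series.str.replace treats the pattern as a literal string by default from pandas 2.0 onward.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Title": ["Movie1", "Movie2", "Movie3", "Movie4"],
    "Synopsis": [
        "Old Macdonald had a farm [Written by ABC rewrite]",
        "Wheels on the bus (Source: Melon)",
        "Tayo the bus [Produced by Wills Garage]",
        "James and Giant Apple (Source: Kismet)",
    ],
})

# A trailing [...] or (...) group, with any whitespace around it,
# anchored at the end of the string.
pattern = r"\s*(?:\[[^][]*]|\([^()]*\))\s*$"

# The replacement runs element-wise on the column; no loop is needed.
df["Synopsis"] = df["Synopsis"].str.replace(pattern, "", regex=True)

print(df)
```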
Details:
\s* - zero or more whitespaces
(?:\[[^][]*]|\([^()]*\)) - either
    \[[^][]*] - a [, any zero or more chars other than [ and ], and then a ] char
    | - or
    \([^()]*\) - a (, any zero or more chars other than ( and ), and then a ) char
\s* - zero or more whitespaces
$ - end of string.

You can use the regex replace method directly available to strings in Pandas DataFrames.
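data['Synopsis'] = data['Synopsis'].str.replace(r'\s*\[.*\]$|\s*\(.*\)$', '', regex=True)
\s*\[.*\]$ - optional whitespace, then anything between [ ] at the end of the string
| - separates the alternative patterns
\s*\(.*\)$ - optional whitespace, then anything between ( ) at the end of the string
The pattern is written as a raw string (r'...') so that the backslashes reach the regex engine unchanged.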
The result for your sample is:
        Synopsis
Title
Movie1  Old Macdonald had a farm
Movie2  Wheels on the bus
Movie3  Tayo the bus
Movie4  James and Giant Apple
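One difference between the two answers is worth knowing when a synopsis contains more than one bracketed group. A greedy .* between the brackets starts matching at the first opening bracket in the string, while a negated class such as [^][]* cannot cross another bracket and so only removes the last group. The sketch below compares the two on a made-up string (text is hypothetical and not from the question's data), using the standard re module; the same patterns could be passed to Series.str.replace with regex=True.

```python
import re

# Hypothetical synopsis with a bracketed group in the middle as well
# as at the end.
text = "Tayo [the bus] and friends [Produced by Wills Garage]"

# Greedy: .* runs from the first "[" to the final "]".
greedy = r"\s*\[.*\]$"

# Negated class: the match may not contain "[" or "]", so only the
# trailing group qualifies.
negated = r"\s*\[[^][]*]$"

greedy_result = re.sub(greedy, "", text)
negated_result = re.sub(negated, "", text)

print(repr(greedy_result))
print(repr(negated_result))
```

For the sample data in the question both approaches give the same output, since each synopsis has a single trailing group.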