I'm working with a data set too large to handle in Excel, so I'm using pandas/Python, but I'm relatively new to it. I have a column of book titles that also includes genres, both before and after the title. I only want the column to contain the book titles, so what would be the easiest way to remove the genres?
Here is an example of what the column contains:
Book Labels
Science Fiction | Drama | Dune
Thriller | Mystery | The Day I Died
Thriller | Razorblade Tears | Family | Drama
Comedy | How To Marry Keanu Reeves In 90 Days | Drama
...
So above, the book titles would be Dune, The Day I Died, Razorblade Tears, and How To Marry Keanu Reeves In 90 Days, but as you can see the genres precede as well as succeed the titles.
I was thinking I could create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters, but if anyone has suggestions on a simpler way to remove the genres and the "|" separators, please help me out.
If the titles were always in a consistent location, say 3rd in the list, then we would not need a list of genres. We could use Series.str.split with expand=True and get the 3rd column (index 2):
# Split on the separator, take the 3rd piece, and trim the surrounding spaces
df['Book Labels'] = df['Book Labels'].str.split('|', expand=True)[2].str.strip()
However, since your sample shows that the title is not in a consistent location, I'd go with your idea:
create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters
Use Series.replace to remove the genres and Series.str.strip to strip the separators:
genres = ['Science Fiction', 'Drama', 'Thriller', 'Mystery', 'Family', 'Comedy']
df['Book Labels'] = df['Book Labels'].replace('|'.join(genres), '', regex=True).str.strip('| ')
# Book Labels
# 0 Dune
# 1 The Day I Died
# 2 Razorblade Tears
# 3 How To Marry Keanu Reeves In 90 Days
This is an enhancement to @tdy's regex solution. The original regex Family|Drama matches the words "Family" and "Drama" anywhere in the string, so if a book title contains one of the words in genres, that word is removed as well.
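To make the problem concrete, here is a minimal sketch using the standard-library re module on a single label string. The sample label is taken from the test data used in this answer, and the same pattern behaves the same way when passed to a pandas replace with regex=True.

```python
import re

genres = ["Family", "Drama"]
naive = "|".join(genres)  # plain alternation: 'Family|Drama'

# The title here is "Drama 123", which happens to contain a genre word
label = "Drama | Drama 123 | Family"

# Every occurrence of a genre word is removed, including the one in the title
cleaned = re.sub(naive, "", label).strip("| ")
print(repr(cleaned))  # '123'
```

The title "Drama 123" is reduced to "123", which is why the pattern has to be anchored to the separators.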
Supposing the labels are separated by " | ", there are three positions in which a genre can appear and should be removed:
Drama |...
... | Drama |...
... | Drama
Use the regex (^|\| )(?:Family|Drama)(?=( \||$)) to match any of the three cases. Note that | Drama | Family contains two overlapping matches, because the separator after "Drama" is also the separator before "Family". Here I use the lookahead (?=( \||$)) so that the trailing separator is not consumed and both genres are matched instead of only the first one. See the question [Use regular expressions to replace overlapping subpatterns] for more details.
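The effect of the lookahead can be checked with a small sketch using the standard-library re module. The "consuming" pattern below is a hypothetical variant written only for comparison (it is not part of the answer), and the label string is made up for illustration.

```python
import re

label = "123 | Drama | Family"

# Hypothetical variant: the trailing separator is part of the match,
# so it is consumed and cannot start the next match
consuming = r"(^|\| )(?:Family|Drama)( \||$)"

# Pattern from this answer: the trailing separator is only checked
# with a lookahead, so it remains available for the next match
lookahead = r"(^|\| )(?:Family|Drama)(?=( \||$))"

with_consuming = re.sub(consuming, "", label).strip("| ")
with_lookahead = re.sub(lookahead, "", label).strip("| ")

print(repr(with_consuming))  # '123  Family' ("Family" is left behind)
print(repr(with_lookahead))  # '123'
```

With the consuming variant, "| Drama |" is matched as a whole, so there is no "| " left in front of "Family" and it survives. With the lookahead, both genres are removed.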
>>> genres = ["Family", "Drama"]
>>> df
# Book Labels
# 0 Drama | Drama 123 | Family
# 1 Drama 123 | Drama | Family
# 2 Drama | Family | Drama 123
# 3 123 Drama 123 | Family | Drama
# 4 Drama | Family | 123 Drama
>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))  # raw string, so \| is not treated as a string escape
>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)
# 0 | Drama 123
# 1 Drama 123
# 2 | Drama 123
# 3 123 Drama 123
# 4 | 123 Drama
>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")
# 0 Drama 123
# 1 Drama 123
# 2 Drama 123
# 3 123 Drama 123
# 4 123 Drama
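As an alternative to a regex (this is a different technique, not the one used above), the label can be split on the separator and every piece that is exactly a genre dropped. This is only a sketch: strip_genres is a hypothetical helper name, and it assumes that a title is never identical to a genre name.

```python
def strip_genres(label, genres, sep="|"):
    """Return the label with every piece that exactly equals a genre removed."""
    genre_set = set(genres)
    # Split on the separator and trim the spaces around each piece
    parts = [part.strip() for part in label.split(sep)]
    # Keep only non-empty pieces that are not genres
    kept = [part for part in parts if part and part not in genre_set]
    return " | ".join(kept)


genres = ["Science Fiction", "Drama", "Thriller", "Mystery", "Family", "Comedy"]

print(strip_genres("Thriller | Razorblade Tears | Family | Drama", genres))
# Razorblade Tears
print(strip_genres("Drama | Family | 123 Drama", genres))
# 123 Drama
```

On a DataFrame this could be applied per row, for example with df['Book Labels'].map(lambda s: strip_genres(s, genres)). Because whole pieces are compared, a title such as "123 Drama" is left untouched.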