Best way to remove specific words from column in pandas dataframe?

I'm working with a data set that is too large to handle in Excel, so I'm using pandas/Python, but I'm relatively new to it. I have a column of book titles that also includes genres, both before and after the title. I only want the column to contain book titles, so what would be the easiest way to remove the genres?

Here is an example of what the column contains:

Book Labels
Science Fiction | Drama | Dune
Thriller | Mystery | The Day I Died
Thriller | Razorblade Tears | Family | Drama
Comedy | How To Marry Keanu Reeves In 90 Days | Drama
...

So above, the book titles would be Dune, The Day I Died, Razorblade Tears, and How To Marry Keanu Reeves In 90 Days, but as you can see the genres precede as well as succeed the titles.

I was thinking I could create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters, but if anyone has suggestions on a simpler way to remove the genres and the "|" separator, please help me out.

If the titles were always in a consistent location, say 3rd in the list, then we would not need a list of genres. We could use Series.str.split with expand=True and get the 3rd column (index 2):

# split on the literal "|", take the 3rd piece, and trim the spaces left around it
df['Book Labels'] = df['Book Labels'].str.split('|', expand=True)[2].str.strip()
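To see how the fixed-position approach behaves on the sample from the question, here is a minimal sketch. The DataFrame is rebuilt by hand from the four rows shown in the question, and the variable name third_field is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Book Labels": [
        "Science Fiction | Drama | Dune",
        "Thriller | Mystery | The Day I Died",
        "Thriller | Razorblade Tears | Family | Drama",
        "Comedy | How To Marry Keanu Reeves In 90 Days | Drama",
    ]
})

# Split on the literal "|" and keep the third piece (column index 2).
# The pieces keep the spaces that surrounded the separator, so strip them.
third_field = df["Book Labels"].str.split("|", expand=True)[2].str.strip()

print(third_field.tolist())
# ['Dune', 'The Day I Died', 'Family', 'Drama']
```

The first two rows come out right, but the last two return a genre, because the title is not the third field there.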

However, since your sample shows that the title is not in a consistent location, I'd go with your idea:

create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters

Use Series.replace to remove the genres and Series.str.strip to strip the separators:

genres = ['Science Fiction', 'Drama', 'Thriller', 'Mystery', 'Family', 'Comedy']
df['Book Labels'] = df['Book Labels'].replace('|'.join(genres), '', regex=True).str.strip('| ')

#                             Book Labels
# 0                                  Dune
# 1                        The Day I Died
# 2                      Razorblade Tears
# 3  How To Marry Keanu Reeves In 90 Days
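A self-contained version of the same idea, as a sketch: the DataFrame is rebuilt from the question's sample, and the call to re.escape is an addition not in the answer above, on the assumption that some genre names might contain regex metacharacters (for example "Sci-Fi (Hard)").

```python
import re
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Book Labels": [
        "Science Fiction | Drama | Dune",
        "Thriller | Mystery | The Day I Died",
        "Thriller | Razorblade Tears | Family | Drama",
        "Comedy | How To Marry Keanu Reeves In 90 Days | Drama",
    ]
})
genres = ["Science Fiction", "Drama", "Thriller", "Mystery", "Family", "Comedy"]

# Escape each genre so it is matched literally, then join into one alternation.
pattern = "|".join(re.escape(g) for g in genres)

# Remove every genre occurrence, then strip leftover separators and spaces
# from both ends of each string.
cleaned = df["Book Labels"].replace(pattern, "", regex=True).str.strip("| ")

print(cleaned.tolist())
```

Note that this removes a genre word wherever it appears, including inside a title.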

This is an enhancement of @tdy's regex solution. The original regex Family|Drama matches the words "Family" and "Drama" anywhere in the string, so if a book title contains one of the genre words, that word is removed from the title as well.

Assuming the labels are separated by " | ", there are three cases we want to match and remove:

  1. Genre at the start of the string, e.g. Drama | ...
  2. Genre in the middle, e.g. ... | Drama | ...
  3. Genre at the end of the string, e.g. ... | Drama

The regex (^|\| )(?:Family|Drama)(?=( \||$)) matches any of these three cases. Note that in a string such as | Drama | Family the two matches overlap, because they share the " |" between them; a pattern that consumed the trailing separator would match only once. The lookahead (?=( \||$)) checks for the separator without consuming it, so both genres are matched. See the question [Use regular expressions to replace overlapping subpatterns] for more details.
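The overlap problem can be seen with plain re.sub. This is a small sketch on a made-up string ("Title | Drama | Family" is a hypothetical label, not from the question) comparing a pattern that consumes the trailing separator with the lookahead version.

```python
import re

text = "Title | Drama | Family"

# Variant 1: the trailing " |" is part of the match. Once "| Drama |" is
# consumed, the "|" that should introduce "Family" is gone, so the second
# genre is never matched.
consuming = re.sub(r"(^|\| )(?:Family|Drama)( \||$)", "", text)

# Variant 2: the trailing separator is only checked by a lookahead, so it
# is still available as the start of the next match.
lookahead = re.sub(r"(^|\| )(?:Family|Drama)(?=( \||$))", "", text)

print(repr(consuming))               # "Family" is still present
print(repr(lookahead.strip("| ")))   # 'Title'
```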

>>> genres = ["Family", "Drama"]

>>> df

#                       Book Labels
# 0      Drama | Drama 123 | Family
# 1      Drama 123 | Drama | Family
# 2      Drama | Family | Drama 123
# 3  123 Drama 123 | Family | Drama
# 4      Drama | Family | 123 Drama

>>> # raw string, so the backslashes reach the regex engine unchanged
>>> re_str = r"(^|\| )(?:{})(?=( \||$))".format("|".join(genres))

>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)

# 0       | Drama 123
# 1        Drama 123
# 2        | Drama 123
# 3    123 Drama 123
# 4        | 123 Drama

>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")

# 0        Drama 123
# 1        Drama 123
# 2        Drama 123
# 3    123 Drama 123
# 4        123 Drama
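The same whole-field matching can also be done without a regex. The sketch below is a different technique from the one above, not the answer's method: it splits each label on "|" and drops the pieces that are exactly a genre. The helper name drop_genres is hypothetical, and it assumes that a title is never identical to a genre name.

```python
import pandas as pd

genres = {"Family", "Drama"}

df = pd.DataFrame({
    "Book Labels": [
        "Drama | Drama 123 | Family",
        "Drama 123 | Drama | Family",
        "Drama | Family | Drama 123",
        "123 Drama 123 | Family | Drama",
        "Drama | Family | 123 Drama",
    ]
})

def drop_genres(label: str) -> str:
    """Keep only the fields of a ' | '-separated label that are not genres."""
    parts = [part.strip() for part in label.split("|")]
    kept = [part for part in parts if part not in genres]
    # If more than one non-genre field survives, keep them all, re-joined.
    return " | ".join(kept)

titles = df["Book Labels"].map(drop_genres)
print(titles.tolist())
# ['Drama 123', 'Drama 123', 'Drama 123', '123 Drama 123', '123 Drama']
```

A set lookup compares whole fields, so genre words inside a title are left alone, which is the same guarantee the anchored regex gives.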
