
Comparing strings in two columns to produce new column

I have a dataframe with two columns: "names" (~10 characters per entry) and "articles" (~20,000 characters per entry).

  Names                        | Articles
 ----------------------------------------------------------------------------------------------
| ['Craig Johnson']            | In the news yesterday, there were 80 reports of arson, and that's just the start of it...
 ----------------------------------------------------------------------------------------------
| ['Jim Billy', 'Bob Cob']     | In the news today, there were 81 reports of arson. Things are heating up...
 ----------------------------------------------------------------------------------------------
| ['Darth Vader']              | The Death Star has proven itself to be a top spot for bowling nights and...
 ----------------------------------------------------------------------------------------------

I need to iterate through each row, and determine when a last name from the "names" column appears in the last 20 characters of its row's respective "Articles" column.

I also need to check every "Articles" row to see if the word "Footer" appears.

If either a last name or the word "Footer" appears in the last twenty characters of any given "Articles" entry, then I need to create a new column, "doctored_articles," in which the article is cut off at the earliest occurrence of the last name or the string "Footer" within those last twenty characters.

If neither "Footer" nor the last name appears in the last 20 characters, then the "doctored_articles" entry should be the same as its "Articles" entry.

I'm not sure what the best way to approach the row iteration and comparison is, and would really appreciate any help. Thank you so much in advance!

Sample Case:

  Names                        | Articles
 ----------------------------------------------------------------------------------------------
| ['Craig Johnson']            | Craig Johnson: In the news yesterday, there were 80 reports of arson, and that's just the start of it. Yada...yada...yada...yada...yada...yada...yada...This article was written by C. Johnson, footer, ok
 ----------------------------------------------------------------------------------------------

Expected Output Column:

  Names                        | Doctored_Article
 ----------------------------------------------------------------------------------------------
| ['Craig Johnson']            | Craig Johnson: In the news yesterday, there were 80 reports of arson, and that's just the start of it. Yada...yada...yada...yada...yada...yada...yada...This article was written by C. 
 ----------------------------------------------------------------------------------------------

You can explode the Names column, build a boolean mask with zip, and finally aggregate the results back together with agg:

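import pandas as pd

df = pd.DataFrame({"Names":[['Craig Johnson'],['Jim Billy', 'Bob Cob'],['Darth Vader']],
                   "Articles":["In the news yesterday, there were 80 reports of arson, and that's just the start of it...",
                               "In the news today, there were 81 reports of arson. Things are heating up",
                               "The Death Star has proven itself to be a top spot for bowling nights Darth Vader"]})

# One row per individual name; the article text is repeated for each name.
df = df.explode("Names")

# True where the full name occurs in the last 20 characters of the article.
mask = [a in b[-20:] for a, b in zip(df["Names"], df["Articles"])]

# Keep only the matching rows and collect the names back into one list per article.
print(df.loc[mask].groupby("Articles").agg(list))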

                                                            Names
Articles                                                         
The Death Star has proven itself to be a top sp...  [Darth Vader]
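
Note that the snippet above matches the full name and only filters rows; it does not build the "doctored_articles" column. A minimal sketch of that step follows, under some assumptions that are not in the original post: the last name is taken to be the final whitespace-separated word of each name, matching is case-insensitive (the sample case contains "footer" in lower case), and the helper name doctor_article and the constant TAIL are hypothetical.

```python
import pandas as pd

TAIL = 20  # number of trailing characters to inspect (from the question)


def doctor_article(names, article, tail=TAIL):
    """Cut the article at the earliest last name or 'footer' found in its tail."""
    # Assumption: the last name is the final word of each (non-empty) full name.
    terms = [name.split()[-1].lower() for name in names] + ["footer"]
    start = max(len(article) - tail, 0)   # index where the tail begins
    tail_text = article[start:].lower()   # assumption: case-insensitive match
    hits = [tail_text.find(term) for term in terms]
    hits = [h for h in hits if h != -1]   # keep only terms that were found
    if not hits:
        return article                    # nothing found: keep the article as-is
    return article[: start + min(hits)]   # truncate at the earliest match


df = pd.DataFrame({
    "Names": [["Craig Johnson"], ["Jim Billy", "Bob Cob"]],
    "Articles": [
        "Yada...yada...This article was written by C. Johnson, footer, ok",
        "In the news today, there were 81 reports of arson. Things are heating up",
    ],
})

# Apply the helper row by row, without exploding the Names column.
df["doctored_articles"] = [
    doctor_article(names, article)
    for names, article in zip(df["Names"], df["Articles"])
]
print(df["doctored_articles"].tolist())
```

Because the helper works on the list of names directly, no explode/agg round trip is needed. A term that only partly falls inside the last 20 characters is not treated as a match in this sketch.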
