Compare strings in two columns to generate a new column

Question

I have a dataframe with two columns: "names" (~10 characters per entry) and "articles" (~20,000 characters per entry).

  Names                        | Articles
 ----------------------------------------------------------------------------------------------
| ['Craig Johnson']            | In the news yesterday, there were 80 reports of arson, and that's just the start of it...
 ----------------------------------------------------------------------------------------------
| ['Jim Billy', 'Bob Cob']     | In the news today, there were 81 reports of arson. Things are heating up...
 ----------------------------------------------------------------------------------------------
| ['Darth Vadar']              | The Death Star has proven itself to be a top spot for bowling nights and...
 ----------------------------------------------------------------------------------------------

I need to iterate through each row and determine whether a last name from the "names" column appears in the last 20 characters of that row's "Articles" entry.

I also need to check every "Articles" row to see if the word "Footer" appears.

If either a last name or the word "Footer" appears in the last twenty characters of any given "Article," then I need to create a new column, "doctored_articles," in which the article is cut short at the earliest instance of the last name or the string "Footer" within those last twenty characters.

If neither "Footer" nor the last name appears in the last 20 characters, then the "doctored_articles" entry should be the same as its "Articles" entry.

I'm not sure what the best way to approach the row iteration and comparison is, and would really appreciate any help. Thank you so much in advance!

Sample Case:

  Names                        | Articles
 ----------------------------------------------------------------------------------------------
| ['Craig Johnson']            | Craig Johnson: In the news yesterday, there were 80 reports of arson, and that's just the start of it. Yada...yada...yada...yada...yada...yada...yada...This article was written by C. Johnson, footer, ok
 ----------------------------------------------------------------------------------------------

Expected Output Column:

  Names                        | Doctored_Article
 ----------------------------------------------------------------------------------------------
| ['Craig Johnson']            | Craig Johnson: In the news yesterday, there were 80 reports of arson, and that's just the start of it. Yada...yada...yada...yada...yada...yada...yada...This article was written by C. 
 ----------------------------------------------------------------------------------------------

Answer 1

You can explode the column Names, create a mask with zip, and finally agg the results back together:

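import pandas as pd

df = pd.DataFrame({"Names":[['Craig Johnson'],['Jim Billy', 'Bob Cob'],['Darth Vader']],
                   "Articles":["In the news yesterday, there were 80 reports of arson, and that's just the start of it...",
                               "In the news today, there were 81 reports of arson. Things are heating up",
                               "The Death Star has proven itself to be a top spot for bowling nights Darth Vader"]})

# One row per individual name; the article text is repeated for each name.
df = df.explode("Names")

# True where the full name occurs in the last 20 characters of its article.
mask = [a in b[-20:] for a, b in zip(df["Names"], df["Articles"])]

# Keep the matching rows and collect the names back into one list per article.
print(df.loc[mask].groupby("Articles").agg(list))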

                                                            Names
Articles                                                         
The Death Star has proven itself to be a top sp...  [Darth Vader]
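
The snippet above only finds rows where a full name sits in the last 20 characters; it does not build the "doctored_articles" column the question asks for. Here is a rough sketch of one way that column might be produced. It is not part of the original answer, and it rests on a few assumptions: the "Names" column holds real Python lists, the last name is the last whitespace-separated word of each name, and matching is case-insensitive (the sample article contains "footer" in lowercase). The helper name doctor_article and the constant TAIL are made up for illustration.

```python
import re

import pandas as pd

TAIL = 20  # number of trailing characters to inspect (taken from the question)


def doctor_article(names, article, tail=TAIL):
    """Cut `article` at the earliest marker found in its last `tail` characters.

    Hypothetical helper: markers are the last word of each name plus "footer".
    """
    # Assumption: the last name is the final whitespace-separated word.
    markers = [name.split()[-1] for name in names if name.split()] + ["footer"]
    pattern = "|".join(re.escape(m) for m in markers)

    start = max(len(article) - tail, 0)
    # re.search returns the leftmost match, i.e. the earliest marker in the tail.
    match = re.search(pattern, article[start:], flags=re.IGNORECASE)
    if match is None:
        return article  # nothing to cut, keep the article unchanged
    return article[: start + match.start()]


df = pd.DataFrame({
    "Names": [["Craig Johnson"], ["Jim Billy", "Bob Cob"]],
    "Articles": [
        "Yada yada. This article was written by C. Johnson, footer, ok",
        "In the news today, there were 81 reports of arson. Things are heating up",
    ],
})

# Row-wise pairing of the two columns, in the same zip style as the answer.
df["doctored_articles"] = [
    doctor_article(names, article)
    for names, article in zip(df["Names"], df["Articles"])
]

print(df["doctored_articles"].tolist())
```

Note that a marker which begins before the 20-character window and only partly falls inside it is not detected by this sketch; whether that case should count depends on the real data.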

Answer 1
1 accepted 2020-06-30 06:35:51
