Join dataframes based on partial string match between columns

Question

I have a dataframe, and I want to check whether its entries are present in another df.

after_h.sample(10, random_state=1)

     movie                    year   ratings
108  Mechanic: Resurrection   2016   4.0
206  Warcraft                 2016   4.0
106  Max Steel                2016   3.5
107  Me Before You            2016   4.5

I want to check whether the movies above appear in another df.

    FILM                             Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560

I want something like this as my final output:

    FILM              votes
0  Max Steel           560

Answer 1

There are two ways:

Get the row indices of the partial matches with FILM.str.startswith(title) or FILM.str.contains(title). Either of:
df1[ df1.movie.apply( lambda title: df2.FILM.str.startswith(title) ).any(axis=1) ]
df1[ df1['movie'].apply(lambda title: df2['FILM'].str.contains(title, regex=False)).any(axis=1) ]

     movie      year      ratings
106  Max Steel  2016      3.5
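Option 1 filters df1, whereas the question asks for the matching rows of df2. A minimal sketch of the reverse direction follows, assuming the sample data from the question rebuilt inline. The variable name pattern is only illustrative, and each title is passed through re.escape so that characters such as '?' or '(' are matched literally.

```python
import re

import pandas as pd

# Sample data from the question, rebuilt inline so the sketch is self-contained
df1 = pd.DataFrame(
    {"movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
     "year": [2016, 2016, 2016, 2016],
     "ratings": [4.0, 4.0, 3.5, 4.5]},
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {"FILM": ["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)",
              "Do You Believe? (2015)", "Max Steel (2016)"],
     "Votes": [4170, 950, 3000, 350, 560]}
)

# Escape every title and join them into a single alternation pattern
pattern = "|".join(re.escape(title) for title in df1["movie"])

# str.match anchors the pattern at the start of each string, like startswith
matched = df2[df2["FILM"].str.match(pattern)]
print(matched)
```

One pattern for all titles avoids building an intermediate boolean frame with one column per df2 row, at the cost of a longer regular expression.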

Alternatively, split the compound string column df2['FILM'] into its two components, the movie title and the year, and then use merge() on those columns.

# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)

df2.merge(df1)
       movie  year  Votes  ratings
0  Max Steel  2016    560      3.5
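The default merge() is an inner join, so movies without a partner disappear silently. If you also want to see which df1 rows found no match, a left merge with indicator=True is one option. This is only a sketch on the question's sample data; the names parts, films and merged are illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt inline so the sketch is self-contained
df1 = pd.DataFrame(
    {"movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
     "year": [2016, 2016, 2016, 2016],
     "ratings": [4.0, 4.0, 3.5, 4.5]},
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {"FILM": ["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)",
              "Do You Believe? (2015)", "Max Steel (2016)"],
     "Votes": [4170, 950, 3000, 350, 560]}
)

# Split "Title (YYYY)" into named title and year columns
parts = df2["FILM"].str.extract(r"^(?P<movie>.*) \((?P<year>\d{4})\)$")
films = parts.assign(year=parts["year"].astype(int), Votes=df2["Votes"])

# A left merge keeps every df1 row; the _merge column records
# whether a partner row was found in films
merged = df1.merge(films, on=["movie", "year"], how="left", indicator=True)
print(merged)
```

Rows marked left_only in the _merge column are the movies that have no counterpart in df2.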

(Thanks to @user3483203 for the help here and in the Python chat room)

Code to recreate the dataframes:

import pandas as pd
from io import StringIO

dat1 = """movie           year   ratings
108  Mechanic: Resurrection   2016     4.0
206  Warcraft                 2016     4.0
106  Max Steel                2016     3.5
107  Me Before You            2016     4.5"""

dat2 = """FILM                   Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560"""

df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')

Answer 2

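Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings, you first need to concatenate movie and year from df1: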

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'

res = df2[df2['FILM'].isin(s)]

print(res)

               FILM  Votes
4  Max Steel (2016)    560
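The isin result still carries the year in FILM, while the output requested in the question shows the bare title. As a rough sketch under the same sample data, the trailing " (YYYY)" can be stripped afterwards; the name keys is illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt inline so the sketch is self-contained
df1 = pd.DataFrame(
    {"movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
     "year": [2016, 2016, 2016, 2016],
     "ratings": [4.0, 4.0, 3.5, 4.5]},
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {"FILM": ["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)",
              "Do You Believe? (2015)", "Max Steel (2016)"],
     "Votes": [4170, 950, 3000, 350, 560]}
)

# Build "Title (YYYY)" keys from df1 in the same format as df2['FILM']
keys = df1["movie"] + " (" + df1["year"].astype(str) + ")"

# Keep the df2 rows whose FILM equals one of the keys exactly
res = df2[df2["FILM"].isin(keys)].copy()

# Drop the trailing " (YYYY)" and renumber the rows from 0
res["FILM"] = res["FILM"].str.replace(r"\s*\(\d{4}\)$", "", regex=True)
res = res.reset_index(drop=True)
print(res)
```

Note that isin is an exact comparison, so this only works when title and year are formatted identically in both frames.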

Answer 3

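smci's option 1 was almost there; the following worked for me:

df1['Votes'] = ''
df1['Votes'] = df1['movie'].apply(lambda title: next(iter(df2.loc[df2['FILM'].str.startswith(title), 'Votes']), None))

Explanation:

Create a Votes column in df1

Apply the lambda to each movie string in df1

The lambda looks up df2 and selects all rows of df2 where FILM starts with the movie title

Select the Votes column of the resulting subset of df2

Take the first value in this column with next(iter(...), None), which gives None when no row matched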
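The per-title lookup above can also be written as a join: pair every df1 row with every df2 row, then keep the pairs where FILM starts with the movie title. This is only a sketch, assuming pandas 1.2 or later for how="cross" and the question's sample data; the names pairs and joined are illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt inline so the sketch is self-contained
df1 = pd.DataFrame(
    {"movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
     "year": [2016, 2016, 2016, 2016],
     "ratings": [4.0, 4.0, 3.5, 4.5]},
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {"FILM": ["Avengers: Age of Ultron (2015)", "Cinderella (2015)", "Ant-Man (2015)",
              "Do You Believe? (2015)", "Max Steel (2016)"],
     "Votes": [4170, 950, 3000, 350, 560]}
)

# Cartesian product: every df1 row paired with every df2 row
pairs = df1.merge(df2, how="cross")

# Keep only the pairs where FILM starts with the movie title
keep = [film.startswith(movie) for film, movie in zip(pairs["FILM"], pairs["movie"])]
joined = pairs[keep].reset_index(drop=True)
print(joined)
```

The cross join grows as len(df1) * len(df2), so it is only practical for small frames.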

3 solutions

Solution 1
6 2018-09-10 22:07:42

Solution 2
2 Accepted 2018-09-10 21:06:33

Solution 3
0 2019-07-12 20:57:58
