Pandas/python join/merge two dataframes on a column of list
Let's consider two dataframes, Person and Movie:
dataframe Person
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| | nconst | primaryName | primaryProfession | knownForTitles |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 0 | nm0000103 | Fairuza Balk | actress,soundtrack | tt0181875,tt0089908,tt0120586,tt0115963 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 1 | nm0000106 | Drew Barrymore | producer,actress,soundtrack | tt0120888,tt0343660,tt0151738,tt0120631 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 2 | nm0000117 | Neve Campbell | actress,producer,soundtrack | tt0134084,tt1262416,tt0120082,tt0117571 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 3 | nm0000132 | Claire Danes | actress,producer,soundtrack | tt0274558,tt0108872,tt1796960,tt0117509 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 4 | nm0000138 | Leonardo DiCaprio | actor,producer,writer | tt0120338,tt0993846,tt1375666,tt0407887 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
dataframe Movie
+---+-----------+-----------+---------------------+-----------------------+
| | tconst | titleType | originalTitle | genres |
+---+-----------+-----------+---------------------+-----------------------+
| 0 | tt0192789 | movie | While Supplies Last | Comedy,Musical |
+---+-----------+-----------+---------------------+-----------------------+
| 1 | tt4914592 | movie | Electric Heart | Adventure,Drama,Music |
+---+-----------+-----------+---------------------+-----------------------+
| 2 | tt4999994 | movie | Rain Doll | Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 3 | tt2690572 | movie | Polaris | Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 4 | tt1562859 | movie | Golmaal 3 | Action,Comedy |
+---+-----------+-----------+---------------------+-----------------------+
As you can see, knownForTitles in Person is a list of tconst values from the Movie dataframe.
Question: how many actors have ever played in an action movie?
First, we create person as a DataFrame:
import pandas as pd

columns = ['nconst', 'primaryName', 'primaryProfession', 'knownForTitles']
data = [
('nm0000103', 'Fairuza Balk', 'actress,soundtrack', 'tt0181875,tt0089908,tt0120586,tt0115963'),
('nm0000106', 'Drew Barrymore', 'producer,actress,soundtrack', 'tt0120888,tt0343660,tt0151738,tt0120631'),
('nm0000117', 'Neve Campbell', 'actress,producer,soundtrack', 'tt0134084,tt1262416,tt0120082,tt0117571'),
('nm0000132', 'Claire Danes', 'actress,producer,soundtrack', 'tt0274558,tt0108872,tt1796960,tt0117509'),
('nm0000138', 'Leonardo DiCaprio', 'actor,producer,writer', 'tt0120338,tt0993846,tt1375666,tt0407887'),
]
person = pd.DataFrame(data=data, columns=columns)
Second, we split the comma-separated strings in two of the columns into lists:
for field in ['primaryProfession', 'knownForTitles']:
    person[field] = person[field].str.split(',')
Third, we use the explode function to turn each row into multiple rows, one per list element:
person = person.explode('knownForTitles').explode('primaryProfession')
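To see what explode does on its own, here is a minimal sketch on a hypothetical two-row frame (the names toy, name, and titles are made up for illustration and are not part of the original data). Note that the index label of the source row is repeated for every row it produces, which is why the output further down shows the labels 0, 0, 0, 0, 1.

```python
import pandas as pd

# Hypothetical toy frame, not part of the original data.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "titles": ["t1,t2", "t3"],
})

# Split the comma-separated string into a list, then give each element its own row.
toy["titles"] = toy["titles"].str.split(",")
exploded = toy.explode("titles")
print(exploded)

# The index label of the source row is kept, so label 0 appears twice.
# reset_index(drop=True) gives a fresh 0..n-1 index if that is preferred.
flat = exploded.reset_index(drop=True)
print(flat)
```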
Fourth, we select only the rows whose primary profession is actor or actress:
actor_actress = person[ person['primaryProfession'].isin(['actress', 'actor'])]
Now we have a dataframe in so-called tidy format (each cell holds a single value rather than a list):
nconst primaryName primaryProfession knownForTitles
0 nm0000103 Fairuza Balk actress tt0181875
0 nm0000103 Fairuza Balk actress tt0089908
0 nm0000103 Fairuza Balk actress tt0120586
0 nm0000103 Fairuza Balk actress tt0115963
1 nm0000106 Drew Barrymore actress tt0120888
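At this point, we can repeat these steps for the Movie dataframe, then join the actors (on knownForTitles) with the movies (on tconst).
Sorry for the length of this reply. The key to this approach is to first use str.split(',') and then explode() to convert the dataframe into a format suitable for join, merge, etc.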
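The remaining steps of this approach (exploding Movie, then joining) could be sketched as follows. This is only one way to finish the job, under a few assumptions: only two people and two movies from the question are used to keep it short, and because none of the tconst values in the question's Movie table appears in any knownForTitles list, "Golmaal 3" is given the hypothetical id tt0134084 (one of Neve Campbell's titles) so that the join returns a row. The names movie, action, joined, and n_actors are mine.

```python
import pandas as pd

# Two people from the question, prepared exactly as in the steps above.
person = pd.DataFrame({
    "nconst": ["nm0000117", "nm0000138"],
    "primaryName": ["Neve Campbell", "Leonardo DiCaprio"],
    "primaryProfession": ["actress,producer,soundtrack", "actor,producer,writer"],
    "knownForTitles": ["tt0134084,tt1262416,tt0120082,tt0117571",
                       "tt0120338,tt0993846,tt1375666,tt0407887"],
})
for field in ["primaryProfession", "knownForTitles"]:
    person[field] = person[field].str.split(",")
person = person.explode("knownForTitles").explode("primaryProfession")
actor_actress = person[person["primaryProfession"].isin(["actress", "actor"])]

# Movie frame; the tconst of "Golmaal 3" is changed (assumption) to force a match.
movie = pd.DataFrame({
    "tconst": ["tt4999994", "tt0134084"],
    "originalTitle": ["Rain Doll", "Golmaal 3"],
    "genres": ["Drama", "Action,Comedy"],
})
movie["genres"] = movie["genres"].str.split(",")
movie = movie.explode("genres")
action = movie[movie["genres"] == "Action"]

# The default inner join keeps only (actor, title) pairs whose title is an action movie.
joined = actor_actress.merge(action, left_on="knownForTitles", right_on="tconst")

# An actor may match several action titles, so count distinct nconst values.
n_actors = joined["nconst"].nunique()
print(joined[["primaryName", "originalTitle"]])
print(n_actors)
```

Counting distinct nconst values rather than rows avoids counting the same person twice when several of their titles are action movies.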
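I'm still learning pandas, so there is a good chance I'm going down the wrong path. That said, let's give it a go:
First, let's see if we can find all the rows in the Movie df that are action movies. Based on "Pandas dataframe select rows where a list-column contains any of a list of strings", I came up with this: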
Movies['isAction'] = [ 'Action' in x for x in Movies['genres'].tolist() ]
The result looks like this:
tconst titleType originalTitle genres isAction
0 tt0407887 movie WhileSuppliesLast [Comedy, Musical] False
1 tt1375666 movie ElectricHeart [Adventure, Drama, Music] False
2 tt4999994 movie RainDoll [Drama] False
3 tt2690572 movie Polaris [Drama] False
4 tt0134084 movie Golmaal3 [Action, Comedy] True
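I added the isAction column to the Movies df. I also changed some of the tconst values so that we can get some positive results (rows 0, 1 and 4 were changed).
I changed row 4 so that Neve Campbell shows up in the results.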
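One caveat worth spelling out: the output above shows genres as lists, so the column was presumably split beforehand, a step this answer does not show. A minimal sketch of why that matters, using made-up genre strings ("Live-Action" is invented purely to expose the difference): on a raw string, `in` is a substring test, while on a list it requires an exact element.

```python
import pandas as pd

# Hypothetical genres; "Live-Action" is invented to expose the difference.
Movies = pd.DataFrame({"genres": ["Action,Comedy", "Live-Action,Drama", "Drama"]})

# On raw strings, `in` is a substring test, so "Live-Action" also matches.
on_strings = ["Action" in x for x in Movies["genres"].tolist()]

# After splitting, `in` tests for an exact list element.
Movies["genres"] = Movies["genres"].str.split(",")
on_lists = ["Action" in x for x in Movies["genres"].tolist()]

print(on_strings)
print(on_lists)
```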
We can now generate the list of tconst values of the action movies:
listOfActionMovies = Movies[ Movies["isAction"] == True]["tconst"].tolist()
Now we use the solution from "Pandas dataframe select rows where a list-column contains any of a list of strings" again:
Person["inAction"] = pd.DataFrame(Person.knownForTitles.tolist()).isin(listOfActionMovies).any(axis=1)
This produces:
nconst primaryName primaryProfession knownForTitles inAction
0 nm0000103 FairuzaBalk [actress, soundtrack] [tt0181875, tt0089908, tt0120586, tt0115963] False
1 nm0000106 DrewBarrymore [producer, actress, soundtrack] [tt0120888, tt0343660, tt0151738, tt0120631] False
2 nm0000117 NeveCampbell [actress, producer, soundtrack] [tt0134084, tt1262416, tt0120082, tt0117571] True
3 nm0000132 ClaireDanes [actress, producer, soundtrack] [tt0274558, tt0108872, tt1796960, tt0117509] False
4 nm0000138 LeonardoDiCaprio [actor, producer, writer] [tt0120338, tt0993846, tt1375666, tt0407887] False
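The one-liner that builds inAction packs several steps together. Unpacked on a hypothetical two-row frame (all ids and names here are invented for illustration), it could look like the sketch below. The intermediate frame, called wide here, gets a fresh 0..n-1 index, so the assignment lines up only if Person also has the default index. The apply-based variant at the end is an alternative of my own that does not depend on that.

```python
import pandas as pd

# Hypothetical data: ids and names are invented for illustration.
Person = pd.DataFrame({
    "primaryName": ["A", "B"],
    "knownForTitles": [["t1", "t2"], ["t3"]],
})
listOfActionMovies = ["t2"]

# Step 1: one column per list position; shorter lists are padded with None.
wide = pd.DataFrame(Person["knownForTitles"].tolist())

# Step 2: element-wise membership test, then "any match in this row".
Person["inAction"] = wide.isin(listOfActionMovies).any(axis=1)

# Alternative: a row-wise test that keeps Person's own index.
action_set = set(listOfActionMovies)
alt = Person["knownForTitles"].apply(
    lambda titles: not action_set.isdisjoint(titles)
)

print(Person)
```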
Now we can finally count all the people who appeared in action movies:
len(Person[ Person["inAction"] == True ])
The len() solution comes from the question about getting a dataframe's row count based on a condition.
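Since inAction is a boolean column, the same count can also be obtained by summing it. A minimal sketch, with hypothetical flags standing in for the column computed earlier:

```python
import pandas as pd

# Hypothetical flags, standing in for the inAction column computed earlier.
Person = pd.DataFrame({"inAction": [False, False, True, False, False]})

# Filtering and taking the length, as in the answer.
count_len = len(Person[Person["inAction"]])

# True counts as 1, so summing the boolean column gives the same number.
count_sum = int(Person["inAction"].sum())

print(count_len, count_sum)
```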