
Pandas/python join/merge two dataframes on a column of list

Let's consider two dataframes: Person and Movie.

dataframe Person

+---+-----------+-------------------+-----------------------------+-----------------------------------------+
|   |    nconst |       primaryName |           primaryProfession |                          knownForTitles |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 0 | nm0000103 |      Fairuza Balk |          actress,soundtrack | tt0181875,tt0089908,tt0120586,tt0115963 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 1 | nm0000106 |    Drew Barrymore | producer,actress,soundtrack | tt0120888,tt0343660,tt0151738,tt0120631 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 2 | nm0000117 |     Neve Campbell | actress,producer,soundtrack | tt0134084,tt1262416,tt0120082,tt0117571 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 3 | nm0000132 |      Claire Danes | actress,producer,soundtrack | tt0274558,tt0108872,tt1796960,tt0117509 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 4 | nm0000138 | Leonardo DiCaprio |       actor,producer,writer | tt0120338,tt0993846,tt1375666,tt0407887 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+

dataframe Movie

+---+-----------+-----------+---------------------+-----------------------+
|   |    tconst | titleType |       originalTitle |                genres |
+---+-----------+-----------+---------------------+-----------------------+
| 0 | tt0192789 |     movie | While Supplies Last |        Comedy,Musical |
+---+-----------+-----------+---------------------+-----------------------+
| 1 | tt4914592 |     movie |      Electric Heart | Adventure,Drama,Music |
+---+-----------+-----------+---------------------+-----------------------+
| 2 | tt4999994 |     movie |           Rain Doll |                 Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 3 | tt2690572 |     movie |             Polaris |                 Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 4 | tt1562859 |     movie |           Golmaal 3 |         Action,Comedy |
+---+-----------+-----------+---------------------+-----------------------+

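As you can see, knownForTitles in Person is a list of tconst values from the Movie dataframe.

Questions:

  1. How can I compute how many actors have appeared in an action movie?
  2. How many actors have starred in more than one genre of movie?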

First, we create person as a DataFrame:

import pandas as pd

columns = ['nconst', 'primaryName', 'primaryProfession', 'knownForTitles']

data = [
('nm0000103',      'Fairuza Balk',          'actress,soundtrack', 'tt0181875,tt0089908,tt0120586,tt0115963'),
('nm0000106',    'Drew Barrymore', 'producer,actress,soundtrack', 'tt0120888,tt0343660,tt0151738,tt0120631'),
('nm0000117',     'Neve Campbell', 'actress,producer,soundtrack', 'tt0134084,tt1262416,tt0120082,tt0117571'),
('nm0000132',      'Claire Danes', 'actress,producer,soundtrack', 'tt0274558,tt0108872,tt1796960,tt0117509'),
('nm0000138', 'Leonardo DiCaprio',       'actor,producer,writer', 'tt0120338,tt0993846,tt1375666,tt0407887'),
]

person = pd.DataFrame(data=data, columns=columns)

Second, we split the comma-separated strings in two of the columns into lists:

for field in ['primaryProfession', 'knownForTitles']:
    person[field] = person[field].str.split(',')

Third, we use the explode function to turn each row into multiple rows, one per list element:

person = person.explode('knownForTitles').explode('primaryProfession')
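
One thing worth checking here is what chaining two explode calls does to the row count. The sketch below uses a hypothetical one-row frame (`demo`, not part of the original data) shaped like a row of `person`, and suggests that the result is the cross product of professions and titles, with every row keeping the original index label:

```python
import pandas as pd

# Hypothetical one-row frame, shaped like a single row of `person`
demo = pd.DataFrame({
    "nconst": ["nm0000138"],
    "primaryProfession": ["actor,producer,writer"],
    "knownForTitles": ["tt0120338,tt0993846"],
})

# Same split step as above: comma-separated strings become lists
for field in ["primaryProfession", "knownForTitles"]:
    demo[field] = demo[field].str.split(",")

exploded = demo.explode("knownForTitles").explode("primaryProfession")

# 2 titles x 3 professions = 6 rows, all carrying index label 0
print(len(exploded))
print(exploded.index.tolist())

# Optional: give every row its own label again
exploded = exploded.reset_index(drop=True)
```

Because a person is repeated once per (profession, title) pair, any later count of people should probably be done on distinct nconst values rather than on rows.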

Fourth, we select only the rows whose primary profession is actor or actress:

actor_actress = person[ person['primaryProfession'].isin(['actress', 'actor'])]

Now we have a dataframe in so-called tidy format (each cell holds a single value rather than a list):

    nconst     primaryName   primaryProfession knownForTitles
0   nm0000103  Fairuza Balk   actress          tt0181875
0   nm0000103  Fairuza Balk   actress          tt0089908
0   nm0000103  Fairuza Balk   actress          tt0120586
0   nm0000103  Fairuza Balk   actress          tt0115963
1   nm0000106  Drew Barrymore actress          tt0120888

At this point, we can repeat these steps for the Movie dataframe (splitting and exploding genres), then join actor_actress (on knownForTitles) with Movie (on tconst).
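
A minimal sketch of that join, under some assumptions: the sample Person and Movie tables in the question share no IDs, so three tconst values below are altered to match knownForTitles entries; only two people are included to keep it short; and question 2 is read as "actors whose matched movies cover more than one distinct genre". The variable names `joined`, `n_action` and `n_multi_genre` are illustrative.

```python
import pandas as pd

# Two people from the question, with list columns already split
person = pd.DataFrame({
    "nconst": ["nm0000117", "nm0000138"],
    "primaryName": ["Neve Campbell", "Leonardo DiCaprio"],
    "primaryProfession": [
        ["actress", "producer", "soundtrack"],
        ["actor", "producer", "writer"],
    ],
    "knownForTitles": [
        ["tt0134084", "tt1262416", "tt0120082", "tt0117571"],
        ["tt0120338", "tt0993846", "tt1375666", "tt0407887"],
    ],
})

# Assumption: rows 0, 1 and 4 use altered tconst values so that the join
# finds matches (the question's sample data has no overlapping IDs)
movie = pd.DataFrame({
    "tconst": ["tt0407887", "tt1375666", "tt4999994", "tt2690572", "tt0134084"],
    "originalTitle": ["While Supplies Last", "Electric Heart", "Rain Doll",
                      "Polaris", "Golmaal 3"],
    "genres": ["Comedy,Musical", "Adventure,Drama,Music", "Drama", "Drama",
               "Action,Comedy"],
})

# Tidy the person side and keep actors/actresses only
person = person.explode("knownForTitles").explode("primaryProfession")
actor_actress = person[person["primaryProfession"].isin(["actor", "actress"])]

# Tidy the movie side: one row per (movie, genre)
movie["genres"] = movie["genres"].str.split(",")
movie = movie.explode("genres")

# Inner join: one row per (person, known title, genre)
joined = actor_actress.merge(
    movie, left_on="knownForTitles", right_on="tconst", how="inner"
)

# Question 1: distinct actors with at least one Action title
n_action = joined.loc[joined["genres"] == "Action", "nconst"].nunique()

# Question 2: distinct actors whose matched titles span more than one genre
genres_per_actor = joined.groupby("nconst")["genres"].nunique()
n_multi_genre = int((genres_per_actor > 1).sum())

print(n_action, n_multi_genre)
```

Counting with nunique() on nconst rather than counting rows matters here, because each person appears once per matching title and genre after the join.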

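Sorry for the length of this reply. The key to this approach is to use str.split(',') first and then explode(), which converts the dataframe into a format suitable for join, merge, and so on.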

I am still learning pandas, so there is a good chance I am going about this the wrong way. That said, let's give it a go:

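First, let's see whether we can find all the rows in the Movie df that are action movies. Looking at "Pandas dataframe select rows where a list-column contains any of a list of strings", I came up with this: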

Movies['isAction'] = ['Action' in x for x in Movies['genres'].tolist()]

The result looks like this:

      tconst titleType      originalTitle                     genres  isAction
0  tt0407887     movie  WhileSuppliesLast          [Comedy, Musical]     False
1  tt1375666     movie      ElectricHeart  [Adventure, Drama, Music]     False
2  tt4999994     movie           RainDoll                    [Drama]     False
3  tt2690572     movie            Polaris                    [Drama]     False
4  tt0134084     movie           Golmaal3           [Action, Comedy]      True

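I added the isAction column to the Movies df. I also changed some of the tconst values so that we get some positive results (rows 0, 1 and 4 were changed).

I changed row 4 so that Neve Campbell shows up in the results.

We can now build the list of tconst values for the action movies: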

listOfActionMovies = Movies.loc[Movies["isAction"], "tconst"].tolist()

Now we use the solution from "Pandas dataframe select rows where a list-column contains any of a list of strings" again:

Person["inAction"] = pd.DataFrame(Person.knownForTitles.tolist()).isin(listOfActionMovies).any(axis=1)
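
One caveat with the line above: the temporary DataFrame built from the lists gets a fresh 0..n-1 index, and column assignment aligns on index labels, so it relies on Person still having its default index. A possible alternative that avoids this, sketched on a hypothetical two-row frame with a deliberately non-default index (the names `person` and `action_ids` are illustrative):

```python
import pandas as pd

person = pd.DataFrame(
    {
        "primaryName": ["Fairuza Balk", "Neve Campbell"],
        "knownForTitles": [
            ["tt0181875", "tt0089908", "tt0120586", "tt0115963"],
            ["tt0134084", "tt1262416", "tt0120082", "tt0117571"],
        ],
    },
    index=[10, 20],  # deliberately not 0..n-1
)

# The single action movie ID from the altered Movies data
action_ids = {"tt0134084"}

# True when a person's title list shares at least one ID with action_ids;
# apply() keeps the original index, so no alignment issue arises
person["inAction"] = person["knownForTitles"].apply(
    lambda titles: not action_ids.isdisjoint(titles)
)

print(person[["primaryName", "inAction"]])
```

Using a set for the action IDs also keeps each membership check cheap when the list of action movies is long.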

This produces:

      nconst       primaryName                primaryProfession                                knownForTitles  inAction
0  nm0000103       FairuzaBalk            [actress, soundtrack]  [tt0181875, tt0089908, tt0120586, tt0115963]     False
1  nm0000106     DrewBarrymore  [producer, actress, soundtrack]  [tt0120888, tt0343660, tt0151738, tt0120631]     False
2  nm0000117      NeveCampbell  [actress, producer, soundtrack]  [tt0134084, tt1262416, tt0120082, tt0117571]      True
3  nm0000132       ClaireDanes  [actress, producer, soundtrack]  [tt0274558, tt0108872, tt1796960, tt0117509]     False
4  nm0000138  LeonardoDiCaprio        [actor, producer, writer]  [tt0120338, tt0993846, tt1375666, tt0407887]     False

Now we can finally count all the people who appear in action movies:

len(Person[ Person["inAction"] == True ])

The len() solution comes from "Get the number of dataframe rows based on a condition".
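
As a small aside, a boolean column can also be counted by summing it, since True counts as 1. The sketch below rebuilds the inAction values shown in the output above as a standalone Series (`in_action` is an illustrative name) and compares both ways of counting:

```python
import pandas as pd

# inAction values copied from the output table above
in_action = pd.Series([False, False, True, False, False], name="inAction")

# Filter-then-len, as in the answer
count_via_len = len(in_action[in_action])

# Sum of booleans gives the same number without building a filtered copy
count_via_sum = int(in_action.sum())

print(count_via_len, count_via_sum)
```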

