Replace cell values in pandas with matching IDs from another dataframe

Question

I have a dataframe, stored with the pandas library, that contains a list of domains (or vertices/nodes in this case):

                 domain
0            airbnb.com
1          facebook.com
2                st.org
3              index.co
4        crunchbase.com
5               avc.com
6        techcrunch.com
7            google.com

I have another dataframe that contains the connections between these domains (also known as edges):

           source_domain    destination_domain
0             airbnb.com            google.com
1           facebook.com            google.com
2                 st.org          facebook.com
3                 st.org            airbnb.com
4                 st.org        crunchbase.com
5               index.co        techcrunch.com
6         crunchbase.com        techcrunch.com
7         crunchbase.com            airbnb.com
8                avc.com        techcrunch.com
9         techcrunch.com                st.org
10        techcrunch.com            google.com
11        techcrunch.com          facebook.com

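Because this dataset will grow much larger, I read that I could get better performance if the "edges" dataframe were represented using only integers instead of strings.

So I'm wondering whether there is a fast way to replace each cell in the edges dataframe with the corresponding ID from the domains (aka vertices) dataframe. Row 1 of the edges dataframe might then look like this: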

###### Before: ##################### 
1           facebook.com google.com   
###### After:  #####################   
1           1            7

How would I go about doing this? Thank you in advance.

Answer 1

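This is a good use case for categorical data: http://pandas.pydata.org/pandas-docs/stable/categorical.html

In short, a categorical Series represents each item internally as a number but displays it as a string. This is useful when you have a lot of repeated strings.

Using a categorical Series is easier and less error-prone than converting everything to integers manually.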
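As a minimal sketch of that idea (the series `s` below is made-up sample data, not the asker's frame), a categorical Series keeps the strings for display while `cat.codes` exposes the integers stored underneath:

```python
import pandas as pd

# Hypothetical sample data with repeated strings
s = pd.Series(["google.com", "st.org", "google.com", "st.org"], dtype="category")

# Each distinct string is stored once as a category
print(s.cat.categories.tolist())

# Every row holds only a small integer pointing at its category
print(s.cat.codes.tolist())
```

When no categories are given explicitly, pandas orders them by sorting the distinct values, so the codes here do not necessarily match any ID column you already have.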

Answer 2

I tried to implement the other answer: convert to Categorical, and use cat.codes for the ints:

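import pandas as pd

# .unique() can be omitted if the domains in df1 are always unique
# cats = df1['domain'].unique()
cats = df1['domain']
# astype('category', categories=...) is no longer accepted by current pandas;
# pass a CategoricalDtype built from the vertex list instead
cat_type = pd.CategoricalDtype(categories=cats)
df2['source_domain'] = df2['source_domain'].astype(cat_type)
df2['destination_domain'] = df2['destination_domain'].astype(cat_type)
df2['source_code'] = df2['source_domain'].cat.codes
df2['dest_code'] = df2['destination_domain'].cat.codes
print(df2)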
     source_domain destination_domain  source_code  dest_code
0       airbnb.com         google.com            0          7
1     facebook.com         google.com            1          7
2           st.org       facebook.com            2          1
3           st.org         airbnb.com            2          0
4           st.org     crunchbase.com            2          4
5         index.co     techcrunch.com            3          6
6   crunchbase.com     techcrunch.com            4          6
7   crunchbase.com         airbnb.com            4          0
8          avc.com     techcrunch.com            5          6
9   techcrunch.com             st.org            6          2
10  techcrunch.com         google.com            6          7
11  techcrunch.com       facebook.com            6          1

# Or overwrite the original columns with the codes directly
cat_type = pd.CategoricalDtype(categories=cats)
df2['source_domain'] = df2['source_domain'].astype(cat_type).cat.codes
df2['destination_domain'] = df2['destination_domain'].astype(cat_type).cat.codes
print(df2)
    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1

If you want to replace using a dict, use map:

d = dict(zip(df1.domain.values, df1.index.values))
df2['source_code'] = df2['source_domain'].map(d)
df2['dest_code'] = df2['destination_domain'].map(d)
print (df2)
     source_domain destination_domain  source_code  dest_code
0       airbnb.com         google.com            0          7
1     facebook.com         google.com            1          7
2           st.org       facebook.com            2          1
3           st.org         airbnb.com            2          0
4           st.org     crunchbase.com            2          4
5         index.co     techcrunch.com            3          6
6   crunchbase.com     techcrunch.com            4          6
7   crunchbase.com         airbnb.com            4          0
8          avc.com     techcrunch.com            5          6
9   techcrunch.com             st.org            6          2
10  techcrunch.com         google.com            6          7
11  techcrunch.com       facebook.com            6          1
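One thing worth checking with either approach is what happens to a domain that appears in the edges but not in the vertices. The sketch below uses a made-up `unknown.com` entry and a cut-down `df1`; as far as I know, `map` leaves such rows as `NaN` (which turns the column into floats), while `cat.codes` marks them with `-1`:

```python
import pandas as pd

# Cut-down vertices frame, for illustration only
df1 = pd.DataFrame({"domain": ["airbnb.com", "facebook.com"]})

# 'unknown.com' is a hypothetical domain that is absent from df1
src = pd.Series(["airbnb.com", "unknown.com", "facebook.com"])

# dict + map: the unmatched value becomes NaN
d = dict(zip(df1["domain"], df1.index))
mapped = src.map(d)

# categorical codes: the unmatched value becomes -1
cat_type = pd.CategoricalDtype(categories=df1["domain"].tolist())
codes = src.astype(cat_type).cat.codes

print(mapped.isna().tolist())
print(codes.tolist())
```

Checking `mapped.isna().any()` or `(codes == -1).any()` before using the IDs is a cheap way to catch vertices that are missing from the lookup.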

Answer 3

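The simplest way is to build a dictionary from the vertices dataframe (provided we can be sure it represents the definitive set of vertices that will appear in the edges) and use it with replace.

Since the index of the vertices dataframe already holds the factor information...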

m = dict(zip(vertices.domain, vertices.index))
edges.replace(m)

    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1

You can also use stack / map / unstack:

m = dict(zip(vertices.domain, vertices.index))
edges.stack().map(m).unstack()

    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1
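As a possible variation on the same dictionary idea (not something the steps above rely on), the mapping can also be applied one column at a time with `apply`. The frames below are small stand-ins for `vertices` and `edges`, kept short so the snippet runs on its own:

```python
import pandas as pd

# Small stand-in frames, assumed for illustration
vertices = pd.DataFrame({"domain": ["airbnb.com", "facebook.com", "st.org"]})
edges = pd.DataFrame({
    "source_domain": ["airbnb.com", "st.org"],
    "destination_domain": ["facebook.com", "airbnb.com"],
})

m = dict(zip(vertices.domain, vertices.index))

# Map every column through the same dictionary
per_column = edges.apply(lambda col: col.map(m))

# The stack / map / unstack route, for comparison
stacked = edges.stack().map(m).unstack()

print(per_column)
print(stacked)
```

Both routes go through `map`, so any domain missing from `m` would come back as `NaN` rather than being left as a string, which is what `replace` would do.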

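Editorial

Aside from providing my own information, I also wanted to comment on @JohnZwinck's answer.

First, categorical will provide faster performance. However, I'm not clear on a way to ensure that two columns have coordinated categories. By coordinated I mean that each column has the same set of integers assigned to each category behind the scenes. There may or may not be a known way to guarantee that those integers are the same. If we made it one big column and then converted that column to categorical, that would work... however, I believe it would turn back into object once we split it into two columns again.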
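For what it's worth, one way that might address the coordination concern is to build a single `CategoricalDtype` from the vertex list and cast both columns to it. This is only a sketch, assuming a pandas version that provides `pd.CategoricalDtype` (0.21 or later) and using small stand-in frames:

```python
import pandas as pd

# Small stand-in frames, assumed for illustration
vertices = pd.DataFrame({"domain": ["airbnb.com", "facebook.com", "st.org"]})
edges = pd.DataFrame({
    "source_domain": ["st.org", "airbnb.com"],
    "destination_domain": ["facebook.com", "st.org"],
})

# One dtype object shared by both columns, so the
# category -> integer assignment is the same in each
shared = pd.CategoricalDtype(categories=vertices["domain"].tolist())
edges = edges.astype(shared)

print(edges.dtypes)
print(edges["source_domain"].cat.codes.tolist())
print(edges["destination_domain"].cat.codes.tolist())
```

Because the categories are listed in the same order as the vertices index, the codes line up with the vertex IDs as long as that index is the default 0..n-1 range.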

Answer 1
2 2017-05-14 04:47:19

Answer 2
2 accepted 2017-05-14 05:10:28

Answer 3
2 2017-05-14 05:12:34
