Sort a column in a pandas DataFrame by rare values within groups

Question

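I have a pandas DataFrame of scraped websites that contains a website identifier, the website's text, and a label. A few websites have two labels, but since I want to train a single-label classifier first, I would like to create a version of the data with only one label per website (I know this is a bit hacky). The labels in my dataset are imbalanced (some occur very often, some very rarely). When I drop the duplicate website IDs, I want to drop the most common labels and keep the rarest one. This is what my dataset with multiple labels looks like: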

ID   Label   Text
1    a       some text
1    b       other text
1    a       data
2    a       words
2    c       more words
3    a       text
3    b       short text

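My idea is to sort the Label column within each website identifier by the rarity of the label. To do this, I would first call value_counts(ascending = True) on the Label column to get a list of all labels ordered from rarest to most common.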

to_sort = ['c', 'b', 'a']

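Then I would like to use that list to sort the rows of each website ID by rarity. I am not sure how to do that, though. The result should look like this: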

ID   Label   Text
1    b       other text
1    a       some text
1    a       data
2    c       more words
2    a       words
3    b       short text
3    a       text

After that, I would use df.drop_duplicates(subset = 'ID', keep = 'first') to keep the rarest label for each website. How can I do the sorting?

Answer 1

Use an ordered categorical, so that sort_values respects your custom order:

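import pandas as pd

# Labels ordered from rarest to most common
to_sort = list('cba')

# An ordered categorical makes sort_values follow to_sort instead of alphabetical order
df['Label'] = pd.Categorical(df['Label'], ordered=True, categories=to_sort)

df = df.sort_values(['ID','Label'])
print(df)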
   ID Label        Text
1   1     b  other text
0   1     a   some text
2   1     a        data
4   2     c  more words
3   2     a       words
6   3     b  short text
5   3     a        text
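
To finish the workflow described in the question, the sorted frame can be passed straight to drop_duplicates. The following is only a minimal sketch: it rebuilds the sample data from the question, and the variable name `rarest` is introduced here purely for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3],
    "Label": ["a", "b", "a", "a", "c", "a", "b"],
    "Text": ["some text", "other text", "data", "words",
             "more words", "text", "short text"],
})

# Rarest label first, as in the question's list.
to_sort = ["c", "b", "a"]
df["Label"] = pd.Categorical(df["Label"], categories=to_sort, ordered=True)

# Sort within each ID by rarity, then keep the first (rarest) row per ID.
# `rarest` is a hypothetical name, not something from the original post.
rarest = (
    df.sort_values(["ID", "Label"])
      .drop_duplicates(subset="ID", keep="first")
      .reset_index(drop=True)
)
print(rarest)
```

With the sample data this should leave one row per ID: label b for ID 1, c for ID 2, and b for ID 3.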

Answer 2

You can achieve this by making the Label column Categorical and then sorting by ID and Label. Let's see it in practice.

import pandas as pd
df = pd.DataFrame( {'ID': [1,1,1,2,2,3,3], "Label": ["a", "b", "a", "a", "c", "a", "b"],
                   'Text': ["some text", "other text","data", "words", "more words", "text", "short text"]} )
df
    ID  Label   Text
0   1   a   some text
1   1   b   other text
2   1   a   data
3   2   a   words
4   2   c   more words
5   3   a   text
6   3   b   short text

Define the order of the labels as follows:

to_sort = df.Label.value_counts(ascending = True).index
to_sort
Index(['c', 'b', 'a'], dtype='object')

Then make the Label column Categorical, like this:

df.Label = pd.Categorical(df.Label,categories = to_sort, ordered = True)
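
If it helps to see why this works, the sketch below inspects the ordering that the categorical carries. It uses a standalone Series with the question's labels; the names `labels`, `cat` and `codes` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# The labels from the question's sample data.
labels = pd.Series(["a", "b", "a", "a", "c", "a", "b"])

# Rarest label first: c occurs once, b twice, a four times.
to_sort = labels.value_counts(ascending=True).index

cat = pd.Series(pd.Categorical(labels, categories=to_sort, ordered=True))

# Each value is stored as an integer code equal to its position in to_sort,
# so sorting the column sorts by rarity rather than alphabetically.
codes = cat.cat.codes.tolist()
print(list(cat.cat.categories))  # category order used for sorting
print(codes)                     # lower code = rarer label
print(cat.min())                 # the rarest label present
```

Note that the sample data has no ties; if two labels had the same count, their relative order in value_counts would be decided by pandas, so it may be worth breaking ties explicitly in that case.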

Finally, sort by ID and Label:

df.sort_values(["ID", "Label"]).reset_index(drop = True)

    ID  Label   Text
0   1   b   other text
1   1   a   some text
2   1   a   data
3   2   c   more words
4   2   a   words
5   3   b   short text
6   3   a   text
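
As a possible alternative that skips the Categorical conversion, one could map each label to its overall frequency and pick the least frequent row per ID. This is a different technique from the one in the answers above, shown only as a sketch; `freq`, `rarest_idx` and `rarest` are hypothetical names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3],
    "Label": ["a", "b", "a", "a", "c", "a", "b"],
    "Text": ["some text", "other text", "data", "words",
             "more words", "text", "short text"],
})

# How often each row's label occurs in the whole dataset.
freq = df["Label"].map(df["Label"].value_counts())

# For each ID, take the index of the row whose label is least frequent.
rarest_idx = freq.groupby(df["ID"]).idxmin()

# Select those rows; the Label column keeps its original dtype.
rarest = df.loc[rarest_idx].reset_index(drop=True)
print(rarest)
```

This goes directly to one row per ID without sorting the whole frame, at the cost of not producing the intermediate sorted view the question asked for.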

Answer 1
0 Accepted 2018-11-12 15:19:42

Answer 2
0 2018-11-12 15:53:13
