pandas - Idiomatic way to stratify counts based on substring occurrence

Question

Suppose I have a DataFrame like this:

import pandas as pd
df=pd.DataFrame({'name': ['john','jack','jill','al','zoe','jenn','ringo','paul','george','lisa'], 'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried', 'really worried', 'excited', 'not that worried', 'not that excited', 'nervous', 'nervous']})

      how do you feel?    name
0              excited    john
1          not excited    jack
2  excited and nervous    jill
3              worried      al
4       really worried     zoe
5              excited    jenn
6     not that worried   ringo
7     not that excited    paul
8              nervous  george
9              nervous    lisa

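I am interested in the counts, but stratified into three categories: 'excited', 'worried' and 'nervous'.

The catch is that 'excited and nervous' should be grouped with 'excited'. In fact, any string containing 'excited' should be included in that group, except strings such as 'not that excited' and 'not excited'. The same logic applies to 'worried' and 'nervous'. (Note that 'excited and nervous' actually belongs to both the 'excited' and the 'nervous' group.)

As you can see, a typical groupby won't work here, and the string search has to be flexible.

I have a solution, but I wonder whether any of you can find a better way, one that is more Pythonic and/or uses more appropriate methods that I may not be aware of.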

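Here is my solution:

Define a function that returns the count of rows that contain the desired substring and do not contain the substring that negates the sentiment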

def get_perc(df, column_label, str_include, str_exclude):
    # Keep rows that contain str_include but do not contain str_exclude.
    col = df[column_label]
    data = col[(~col.str.contains(str_exclude, case=False)) &
               (col.str.contains(str_include, case=False))]

    # Number of matching rows.
    num = data.count()

    return num

Then call this function inside a loop, passing in the various 'str.contains' arguments, and collect the results in another DataFrame.

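groups=['excited', 'worried', 'nervous']
# Must match the DataFrame's column name exactly (lowercase 'how').
column_label='how do you feel?'

data=pd.DataFrame([], columns=['group','num'])
for str_include in groups:
    num=get_perc(df, column_label, str_include, 'not|neither')
    tmp=pd.DataFrame([{'group': str_include,'num': num}])
    # ignore_index=True gives the 0, 1, 2 index shown in the output below.
    data=pd.concat([data, tmp], ignore_index=True)

data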

      group    num
0   excited      3
1   worried      2
2   nervous      3

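Can you think of a cleaner way to do this? I did try a regular expression in 'str.contains' to avoid needing two boolean Series and the '&'. However, I could not do it without a capture group, which meant I had to use 'str.extract', and that does not seem to let me select the data in the same way.

Any help is much appreciated.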
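For reference, the single-regex idea raised in the question can be sketched with a negative lookahead, which excludes negated strings without any capture group. This is only an illustration under the question's own assumptions ('not' and 'neither' as the negating words); the names pattern and counts are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'jack', 'jill', 'al', 'zoe', 'jenn', 'ringo', 'paul', 'george', 'lisa'],
    'how do you feel?': ['excited', 'not excited', 'excited and nervous', 'worried',
                         'really worried', 'excited', 'not that worried',
                         'not that excited', 'nervous', 'nervous'],
})

col = df['how do you feel?']
groups = ['excited', 'worried', 'nervous']

# ^(?!.*\b(?:not|neither)\b) is a negative lookahead anchored at the start:
# it rejects any string that contains "not" or "neither" as a whole word.
# (?:...) is non-capturing, so str.contains has no match groups to complain about.
pattern = r'^(?!.*\b(?:not|neither)\b).*{}'

# One boolean Series per group; summing it counts the matching rows.
counts = pd.Series(
    {g: col.str.contains(pattern.format(g), case=False).sum() for g in groups}
)
print(counts)
```

Because the lookahead does the exclusion, each group needs only one 'str.contains' call and no '&' between two masks.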

Answer 1

You can do it like this:

Method 1

Ignore the 'not' rows, then
get the relevant groups from the dummy indicator columns.

In [140]: col = 'how do you feel?'

In [141]: groups = ['excited', 'worried', 'nervous']

In [142]: df.loc[~df[col].str.contains('not '), col].str.get_dummies(sep=' ')[groups].sum()
Out[142]:
excited    3
worried    2
nervous    3
dtype: int64

Method 2

In [162]: dfs = df['how do you feel?'].str.get_dummies(sep=' ')

In [163]: dfs.loc[~dfs['not'].astype(bool), groups].sum()
Out[163]:
excited    3
worried    2
nervous    3
dtype: int64
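Both methods split on spaces, so they depend on each group word appearing as a separate space-delimited token. If that cannot be assumed, a substring-based variant might look like the sketch below. It is not part of the original answer; feelings, hits and negated are illustrative names, and 'not'/'neither' are taken from the question as the negating words.

```python
import pandas as pd

feelings = pd.Series(['excited', 'not excited', 'excited and nervous', 'worried',
                      'really worried', 'excited', 'not that worried',
                      'not that excited', 'nervous', 'nervous'])
groups = ['excited', 'worried', 'nervous']

# One boolean column per group: True where the group word occurs as a substring.
hits = pd.DataFrame({g: feelings.str.contains(g, case=False) for g in groups})

# Rows that carry a negating word are dropped before summing.
negated = feelings.str.contains(r'\b(?:not|neither)\b', case=False)
counts = hits[~negated].sum()
print(counts)
```

Summing boolean columns counts the True values, so a row such as 'excited and nervous' contributes to both groups.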

Answer 2

You can simply provide a mapping and then group by the new Series that the mapping produces.

map_dict = {'excited and nervous':'excited', 'not that excited':'not excited', 
            'really worried':'worried', 'not that worried':'not worried'}
df.groupby(df['how do you feel?'].replace(map_dict)).size()

Output:

how do you feel?
excited        3
nervous        2
not excited    2
not worried    1
worried        2
dtype: int64
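Note that a mapping places each row in exactly one group, so 'excited and nervous' is counted under 'excited' only and 'nervous' comes out as 2 rather than the 3 the question asks for. If double membership matters, one possible sketch (not from the original answer, and assuming pandas 0.25 or later for Series.explode) is to extract every group word per row and count those instead. The names feelings, negated and matches are illustrative.

```python
import pandas as pd

feelings = pd.Series(['excited', 'not excited', 'excited and nervous', 'worried',
                      'really worried', 'excited', 'not that worried',
                      'not that excited', 'nervous', 'nervous'])

# Drop negated rows first, as in the question.
negated = feelings.str.contains(r'\b(?:not|neither)\b', case=False)

# findall returns a list of every group word found in each string;
# explode gives each list element its own row, so a string that
# mentions two groups is counted once in each of them.
matches = feelings[~negated].str.findall(r'excited|worried|nervous').explode()
counts = matches.value_counts()
print(counts)
```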
