Finding pairs of rows with matching column substrings in a pandas dataframe

Question

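I have a dataframe with several columns. One of them is named 'log_text'. I want to find pairs of rows with matching strings in this column.

For example, if 'log_text' contains these strings: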

 Device remove ID#xxx  
 Device remove ID#yyy  
 Device remove ID#zzz  
 Device arrive ID#xxx  
 Device arrive ID#yyy 
 Device arrive ID#zzz

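Goal: I want to get the rows containing 'Device remove ID#xxx' and 'Device arrive ID#xxx' and be able to process their other columns, then repeat this for the rows containing 'Device remove ID#yyy' and 'Device arrive ID#yyy', and so on.

What I tried was using iterrows(): find the ID# of the current row, drop that row from the table, then find the first row containing the matching ID# string.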

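    stash = {}  # index of the current row -> DataFrame of rows sharing its ID
    remaining = temp_df.copy()  # drop from a copy, not from the frame being iterated
    for index, row in temp_df.iterrows():
        if index not in remaining.index:
            continue  # already stashed with an earlier row
        log_string = row['log_text']
        id_text = log_string.partition("ID#")[2].strip()
        # regex=False matches the ID literally; this also picks up the current row
        match = remaining[remaining['log_text'].str.contains("ID#" + id_text, regex=False)]
        stash[index] = match  # keep the paired rows together
        remaining = remaining.drop(match.index)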

Answer 1

You can use pandas.Series.str.split and pandas.DataFrame.groupby:

In [10]: df = pd.DataFrame({'log':['Device remove ID#xxx',
    ...:                           'Device remove ID#yyy',
    ...:                           'Device remove ID#zzz',
    ...:                           'Device arrive ID#xxx',
    ...:                           'Device arrive ID#yyy',
    ...:                           'Device arrive ID#zzz',],
    ...:                    'other_row':[1,2,3,42,54,6]})

In [11]: df
Out[11]:
                    log  other_row
0  Device remove ID#xxx          1
1  Device remove ID#yyy          2
2  Device remove ID#zzz          3
3  Device arrive ID#xxx         42
4  Device arrive ID#yyy         54
5  Device arrive ID#zzz          6

In [14]: df_splits = df['log'].str.split(expand=True)

In [16]: df['action'] = df_splits[1]

In [17]: df['user'] = df_splits[2]

In [18]: df
Out[18]:
                    log  other_row  action    user
0  Device remove ID#xxx          1  remove  ID#xxx
1  Device remove ID#yyy          2  remove  ID#yyy
2  Device remove ID#zzz          3  remove  ID#zzz
3  Device arrive ID#xxx         42  arrive  ID#xxx
4  Device arrive ID#yyy         54  arrive  ID#yyy
5  Device arrive ID#zzz          6  arrive  ID#zzz


In [22]: for i, d in df.groupby('user'):
    ...:     print(i)
    ...:     print(d)
    ...:     print(d['other_row'].sum())
    ...:     print()
    ...:
    ...:
ID#xxx
                    log  other_row  action    user
0  Device remove ID#xxx          1  remove  ID#xxx
3  Device arrive ID#xxx         42  arrive  ID#xxx
43

ID#yyy
                    log  other_row  action    user
1  Device remove ID#yyy          2  remove  ID#yyy
4  Device arrive ID#yyy         54  arrive  ID#yyy
56

ID#zzz
                    log  other_row  action    user
2  Device remove ID#zzz          3  remove  ID#zzz
5  Device arrive ID#zzz          6  arrive  ID#zzz
9
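
For anyone on Python 3, the same idea can be sketched as a self-contained script. This is only a minimal sketch, not the exact code above: it assumes the column is named `log_text` as in the question, swaps `str.split` for `str.extract` with a regular expression, and the names `other_col`, `device_id`, and `pairs` are made up for illustration.

```python
import pandas as pd

# Sample data modelled on the question; the 'other_col' values are illustrative.
df = pd.DataFrame({
    "log_text": [
        "Device remove ID#xxx",
        "Device remove ID#yyy",
        "Device remove ID#zzz",
        "Device arrive ID#xxx",
        "Device arrive ID#yyy",
        "Device arrive ID#zzz",
    ],
    "other_col": [1, 2, 3, 42, 54, 6],
})

# Pull out the action word and the ID; expand=True gives one column per capture group.
parts = df["log_text"].str.extract(r"Device (\w+) ID#(\w+)", expand=True)
df["action"] = parts[0]
df["device_id"] = parts[1]

# Each group holds every row that shares an ID (here, one remove and one arrive).
pairs = {device_id: group for device_id, group in df.groupby("device_id")}

for device_id, group in pairs.items():
    print(device_id, group["other_col"].sum())
```

Keeping the groups in a dict means a pair can be looked up later by its ID, e.g. `pairs["xxx"]`, rather than only inside the loop.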

Answer 2

IIUC,

I think you can use .str.count and .loc for further manipulation.

For example:

rows_to_filter = ['Device remove ID#xxx','Device remove ID#yyy',
'Device remove ID#zzz','Device arrive ID#xxx',
'Device arrive ID#yyy','Device arrive ID#zzz']

# >= 1: each sample row contains exactly one of the strings, so > 1 would select nothing
df.loc[df['log_text'].str.count('|'.join(rows_to_filter)) >= 1, 'col'] = 'do something'

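This selects the slice of the dataframe where any of the strings in the list above appears in a given row. You may need to modify the logic, since without sample output I'm not 100% sure what you need.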
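
To see what such a mask selects, here is a minimal sketch. It assumes the `log_text` column from the question; the 'Device reboot' row and the `matched` column name are hypothetical, and `re.escape` is added as a precaution in case a search string contains regex metacharacters.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "log_text": [
        "Device remove ID#xxx",
        "Device arrive ID#xxx",
        "Device reboot ID#xxx",  # hypothetical row that should not match
    ],
})

rows_to_filter = ["Device remove ID#xxx", "Device arrive ID#xxx"]

# Build one alternation pattern; escape each literal so it is matched verbatim.
pattern = "|".join(re.escape(s) for s in rows_to_filter)

# str.count returns the number of pattern matches in each row.
counts = df["log_text"].str.count(pattern)

# Flag the rows where at least one of the strings occurs.
df["matched"] = counts >= 1
print(df)
```

Note that this only flags rows; it does not by itself pair a 'remove' row with its 'arrive' row.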

Answer 3

If you need to keep the original column and only want to sort on the last 3 characters, you can create a separate column for that purpose.

df1['group'] = df1['log_text'].str[-3:]

This creates a new column, 'group', that holds only the last three characters of each 'log_text' value.
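
Once such a key column exists, one possible way to put each remove/arrive pair side by side is to merge the two halves on it. This is a minimal sketch, assuming every ID is exactly three characters long (as in the sample data) and that each ID has one 'remove' and one 'arrive' row; `other_col`, `removed`, `arrived`, and `paired` are hypothetical names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "log_text": [
        "Device remove ID#xxx",
        "Device remove ID#yyy",
        "Device arrive ID#xxx",
        "Device arrive ID#yyy",
    ],
    "other_col": [1, 2, 42, 54],  # illustrative values
})

# Key column: the last three characters of each log line.
df1["group"] = df1["log_text"].str[-3:]

# Split by action, then join the two halves on the shared key.
removed = df1[df1["log_text"].str.contains("remove", regex=False)]
arrived = df1[df1["log_text"].str.contains("arrive", regex=False)]
paired = removed.merge(arrived, on="group", suffixes=("_remove", "_arrive"))

print(paired[["group", "other_col_remove", "other_col_arrive"]])
```

If the IDs can vary in length, the fixed slice will no longer isolate them, and a regular expression or `str.partition("ID#")` would be a safer way to build the key.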

3 solutions

Solution 1: score 2, accepted, 2019-10-16 21:26:27

Solution 2: score 1, 2019-10-16 21:30:26

Solution 3: score 0, 2019-10-16 21:28:44
