Pandas - Grouping messages into conversations by linking multiple rows via IDs based on two columns

Question

I have some data that tracks tweets and responses based on source_id and response_id (the response ID column is named reply_id in the sample below). The source_id could be associated with an original post or with a response that has its own response. If there are multiple responses, then each response will have its own source_id, and that source_id will appear in the reply_id of the message that responds to it.

Take this dataframe for example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-10-02', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03'],
    'id': ['334', '335', '336', '337', '338', '340', '341', '343', '358'],
    'source_id': ['830', '636', '657', '569', '152', '975', '984', '720', '524'],
    'reply_id': [np.nan, '495', '636', '657', '569', '830', '152', np.nan, np.nan]
})

And its output:

         date   id source_id reply_id
0  2018-10-02  334       830      NaN
1  2018-10-03  335       636      495
2  2018-10-03  336       657      636
3  2018-10-03  337       569      657
4  2018-10-03  338       152      569
5  2018-10-03  340       975      830
6  2018-10-03  341       984      152
7  2018-10-03  343       720      NaN
8  2018-10-03  358       524      NaN

Each row contains data for a single message. There is a unique ID for the message whether it's a tweet or a response to a tweet. In this sample, there are two "conversations" with one or more responses to an original post, and two standalone tweets with no responses. The tweets with no responses are df.iloc[7] and df.iloc[8]: both have NaN in reply_id, and their source_ids do not appear in the reply_id of any other row. While df.iloc[0] also has NaN in reply_id, its source_id appears in the reply_id of df.iloc[5], so those two rows would be counted as one conversation.
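The standalone rule described above can be checked directly. This is only a minimal sketch under the assumptions stated in the question (the date column is left out for brevity, and the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same sample data as in the question, without the date column.
df = pd.DataFrame({
    'id': ['334', '335', '336', '337', '338', '340', '341', '343', '358'],
    'source_id': ['830', '636', '657', '569', '152', '975', '984', '720', '524'],
    'reply_id': [np.nan, '495', '636', '657', '569', '830', '152', np.nan, np.nan],
})

# A row is standalone when it replies to nothing AND nothing replies to it.
replies_to_nothing = df['reply_id'].isna()
has_no_replies = ~df['source_id'].isin(df['reply_id'].dropna())
standalone = df[replies_to_nothing & has_no_replies]

print(standalone.index.tolist())  # [7, 8]
```

Row 0 is excluded because its source_id (830) shows up as the reply_id of row 5.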

What I'm really struggling with is how to chain together a series of tweets/responses such as df.iloc[1], df.iloc[2], df.iloc[3], df.iloc[4], and df.iloc[6], and count all of that as one conversation. For this particular conversation, there is no data available for the original post, so there is no row with source_id = 495.

Does anyone have any idea on how to approach this?

Answer 1

From my understanding, this is more of a network problem, so we can use networkx:

import networkx as nx

# Treat each (reply_id, source_id) pair as an edge; rows without a reply_id are dropped.
G = nx.from_pandas_edgelist(df.dropna(), 'reply_id', 'source_id')
# Each connected component of the graph is one conversation.
l = list(nx.connected_components(G))
newdf = pd.DataFrame(l)
newdf
Out[334]: 
     0    1     2     3     4     5
0  975  830  None  None  None  None
1  984  495   636   152   569   657
# All the IDs that belong to one group end up in the same row.
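For readers who want to see what the connected-components step does without networkx, here is a rough pure-Python sketch using a union-find structure. It is not the answer's method, just an illustration; the edge list is copied by hand from the rows of the sample that have a reply_id, and the helper names are made up for this example.

```python
# (reply_id, source_id) pairs for the six rows that have a reply_id.
edges = [('495', '636'), ('636', '657'), ('657', '569'),
         ('569', '152'), ('830', '975'), ('152', '984')]

parent = {}

def find(x):
    """Return the representative of x's group, registering x if new."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps chains short
        x = parent[x]
    return x

def union(a, b):
    """Merge the groups that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for reply_id, source_id in edges:
    union(reply_id, source_id)

# Collect the members of each group under its representative.
groups = {}
for node in list(parent):
    groups.setdefault(find(node), set()).add(node)

conversations = sorted(groups.values(), key=len)
print(len(conversations))  # 2
```

The two resulting sets hold the same IDs as the two rows of newdf above.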

In more detail: rows that belong to the same group now get the same ID.

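# Map every node to the index of the connected component it belongs to.
d = [dict.fromkeys(y, x) for x, y in enumerate(nx.connected_components(G))]
d = {k: v for element in d for k, v in element.items()}
# Look up the group index for every row that has a reply_id.
ids = df.reply_id.dropna().map(d)
ids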
Out[344]: 
1    1
2    1
3    1
4    1
5    0
6    1
Name: reply_id, dtype: int64
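Note that mapping reply_id leaves out df.iloc[0], which has no reply_id but still belongs to a conversation. One possible variation (an assumption on my part, not something the answer states) is to map source_id instead, since every message in a conversation has its source_id in the graph. The column name conversation and the variable node_to_group are made up for this sketch, and the date column is left out for brevity.

```python
import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['334', '335', '336', '337', '338', '340', '341', '343', '358'],
    'source_id': ['830', '636', '657', '569', '152', '975', '984', '720', '524'],
    'reply_id': [np.nan, '495', '636', '657', '569', '830', '152', np.nan, np.nan],
})

# Build the graph from the rows that reply to something.
G = nx.from_pandas_edgelist(df.dropna(subset=['reply_id']), 'reply_id', 'source_id')

# Node -> component index; the numbering itself is arbitrary.
node_to_group = {node: i
                 for i, component in enumerate(nx.connected_components(G))
                 for node in component}

# Standalone tweets are not in the graph, so they map to NaN.
df['conversation'] = df['source_id'].map(node_to_group)

# Number of messages per conversation.
sizes = df.groupby('conversation').size()
print(sorted(sizes.tolist()))  # [2, 5]
```

Rows 7 and 8 end up with NaN in conversation, which marks them as standalone.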

Solution 1
1 2019-03-14 01:28:54
