I have some data that tracks tweets and responses based on source_id and reply_id. The source_id could be associated with an original post or a response that has its own response. If there are multiple responses, then each response will have a source_id and that source_id will appear in the reply_id of the corresponding response.
Take this dataframe for example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-10-02', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03', '2018-10-03'],
    'id': ['334', '335', '336', '337', '338', '340', '341', '343', '358'],
    'source_id': ['830', '636', '657', '569', '152', '975', '984', '720', '524'],
    'reply_id': [np.nan, '495', '636', '657', '569', '830', '152', np.nan, np.nan]
})
And its output:
         date   id source_id reply_id
0  2018-10-02  334       830      NaN
1  2018-10-03  335       636      495
2  2018-10-03  336       657      636
3  2018-10-03  337       569      657
4  2018-10-03  338       152      569
5  2018-10-03  340       975      830
6  2018-10-03  341       984      152
7  2018-10-03  343       720      NaN
8  2018-10-03  358       524      NaN
Each row contains data for a single message. There is a unique ID for the message whether it's a tweet or a response to a tweet. In this sample, there are two "conversations" with one or more responses to an original post and two standalone tweets with no responses. The tweets with no responses are df.iloc[7] and df.iloc[8], both of which have NaN in reply_id, and their source_ids do not appear in the reply_id of any other row. While df.iloc[0] has NaN in reply_id, its source_id appears in the reply_id of df.iloc[5], so that would be counted as one conversation.
What I'm really struggling with is how to chain together a series of tweets/responses such as df.iloc[1], df.iloc[2], df.iloc[3], df.iloc[4], and df.iloc[6] and count all of that as one conversation. For this particular conversation, there is no data available for the original post, so there is no row with source_id = 495.
Does anyone have any idea on how to approach this?
As I understand it, this is more of a network (graph) problem, so we can use networkx:
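import networkx as nx

# Each reply is an edge between reply_id and source_id; dropna() removes the rows with NaN (here, no reply_id)
G = nx.from_pandas_edgelist(df.dropna(), 'reply_id', 'source_id')
# Each connected component of the graph is one conversation
l = list(nx.connected_components(G))
newdf = pd.DataFrame(l)
newdf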
Out[334]:
     0    1     2     3     4     5
0  975  830  None  None  None  None
1  984  495   636   152   569   657
# All values that belong to the same group appear on the same row
In more detail, the following gives every row in the same group the same id:
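# Map every node to the index of its connected component
d = [dict.fromkeys(y, x) for x, y in enumerate(nx.connected_components(G))]
d = {k: v for element in d for k, v in element.items()}
# Label each row that has a reply_id with its component index
ids = df.reply_id.dropna().map(d)
ids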
Out[344]:
1 1
2 1
3 1
4 1
5 0
6 1
Name: reply_id, dtype: int64
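Note that the ids above cover only rows that have a reply_id, so rows 0, 7, and 8 get no label. As a rough, dependency-free sketch of the same connected-components idea (this is not how networkx implements it), a small union-find can label every row and count the conversations. The names rows, find, union, conversation_of, and n_conversations are made up for illustration, and None stands in for NaN.

```python
# Rows from the question as (id, source_id, reply_id); None stands in for NaN.
rows = [
    ("334", "830", None),
    ("335", "636", "495"),
    ("336", "657", "636"),
    ("337", "569", "657"),
    ("338", "152", "569"),
    ("340", "975", "830"),
    ("341", "984", "152"),
    ("343", "720", None),
    ("358", "524", None),
]

parent = {}  # node -> parent node in the union-find forest


def find(x):
    """Return the representative of x's group, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(a, b):
    """Merge the groups that contain a and b."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra


for _msg_id, source_id, reply_id in rows:
    find(source_id)  # register every message, even one without a reply_id
    if reply_id is not None:
        union(source_id, reply_id)

# Number the groups in order of first appearance and label each row by its id.
labels = {}
conversation_of = {}
for msg_id, source_id, _reply_id in rows:
    root = find(source_id)
    conversation_of[msg_id] = labels.setdefault(root, len(labels))

# Count rows per group; a group with a single row is a standalone tweet.
sizes = {}
for label in conversation_of.values():
    sizes[label] = sizes.get(label, 0) + 1
n_conversations = sum(1 for size in sizes.values() if size > 1)

print(conversation_of)
print(n_conversations)
```

With the sample data, ids 334 and 340 share one label, ids 335, 336, 337, 338, and 341 share another, and 343 and 358 each get their own, which matches the two conversations and two standalone tweets described in the question.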