
Grouping rows in a Pandas DataFrame

Original 2022-05-30 12:15:37 4 1 python/ pandas/ twitter

We're working with a very large Twitter database containing around 4.9 million entries. Every entry is either a tweet or a reply to a tweet (or, of course, a reply to a reply). Because the data was collected through the Twitter API, tweets and their replies are not neatly grouped in the DataFrame; many unrelated entries sit between them.

We are trying to group the tweets with their corresponding replies so that we can perform sentiment analysis on each conversation, but this is where we are stuck. We started by inverting the DataFrame, as it is easier to search from the last reply back to the original tweet than the other way around.

Now we'll be using the column id (the ID of the tweet itself) and the column in_reply_to_status_id (the ID of the tweet that the entry replies to).

In essence, we want to create some kind of for loop that detects the first row where in_reply_to_status_id is an integer and then links that row to the tweet or reply it responds to by matching the value against the id column. The process has to continue until it finds a row where in_reply_to_status_id is None, as this means we have found the original tweet (a tweet evidently cannot be a reply to something).

So the first entry here would have in_reply_to_status_id = 1244694453190897664. We store this entry and use that value to search for its "original" tweet. That tweet in turn has an in_reply_to_status_id of 1243885949697888263, so we store this entry as well and look for its original tweet using the new in_reply_to_status_id. We want to continue this process until we arrive at an entry where in_reply_to_status_id is None, as this marks the end of a conversation.
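The walk described above can be sketched with a plain dictionary lookup instead of scanning the DataFrame row by row. This is only a minimal sketch under assumptions: the small IDs are made up for illustration, `chain_to_original` is a hypothetical helper name, and the ID columns are assumed to be stored with pandas' nullable `Int64` dtype so that missing values do not turn the IDs into floats (float64 cannot represent every 64-bit tweet ID exactly).

```python
import pandas as pd

# Toy data with made-up small IDs; <NA> marks an original tweet.
df = pd.DataFrame({
    "id": pd.array([10, 11, 12, 20, 21], dtype="Int64"),
    "in_reply_to_status_id": pd.array([None, 10, 11, None, 20], dtype="Int64"),
})

# Map each tweet ID to the ID it replies to (None for original tweets).
parent = {
    int(i): (None if pd.isna(p) else int(p))
    for i, p in zip(df["id"], df["in_reply_to_status_id"])
}


def chain_to_original(tweet_id, parent):
    """Return [tweet_id, the tweet it replies to, ..., the original tweet]."""
    chain = [tweet_id]
    seen = {tweet_id}
    while True:
        nxt = parent.get(chain[-1])
        # Stop at an original tweet, at a parent that was never collected,
        # or if the data happens to contain a cycle.
        if nxt is None or nxt not in parent or nxt in seen:
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


print(chain_to_original(12, parent))
```

Building the dictionary once makes each step a constant-time lookup, which matters with millions of rows; filtering the DataFrame inside the loop would rescan the whole table for every reply.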

Would anyone have any ideas on how to start on such an operation?

1 answer

This seems like a fairly hard operation, and I think I understand only part of your issue. In my opinion, you should first group the rows by "in_reply_to_user" or "in_reply_to_status" (I don't know the exact difference between the two). After that, check whether the id of each row where "in_reply_to_status" is None appears in the "in_reply_.." column of any other row. This way you first collect only the "head" tweets, the ones the others point to, and then verify whether any of them has replies. Next, search for each id recursively in the "in reply to" columns until no link points any further. You could recursively build a list/tuple/dict to which you append the links and the tweets they link to (think of it as a graph), and then build a new DataFrame from the batches containing the ids/indexes of the nodes in each conversation. This is my take; I hope it helps!
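One way to read the graph idea above is to label every row with the ID of the "head" tweet of its conversation and then group on that label. The following is a rough sketch, not a definitive implementation: the toy IDs and texts are invented, `find_root`, `root_cache`, and the `conversation_id` column are hypothetical names, and the loop is written iteratively (with a cache) rather than recursively to avoid deep recursion on long threads.

```python
import pandas as pd

# Toy data with made-up IDs; rows are deliberately interleaved.
df = pd.DataFrame({
    "id": pd.array([10, 20, 11, 21, 12], dtype="Int64"),
    "in_reply_to_status_id": pd.array([None, None, 10, 20, 11], dtype="Int64"),
    "text": ["tweet A", "tweet B", "reply to A", "reply to B", "reply to reply"],
})

# Edges of the reply graph: tweet ID -> ID of the tweet it replies to.
parent = {
    int(i): (None if pd.isna(p) else int(p))
    for i, p in zip(df["id"], df["in_reply_to_status_id"])
}

root_cache = {}  # tweet ID -> ID of the head tweet of its conversation


def find_root(tweet_id):
    """Follow reply links upward and remember the result for every node visited."""
    path = []
    current = tweet_id
    while current not in root_cache:
        path.append(current)
        nxt = parent.get(current)
        # A head tweet, a parent missing from the data, or a cycle ends the walk.
        if nxt is None or nxt not in parent or nxt in path:
            root_cache[current] = current
            break
        current = nxt
    root = root_cache[current]
    for node in path:
        root_cache[node] = root
    return root


# Label each row with its conversation, then collect the texts per conversation.
df["conversation_id"] = [find_root(int(i)) for i in df["id"]]
conversations = {
    root: group["text"].tolist()
    for root, group in df.groupby("conversation_id", sort=False)
}
print(conversations)
```

With the `conversation_id` column in place, any per-conversation step (such as scoring sentiment) can be applied through `groupby` without reordering the original DataFrame.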




solution1 0 2022-05-30 13:00:17
