
Grouping rows in a Pandas DataFrame

We're working with a very large Twitter dataset containing around 4.9 million entries. Every entry is either a tweet or a reply to a tweet (or, of course, a reply to a reply). Because the data was collected through the Twitter API, tweets and their replies are not neatly grouped in the DataFrame; many unrelated entries sit between them.

We are trying to group the tweets with their corresponding replies so that we can perform sentiment analysis on each conversation, but this is where we are stuck. We started by reversing the row order of the DataFrame, as it seemed easier to search from the last reply back to the original tweet than the other way around.

We are using two columns: id (the ID of the entry itself) and in_reply_to_status_id (the ID of the tweet that the entry replies to).

In essence, we want to create some kind of for loop that detects the first row where in_reply_to_status_id is an integer and then links it to the tweet or reply it answers by matching that value against the id column. The loop has to continue until it finds a row where in_reply_to_status_id is None, as this means it has reached the original tweet (a tweet that starts a conversation cannot be a reply to anything).

So the first entry here would have in_reply_to_status_id = 1244694453190897664. We store this entry and use that value to search for its "original" tweet. That tweet in turn has an in_reply_to_status_id of 1243885949697888263, so we store it as well and then look for its original tweet using this new in_reply_to_status_id. We want to continue this process until we arrive at an entry where in_reply_to_status_id is None, as this marks the end of a conversation.
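To make the lookup described above concrete, here is a minimal sketch on a toy frame. The IDs are small made-up numbers, and the helper name chain_to_original is hypothetical. It assumes the two columns are named id and in_reply_to_status_id as in our data. The nullable "Int64" dtype is used so that missing parents do not turn the column into floats, which cannot represent every 19-digit tweet ID exactly.

```python
import pandas as pd

# Toy frame with small made-up IDs (real tweet IDs are 19-digit integers).
df = pd.DataFrame({
    "id": [10, 11, 12, 20, 21],
    "in_reply_to_status_id": pd.array([None, 10, 11, None, 20], dtype="Int64"),
})

# Map every tweet ID to the ID it replies to (None for original tweets).
parent_of = {
    int(tweet_id): (None if pd.isna(parent) else int(parent))
    for tweet_id, parent in zip(df["id"], df["in_reply_to_status_id"])
}

def chain_to_original(tweet_id, parent_of):
    """Return [tweet_id, its parent, ..., original tweet] by following reply links."""
    chain = [tweet_id]
    seen = {tweet_id}
    while True:
        parent = parent_of.get(chain[-1])
        # Stop at an original tweet, at a parent missing from the data, or on a cycle.
        if parent is None or parent not in parent_of or parent in seen:
            break
        chain.append(parent)
        seen.add(parent)
    return chain

print(chain_to_original(12, parent_of))
```

Because each step is a dictionary lookup rather than a scan of the DataFrame, the row order should not matter for this approach.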

Would anyone have any ideas on how to start on such an operation?

This seems like a pretty hard operation, and I think I understand your issue only partly. In my opinion, you should first group the elements by "in_reply_to_user" or "in_reply_to_status" (I don't know the exact difference between the two). After that, check whether the id of each row where "in_reply_to_status" == 'None' appears in the "in_reply_..." column of any other row. That way you start with only the "head" tweets, the ones the others point to, and then verify whether any of them has replies.

Next, you should recursively search for each id in the "in reply to" column until no further row points to it. You could build a list/tuple/dict along the way, appending each link together with the tweet it points to (think of it as a graph). From that you could make a new DataFrame that holds the batches, each containing the ids/indexes of the nodes in one conversation. This is my take, hope it helps!
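One possible way to sketch the graph idea: treat each reply as an edge to its parent, find the "head" tweet for every row, and group on that. This is a rough sketch on made-up toy IDs, not a tested solution for the full 4.9 million rows. The column name conversation_id and the helper find_root are my own hypothetical names, and it assumes that a reply whose parent was never collected should start its own (partial) conversation.

```python
import pandas as pd

# Toy data with made-up IDs; tweet 30 replies to a tweet (99) that is not in the data.
df = pd.DataFrame({
    "id": [10, 20, 11, 21, 12, 30],
    "in_reply_to_status_id": pd.array([None, None, 10, 20, 11, 99], dtype="Int64"),
})

# Edges of the graph: tweet ID -> ID of the tweet it replies to (None for heads).
parent_of = {
    int(tweet_id): (None if pd.isna(parent) else int(parent))
    for tweet_id, parent in zip(df["id"], df["in_reply_to_status_id"])
}

def find_root(tweet_id, parent_of, root_cache):
    """Follow reply links upward and cache the root for every tweet visited."""
    path = []
    current = tweet_id
    while current not in root_cache:
        path.append(current)
        parent = parent_of.get(current)
        # The chain ends at a head tweet, at a parent that was never collected,
        # or (defensively) when malformed data would loop back on itself.
        if parent is None or parent not in parent_of or parent in path:
            root_cache[current] = current
            break
        current = parent
    root = root_cache[current]
    for node in path:
        root_cache[node] = root
    return root

root_cache = {}
df["conversation_id"] = [find_root(int(t), parent_of, root_cache) for t in df["id"]]

# One batch of tweet IDs per conversation, keyed by the head tweet's ID.
conversations = {
    int(root): group["id"].tolist()
    for root, group in df.groupby("conversation_id")
}
print(conversations)
```

The loop is iterative rather than recursive so that long reply chains cannot hit Python's recursion limit, and the cache means each tweet's chain is walked only once.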


 