
Grouping rows in a Pandas DataFrame

Original 2022-05-30 12:15:37 2 1 python/ pandas/ twitter

We're working with a really large Twitter database containing around 4.9 million entries. Every entry can either be a tweet or a reply to a tweet (or, of course, a reply to a reply). Since this data was collected using the Twitter API, tweets and their replies are not neatly grouped in the DataFrame; many unrelated entries sit in between them.

We are trying to group the tweets with their corresponding replies so we can perform a sentiment analysis on each conversation, but this is where we are stuck. We started by reversing the DataFrame, as it is easier to search from the last reply back to the original tweet than the other way around.

Now we'll be using the column id (the tweet's own ID) and the column in_reply_to_status_id (the ID of the tweet to which this entry is a reply).

In essence, we want to create some kind of for loop that detects the first row where in_reply_to_status_id is an integer and then links it to the reply/tweet above by matching that value against the id column. It has to continue this process until it finds a row where in_reply_to_status_id is None, as this means you've found the original tweet (an original tweet evidently cannot be a reply to something).

So the first entry here would have in_reply_to_status_id = 1244694453190897664; we store this entry and use that value to search for its "original" tweet. But this gives us a new in_reply_to_status_id of 1243885949697888263, so we store this entry as well and then have to look for its original tweet using this new in_reply_to_status_id. We want to continue this process until we arrive at an entry where in_reply_to_status_id is None, as this marks the end of a conversation.

Would anyone have any ideas on how to start on such an operation?

1 answer

This seems like a pretty hard operation. I think I understand part of your issue (but sadly not all of it). In my opinion, you should first group the elements by "in_reply_to_user" or "in_reply_to_status" (I don't know the exact difference between the two), and then check whether the id of each row where "in_reply_to_status" == 'None' appears in any other "in_reply_.." column. That way, the first step takes only the "head" tweets, the ones the others point to, and then verifies whether any of them has replies. After that, you should recursively look up each id in the "in reply to" columns until, for each link, there is no further value pointing to it. You could recursively build a list/tuple/dict to which you append the links and the tweets they link to (think of it as a graph); then you could build a new DataFrame from the batches containing the specific id/index of each node. This is my take, hope it helps!
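
The graph idea above could be sketched roughly as follows. This is only a minimal sketch under several assumptions: the helper name add_conversation_id, the output column conversation_id, and the toy data are hypothetical and not from the question; missing parents are assumed to be real missing values (None/NaN/NA) rather than the string 'None'; and the IDs are assumed to be stored exactly (nullable Int64 or strings), since 19-digit tweet IDs lose precision in a float64 column. Instead of scanning the DataFrame once per reply, it builds a dict from each id to its parent id and walks that dict up to the root tweet.

```python
import pandas as pd


def add_conversation_id(df, id_col="id", parent_col="in_reply_to_status_id"):
    """Return a copy of df with a 'conversation_id' column (hypothetical name)
    holding the id of the earliest tweet in each reply chain."""
    # Map every tweet id to its parent id; None marks an original tweet.
    parent = {}
    for tid, pid in zip(df[id_col], df[parent_col]):
        parent[int(tid)] = None if pd.isna(pid) else int(pid)

    root_cache = {}  # id -> root id, filled in as chains are resolved

    def find_root(tid):
        path, seen = [], set()
        # Walk upwards until we reach an id whose root is already known.
        while tid not in root_cache:
            path.append(tid)
            seen.add(tid)
            pid = parent.get(tid)
            # Stop at an original tweet, at a parent that was never
            # collected, or if the data contains a cycle.
            if pid is None or pid not in parent or pid in seen:
                root_cache[tid] = tid
                break
            tid = pid
        root = root_cache[tid]
        for t in path:  # remember the root for every tweet on the path
            root_cache[t] = root
        return root

    out = df.copy()
    out["conversation_id"] = [find_root(int(t)) for t in out[id_col]]
    return out


# Toy data (made up for illustration): two conversations, interleaved.
tweets = pd.DataFrame(
    {
        "id": [10, 20, 11, 21, 12, 13],
        "in_reply_to_status_id": pd.array(
            [None, None, 10, 20, 11, 12], dtype="Int64"
        ),
    }
)

result = add_conversation_id(tweets)

# Put each conversation's rows next to each other; sorting by id inside a
# conversation is an assumption standing in for chronological order.
grouped = result.sort_values(["conversation_id", "id"])
for conv_id, thread in grouped.groupby("conversation_id"):
    print(conv_id, thread["id"].tolist())
```

Because every resolved chain is cached, each tweet is visited only a few times, which should matter with around 4.9 million rows. A reply whose parent was never collected becomes the root of its own group here; whether that is the right choice depends on the analysis.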