NLP: Preprocessing a dataset into a new dataset

Question

I need help with processing an unsorted dataset. Sorry if this is a beginner question; I have never done anything like this before. As you can see, each conversation is identified by a dialogueID and consists of multiple rows with "from" and "to" fields as well as text messages. I would like to concatenate the text messages from the same sender within a dialogueID into one column, and those from the receiver into another column. This way, I could have a new CSV file with just [dialogueID, sender, receiver].

The new dataset should look like this:

I watched multiple tutorials and still struggle to figure out how to do it. I read in this 9-year-old post that iterating through data frames is not a good idea. Could someone help me out with a code snippet or give me a hint on how to do this properly without overcomplicating things? I thought of something like the pseudocode below, but the performance with 1 million rows would not be great, right?

while !endOfFile
  for dialogueID in range (0, 1038324)
    if dialogueID+1 == dialogueID and toValue.isnull()
      concatenate textFromPrevRow + " " + textFromCurrentRow
      add new string to table column sender
    else
      add text to column receiver

Answer 1

Edit 1

According to your clarification, this is what I believe you're looking for.

Create an aggregation function that concatenates your string values with a line-break character. Then group by dialogueID and apply the aggregation.

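# Join the non-missing values of a group with line breaks.
# dropna() avoids a TypeError when a column (e.g. 'to') contains NaN.
def join_lines(values):
    return '\n'.join(values.dropna().astype(str))

d = {'from': join_lines, 'to': join_lines}
# One output row per dialogueID
new_df = dialogue_dataframe.groupby('dialogueID', as_index=False).agg(d)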

After that, rename the columns as you'd like (rename returns a new DataFrame, so assign the result):

new_df = new_df.rename(columns={"from": "sender", "to": "receiver"})
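For the layout described in the question (one row per dialogueID, with the sender's messages in one column and the receiver's in another), here is a minimal sketch on toy data. It rests on assumptions that are not confirmed by the question: the message column is called `text`, it has no missing values, and the "sender" is whoever wrote the first row of each dialogue. The sample rows and the `role` helper column are made up for illustration.

```python
import pandas as pd

# Toy data; column names are assumed from the question's description.
dialogue_dataframe = pd.DataFrame({
    "dialogueID": ["1.tsv", "1.tsv", "1.tsv", "2.tsv", "2.tsv"],
    "from": ["alice", "bob", "alice", "carol", "dave"],
    "to": [None, "alice", "bob", None, "carol"],
    "text": ["hi", "hello", "need help", "anyone here?", "yes"],
})

# Assumption: the author of the first row of a dialogue is the "sender";
# every other author in that dialogue counts as "receiver".
first_author = dialogue_dataframe.groupby("dialogueID")["from"].transform("first")
role = (dialogue_dataframe["from"] == first_author).map(
    {True: "sender", False: "receiver"}
)

new_df = (
    dialogue_dataframe.assign(role=role)
    .groupby(["dialogueID", "role"])["text"]
    .agg(" ".join)                             # concatenate texts per dialogue and role
    .unstack("role")                           # roles become columns
    .reindex(columns=["sender", "receiver"])   # fix the column order
    .reset_index()
)
print(new_df)
```

This stays vectorized at the pandas level (no explicit loop over rows), which is the main reason to prefer it over the row-by-row pseudocode for around a million rows.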

Original answer

I'm not quite sure I understood what you are trying to achieve, but maybe this will give some insight. It would help if you wrote out a couple of rows of the table you expect to get.

Answer 2

While the exact structure of the data (and thus your task) is not completely clear, DataFrame.apply, or rather DataFrame.aggregate, may help you speed things up. Also, I would aggregate into either a dictionary or a DataFrame indexed by dialogue ID. This way you can easily check whether a given dialogue or sender already exists.
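One way to read the dictionary suggestion is sketched below: aggregate the texts per (dialogue, author) pair and nest them in a dictionary keyed by dialogue ID. The toy rows, the `text` column name, and the names `messages` and `by_dialogue` are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Toy data; column names are assumed from the question's description.
messages = pd.DataFrame({
    "dialogueID": ["1.tsv", "1.tsv", "1.tsv", "2.tsv", "2.tsv"],
    "from": ["alice", "bob", "alice", "carol", "dave"],
    "text": ["hi", "hello", "need help", "anyone here?", "yes"],
})

# Concatenate all texts per (dialogueID, author) pair.
texts = messages.groupby(["dialogueID", "from"])["text"].agg(" ".join)

# Nest the result: {dialogueID: {author: concatenated text}}.
by_dialogue = {}
for (dialogue_id, author), text in texts.items():
    by_dialogue.setdefault(dialogue_id, {})[author] = text

# Existence checks are then plain dictionary lookups.
has_dialogue = "1.tsv" in by_dialogue
has_author = "bob" in by_dialogue.get("1.tsv", {})
```

The nested dictionary makes lookups cheap, while the grouped Series `texts` keeps the same information in a form that can still be reshaped with pandas.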

Answer 1: score 1, accepted, 2022-11-16 23:37:48

Answer 2: score 0, 2022-11-16 23:57:31
