
NLP: pre-processing dataset into a new dataset

I need help with processing an unsorted dataset. Sorry if I am a complete noob; I have never done anything like this before. As you can see, each conversation is identified by a dialogueID and consists of multiple rows with "from" and "to" values as well as text messages. I would like to concatenate the text messages from the same sender within a dialogueID into one column, and those from the receiver into another column. This way, I could have a new CSV file with just [dialogueID, sender, receiver].

(Image: dataset) (Image: the new dataset should look like this)

I watched multiple tutorials and am really struggling to figure out how to do it. I read in this 9-year-old post that iterating through data frames is not a good idea. Could someone help me out with a code snippet or give me a hint on how to do it properly without overcomplicating things? I thought of something like the pseudocode below, but the performance with 1 million rows would not be great, right?

while !endOfFile
  for dialogueID in range (0, 1038324)
    if dialogueID+1 == dialogueID and toValue.isnull()
      concatenate textFromPrevRow + " " + textFromCurrentRow
      add new string to table column sender
    else
      add text to column receiver

Edit 1

According to your clarification, this is what I believe you're looking for.

Create an aggregation function that concatenates your string values with a line-break character. Then group by dialogueID and apply your aggregation.

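# Map each column to the function that combines its values within one dialogue.
# '\n'.join only accepts strings, so missing values (NaN) would need to be filled first.
d = {}
d['from'] = '\n'.join
d['to'] = '\n'.join
# One output row per dialogueID; as_index=False keeps dialogueID as a regular column.
new_df = dialogue_dataframe.groupby('dialogueID', as_index=False).agg(d)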

After that, rename the columns as you'd like:

# rename returns a new DataFrame, so assign the result back.
new_df = new_df.rename(columns={"from": "sender", "to": "receiver"})
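
For reference, here is a minimal end-to-end sketch of the group-and-join idea on toy data. It rests on assumptions that are not confirmed by the question: that the messages live in a column named text, and that the "sender" of a dialogue is whoever wrote its first row, with everyone else treated as "receiver". The names and values in the toy frame are invented for illustration.

```python
import pandas as pd

# Toy data: column names follow the question, the "text" column name and
# all values are assumptions made for this example.
df = pd.DataFrame({
    "dialogueID": ["1.tsv", "1.tsv", "1.tsv", "2.tsv", "2.tsv"],
    "from": ["alice", "bob", "alice", "carol", "dave"],
    "to": [None, "alice", "bob", None, "carol"],
    "text": ["hi", "hello", "how are you", "help?", "sure"],
})

# Assumption: the author of the first row of each dialogue is its "sender".
first_author = df.groupby("dialogueID")["from"].transform("first")
df["role"] = (df["from"] == first_author).map({True: "sender", False: "receiver"})

# Join all texts per (dialogue, role), then turn the roles into columns.
result = (
    df.groupby(["dialogueID", "role"])["text"]
      .agg(" ".join)
      .unstack("role")
      .reindex(columns=["sender", "receiver"])
      .reset_index()
)
result.columns.name = None

print(result)
```

From there, result.to_csv("dialogues.csv", index=False) would write the three-column file (the file name is only a placeholder). A dialogue in which only one person speaks ends up with NaN in the receiver column.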

Original answer

Not quite sure I understood what you are trying to achieve, but maybe this will give some insights. Maybe write out a couple of rows of the table you expect to get, for better clarification.

While the exact structure of the data (and thus your task) is not completely clear, maybe DataFrame.apply, or rather DataFrame.aggregate, can help you speed things up. Also, I would aggregate into either a dictionary or a dataframe indexed by dialogue ID. This way you can easily check whether a given dialogue / sender already exists.
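
The dictionary variant could look roughly like the sketch below. It is only an illustration under assumptions: the rows are given as (dialogueID, author, text) tuples, and the sample values are made up.

```python
from collections import defaultdict

# Hypothetical rows as (dialogueID, author, text); values are invented.
rows = [
    ("1.tsv", "alice", "hi"),
    ("1.tsv", "bob", "hello"),
    ("1.tsv", "alice", "how are you"),
    ("2.tsv", "carol", "help?"),
]

# dialogueID -> author -> list of that author's messages, in row order.
dialogues = defaultdict(lambda: defaultdict(list))
for dialogue_id, author, text in rows:
    dialogues[dialogue_id][author].append(text)

# Join each author's messages into a single string per dialogue.
joined = {
    dialogue_id: {author: " ".join(texts) for author, texts in authors.items()}
    for dialogue_id, authors in dialogues.items()
}

# Membership checks use "in", which does not create new keys.
has_dialogue = "1.tsv" in joined
has_sender = "bob" in joined["1.tsv"]
```

This makes a single pass over the rows, and checking whether a dialogue or sender already exists is a plain dictionary lookup.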
