
Split sentences into substrings containing varying number of words using pandas

My question is related to this past question of mine: Split text in cells and create additional rows for the tokens.

Let's suppose that I have the following in a pandas DataFrame:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing tennis.
3   This is the third document and it looks very good today.

and I want to split the text of each id into chunks containing a random number of words (varying between two values, e.g. 1 and 5), so I finally want to have something like the following:

id  text
1   I am the
1   first document
1   and I am very
1   happy
2   Here is
2   the second document and it
2   likes playing
2   tennis
3   This is the third
3   document and
3   it looks very
3   good today

Keep in mind that my dataframe may also have other columns besides these two; these should simply be copied over to the new dataframe in the same way as id above.

What is the most efficient way to do this?

Define a function to extract chunks in a random fashion using itertools.islice:

from itertools import islice
import random

lo, hi = 3, 5  # bounds on the chunk size; change these to whatever you need

def extract_chunks(it):
    # Repeatedly slice a randomly sized chunk of words off the iterator
    # until it is exhausted.
    chunks = []
    while True:
        chunk = list(islice(it, random.choice(range(lo, hi + 1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))

    return chunks
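
As a quick sanity check, here is a minimal usage sketch on a single sentence (an illustrative call, not part of the original answer; the exact chunking varies between runs because random.choice drives the slice sizes):

words = iter("I am the first document and I am very happy.".split())
print(extract_chunks(words))
# One possible output: ['I am the first', 'document and I am', 'very happy.']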

Call the function inside a list comprehension to keep the overhead as low as possible, then stack to get your output:

pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()

id   
1   0                    I am the
    1        first document and I
    2              am very happy.
2   0                 Here is the
    1         second document and
    2    it likes playing tennis.
3   0           This is the third
    1       document and it looks
    2            very good today.
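
If you prefer the flat id/text layout from the question over a MultiIndexed Series, you can drop the inner index level (it only records the chunk position) and reset the index. A minimal sketch, assuming the stacked result above is stored in a variable out:

out = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()

# Drop the chunk-position level, name the values, and restore `id` as a column.
result = out.droplevel(1).rename('text').reset_index()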

You can extend the extract_chunks function to perform tokenisation. Right now, I use a simple split on whitespace, which you can modify.
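
For instance, a variant that tokenises with a regular expression instead of whitespace splitting might look like this (a sketch; extract_chunks_tokenised and the \w+ pattern are illustrative choices, not part of the original answer):

import re

def extract_chunks_tokenised(text, pattern=r'\w+'):
    # Pull word tokens out with a regex (dropping punctuation),
    # then reuse the same random chunking logic as before.
    return extract_chunks(iter(re.findall(pattern, text)))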


Note that if you have other columns you don't want to touch, you can do something like a melt operation here.

u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])

# Keep every column except `text` as an identifier, then unpivot the chunks.
(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text'])))
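
By default melt names its output columns variable (here, the chunk position) and value (the chunk text), and documents that produced fewer chunks than the widest one leave NaN rows behind. A small cleanup pass under those assumptions restores the original column layout:

tidy = (pd.concat([df.drop(columns='text'), u], axis=1)
          .melt(df.columns.difference(['text']))
          .dropna(subset=['value'])       # drop the padding from shorter documents
          .drop(columns='variable')       # the chunk position is no longer needed
          .rename(columns={'value': 'text'}))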
