Split sentences into substrings containing a varying number of words using pandas
My question is related to this past question of mine: Split text in cells and create additional rows for the tokens.
Let's suppose that I have the following DataFrame in pandas:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.
and I want to split the text of each id into tokens of a random number of words (varying between two values, e.g. 1 and 5), so I finally want to have something like the following:
id text
1 I am the
1 first document
1 and I am very
1 happy
2 Here is
2 the second document and it
2 likes playing
2 tennis
3 This is the third
3 document and
3 looks very
3 very good today
Keep in mind that my dataframe may also have other columns besides these two, which should simply be copied to the new dataframe in the same way as id above.
What is the most efficient way to do this?
Define a function to extract chunks in a random fashion using itertools.islice:
from itertools import islice
import random

lo, hi = 3, 5  # change this to whatever

def extract_chunks(it):
    chunks = []
    while True:
        chunk = list(islice(it, random.choice(range(lo, hi + 1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks
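For example, running the function on a single sentence (a quick sanity check, not part of the original answer; the chunk boundaries are random, so the exact split varies between runs):

```python
from itertools import islice
import random

lo, hi = 1, 5  # the bounds from the question

def extract_chunks(it):
    chunks = []
    while True:
        # Pull between lo and hi words off the iterator at a time.
        chunk = list(islice(it, random.choice(range(lo, hi + 1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks

text = "I am the first document and I am very happy."
chunks = extract_chunks(iter(text.split()))
# Whatever the random boundaries, rejoining the chunks recovers the sentence.
assert ' '.join(chunks) == text
```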
Call the function through a list comprehension to ensure the least possible overhead, then stack to get your output:
pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()
id
1 0 I am the
1 first document and I
2 am very happy.
2 0 Here is the
1 second document and
2 it likes playing tennis.
3 0 This is the third
1 document and it looks
2 very good today.
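To get back to the exact two-column id/text shape the question asked for, one option (a sketch of my own, not from the original answer) is to drop the inner counter level of the stacked index and reset the remaining one:

```python
import pandas as pd
from itertools import islice
import random

lo, hi = 1, 5

def extract_chunks(it):
    chunks = []
    while True:
        chunk = list(islice(it, random.choice(range(lo, hi + 1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks

df = pd.DataFrame({
    'id': [1, 2],
    'text': ['I am the first document and I am very happy.',
             'Here is the second document and it likes playing tennis.']
})

out = (pd.DataFrame([extract_chunks(iter(t.split())) for t in df['text']],
                    index=df['id'])
       .stack()
       .dropna()                        # padding from unequal chunk counts
       .reset_index(level=1, drop=True)  # discard the per-row chunk counter
       .rename('text')
       .reset_index())                  # id back as an ordinary column
```

`out` then has exactly the columns `id` and `text`, one chunk per row.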
You can extend the extract_chunks function to perform tokenisation. Right now, I use a simple split on whitespace, which you can modify.
Note that if you have other columns you don't want to touch, you can do something like a melting operation here.
u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])

(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text'])))
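A self-contained sketch of that melt approach, with a made-up extra column `category` standing in for whatever columns you want carried along unchanged:

```python
import pandas as pd
from itertools import islice
import random

lo, hi = 1, 5

def extract_chunks(it):
    chunks = []
    while True:
        chunk = list(islice(it, random.choice(range(lo, hi + 1))))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks

# `category` is a hypothetical extra column, purely for illustration.
df = pd.DataFrame({
    'id': [1, 2],
    'category': ['news', 'sport'],
    'text': ['I am the first document and I am very happy.',
             'Here is the second document and it likes playing tennis.']
})

u = pd.DataFrame([extract_chunks(iter(t.split())) for t in df['text']])
out = (pd.concat([df.drop(columns='text'), u], axis=1)
       .melt(id_vars=df.columns.difference(['text']),
             value_name='text')
       .dropna(subset=['text']))  # unequal chunk counts leave NaN padding
```

The `variable` column holds the chunk position within each document; sorting by `id` and `variable` restores the original word order.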