Split sentences (strings) in pandas into separate rows of words with sentence numbering
I have a pandas dataframe like this:
sn  sentence                    entity
1.  an apple is an example of?  an apple is example of fruit
2.  a potato is an example of?  a potato is example of vegetable
I want to create another pandas dataframe that looks like the one below, where the sentence and the entity contribute the same number of words per row:
Sentence# Word Entity
1 An an
1 apple apple
1 is is
1 an example
1 example of
1 of? fruit
2 A a
2 potato potato
2 is is
2 an example
2 example of
2 of? vegetable
What I have tried so far:
df = data.sentence.str.split(expand=True).stack()
pd.DataFrame({
'Sentence': df.index.get_level_values(0) + 1,
'Word': df.values,
'Entity':
})
The last bit, "Entity", is what I can't seem to get right.
I also tried to split and stack the entity column, like so:
df2 = data.entity.str.split(expand=True).stack()
and then attempted to put it all back together:
pd.DataFrame({
'Sentence': df.index.get_level_values(0) + 1,
'Word': df.values,
'Entity': df2.values
})
but then I get:
ValueError: arrays must all be same length
len(df) = 536810, len(df2) = 536802
I am new to Python. Any help or pointers appreciated.
Let us try str.split, then explode, and concat the columns back together:
s=df.set_index('sn')
s=pd.concat([s[x].str.split(' ').explode() for x in s.columns],axis=1).reset_index()
s
Out[79]:
sn sentence entity
0 1 an an
1 1 apple apple
2 1 is is
3 1 an example
4 1 example of
5 1 of? fruit
6 2 a a
7 2 potato potato
8 2 is is
9 2 an example
10 2 example of
11 2 of? vegetable
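The lengths reported in the question (536810 vs 536802) suggest that some rows have a different number of words in sentence than in entity, in which case the columns cannot be lined up word for word. A minimal sketch for locating such rows, assuming whitespace tokenisation; the sample data here is hypothetical, with the second row deliberately one token short. Note that str.split() with no argument collapses runs of whitespace, whereas str.split(' ') yields empty strings for consecutive spaces, which can itself cause a count mismatch.

```python
import pandas as pd

# Hypothetical sample data; row 2 is deliberately missing "of" in `entity`.
df = pd.DataFrame({
    "sn": [1, 2],
    "sentence": ["an apple is an example of?", "a potato is an example of?"],
    "entity": ["an apple is example of fruit", "a potato is example vegetable"],
})

# Count whitespace-separated tokens in each column, row by row.
n_sentence = df["sentence"].str.split().str.len()
n_entity = df["entity"].str.split().str.len()

# Rows where the counts disagree cannot be aligned word for word.
mismatched = df.loc[n_sentence != n_entity, "sn"].tolist()
print(mismatched)
```

Rows flagged this way would need to be fixed or dropped before splitting both columns into aligned rows.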
Here is a simple way to do it without explicit iteration:
df.set_index('sn').\
applymap(str.split).\
apply(pd.Series.explode, axis=0).\
reset_index()
sn sentence entity
0 1 an an
1 1 apple apple
2 1 is is
3 1 an example
4 1 example of
5 1 of? fruit
6 2 a a
7 2 potato potato
8 2 is is
9 2 an example
10 2 example of
11 2 of? vegetable
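As a possible alternative, DataFrame.explode accepts a list of columns from pandas 1.3 onwards, so both columns can be exploded in one call. This may also be worth considering because DataFrame.applymap is deprecated in pandas 2.1 and later in favour of DataFrame.map. The sketch below assumes the sample data from the question and requires every row to hold the same number of tokens in both columns; otherwise explode raises a ValueError.

```python
import pandas as pd

# Sample data modelled on the question.
df = pd.DataFrame({
    "sn": [1, 2],
    "sentence": ["an apple is an example of?", "a potato is an example of?"],
    "entity": ["an apple is example of fruit", "a potato is example of vegetable"],
})

# Turn each string into a list of tokens, then explode both columns together.
# Passing a list of columns requires pandas >= 1.3.
out = (
    df.assign(
        sentence=df["sentence"].str.split(),
        entity=df["entity"].str.split(),
    )
    .explode(["sentence", "entity"], ignore_index=True)
)
print(out)
```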
One approach without loops:
new_df = (df.set_index('sn')
.stack()
.str.split(expand=True)
.stack()
.unstack(level=1)
.reset_index(level=0, drop=False)
)
print(new_df)
Output
sn sentence entity
0 1.0 an an
1 1.0 apple apple
2 1.0 is is
3 1.0 an example
4 1.0 example of
5 1.0 of? fruit
0 2.0 a a
1 2.0 potato potato
2 2.0 is is
3 2.0 an example
4 2.0 example of
5 2.0 of? vegetable
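The output above keeps a repeating 0 to 5 index and shows sn as a float, while the question asks for columns named Sentence#, Word and Entity. A small sketch of one way to tidy that up; long_df is a hypothetical, shortened stand-in for the result above, not the actual frame.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the long-format result shown above.
long_df = pd.DataFrame(
    {
        "sn": [1.0, 1.0, 2.0, 2.0],
        "sentence": ["an", "apple", "a", "potato"],
        "entity": ["an", "apple", "a", "potato"],
    },
    index=[0, 1, 0, 1],
)

tidy = (
    long_df
    .rename(columns={"sn": "Sentence#", "sentence": "Word", "entity": "Entity"})
    .astype({"Sentence#": int})   # 1.0 -> 1
    .reset_index(drop=True)       # replace the repeating index with 0..n-1
)
print(tidy)
```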