
Split sentences (strings) in pandas into separate rows of words with sentence numbering

I have a pandas dataframe like this:

sn  sentence                    entity
1.  an apple is an example of?  an apple is example of fruit
2.  a potato is an example of?  a potato is example of vegetable

I want to create another pandas dataframe that looks like the one below, where the sentence and the entity have the same number of words:

Sentence#   Word    Entity
  1         An      an 
  1         apple   apple
  1         is      is
  1         an      example
  1         example of 
  1         of?     fruit
  2         A       a 
  2         potato  potato
  2         is      is
  2         an      example
  2         example of
  2         of?     vegetable

What I have tried so far:

df = data.sentence.str.split(expand=True).stack()

pd.DataFrame({
    'Sentence': df.index.get_level_values(0) + 1, 
    'Word': df.values, 
    'Entity': 
})

The last bit, "Entity", is what I can't seem to get right.

I also tried to split and stack the entity column, like so:

df2 = data.entity.str.split(expand=True).stack()

and then attempt to put all back together

pd.DataFrame({
    'Sentence': df.index.get_level_values(0) + 1, 
    'Word': df.values, 
    'Entity': df2.values
})

but then I get ValueError: arrays must all be same length

len(df) = 536810, len(df2) = 536802

I am new to Python. Any help or pointers appreciated.
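The differing lengths (536810 vs. 536802) suggest that at least one row has a different number of whitespace-separated tokens in sentence than in entity. A minimal sketch for locating such rows follows; the three-row frame is made up for illustration, and its third row is deliberately uneven.

```python
import pandas as pd

# Made-up sample data; the third row is deliberately uneven (5 words vs. 4 tags).
data = pd.DataFrame({
    "sn": [1, 2, 3],
    "sentence": ["an apple is an example of?",
                 "a potato is an example of?",
                 "a carrot is a what?"],
    "entity": ["an apple is example of fruit",
               "a potato is example of vegetable",
               "a carrot is vegetable"],
})

# Count whitespace-separated tokens in each column.
n_words = data["sentence"].str.split().str.len()
n_tags = data["entity"].str.split().str.len()

# Rows where the counts disagree cannot be paired word-for-word.
mismatched = data[n_words != n_tags]
print(mismatched[["sn", "sentence", "entity"]])
```

Inspecting the rows this returns should show whether the extra tokens come from stray whitespace, missing tags, or something else in the data.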

Let us try str.split, then explode, and concat back:

s=df.set_index('sn')
s=pd.concat([s[x].str.split(' ').explode() for x in s.columns],axis=1).reset_index()
s
Out[79]: 
    sn sentence     entity
0    1       an         an
1    1    apple      apple
2    1       is         is
3    1       an    example
4    1  example         of
5    1      of?      fruit
6    2        a          a
7    2   potato     potato
8    2       is         is
9    2       an    example
10   2  example         of
11   2      of?  vegetable
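As an alternative sketch of the same idea, DataFrame.explode accepts a list of columns in pandas 1.3 and later, so both columns can be exploded in one call. The example below rebuilds the two sample rows from the question and renames the columns to the requested headers; it assumes every row has equal token counts in both columns, since explode raises a ValueError otherwise.

```python
import pandas as pd

# The two sample rows from the question.
df = pd.DataFrame({
    "sn": [1, 2],
    "sentence": ["an apple is an example of?", "a potato is an example of?"],
    "entity": ["an apple is example of fruit", "a potato is example of vegetable"],
})

out = (
    # Turn each string into a list of whitespace-separated tokens.
    df.assign(sentence=df["sentence"].str.split(),
              entity=df["entity"].str.split())
      # Explode both list columns together (requires pandas >= 1.3).
      .explode(["sentence", "entity"], ignore_index=True)
      .rename(columns={"sn": "Sentence#", "sentence": "Word", "entity": "Entity"})
)
print(out)
```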

Here is a simple way to do it without explicit iteration:

  1. Set sn as the index
  2. Applymap str.split to each cell in the dataframe
  3. Explode the lists over axis 0
  4. Reset the index
df.set_index('sn').\
applymap(str.split).\
apply(pd.Series.explode, axis=0).\
reset_index()

    sn sentence     entity
0    1       an         an
1    1    apple      apple
2    1       is         is
3    1       an    example
4    1  example         of
5    1      of?      fruit
6    2        a          a
7    2   potato     potato
8    2       is         is
9    2       an    example
10   2  example         of
11   2      of?  vegetable

One approach without loops:

new_df = (df.set_index('sn')
            .stack()
            .str.split(expand=True)
            .stack()
            .unstack(level=1)
            .reset_index(level=0, drop=False)
         )
print(new_df)

Output

    sn sentence     entity
0  1.0       an         an
1  1.0    apple      apple
2  1.0       is         is
3  1.0       an    example
4  1.0  example         of
5  1.0      of?      fruit
0  2.0        a          a
1  2.0   potato     potato
2  2.0       is         is
3  2.0       an    example
4  2.0  example         of
5  2.0      of?  vegetable
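If some rows have unequal token counts, as the differing lengths in the question suggest, one option is to pad the shorter side with missing values instead of failing. The sketch below uses itertools.zip_longest from the standard library; it does iterate row by row, and the second sample row is made up to be uneven on purpose.

```python
from itertools import zip_longest

import pandas as pd

# Made-up sample; the second row has 5 words but only 4 tags.
df = pd.DataFrame({
    "sn": [1, 2],
    "sentence": ["an apple is an example of?", "a carrot is a what?"],
    "entity": ["an apple is example of fruit", "a carrot is vegetable"],
})

# Pair tokens position by position; zip_longest fills the shorter side with None.
rows = [
    (sn, word, tag)
    for sn, sentence, entity in df.itertuples(index=False, name=None)
    for word, tag in zip_longest(sentence.split(), entity.split())
]

out = pd.DataFrame(rows, columns=["Sentence#", "Word", "Entity"])
print(out)
```

Padding only keeps the row counts aligned; whether each word ends up next to the right tag still depends on how the source data was produced.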
