
Split sentences in pandas into sentence number and words

I have a pandas dataframe like this:

Text            start    end    entity     value
I love apple      7       11    fruit      apple
I ate potato      6       11    vegetable  potato
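For anyone who wants to reproduce the problem, the sample frame above can be rebuilt roughly as follows. This is only a sketch: the column names and values are copied from the table, and the dtypes are whatever pandas infers.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})
print(df)
```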

I have tried using a for loop, but it runs slowly, and I don't think this is how pandas should be used.

I want to create another pandas dataframe based on this one, like:

Sentence#         Word        Tag
  1                I         Object 
  1               love       Object
  1               apple      fruit
  2                I         Object
  2               ate        Object
  2               potato     vegetable

Split the Text column into words and sentence numbers. Apart from the entity word, all other words should be tagged as Object.

Use split, stack and map:

u = df.Text.str.split(expand=True).stack()

pd.DataFrame({
    'Sentence': u.index.get_level_values(0) + 1, 
    'Word': u.values, 
    'Entity': u.map(dict(zip(df.value, df.entity))).fillna('Object').values
})

   Sentence    Word     Entity
0         1       I     Object
1         1    love     Object
2         1   apple      fruit
3         2       I     Object
4         2     ate     Object
5         2  potato  vegetable

Side note: if you are running pandas v0.24 or later, use .to_numpy() instead of .values.
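Putting the side note into practice, the same answer might look like this with .to_numpy(). It is a self-contained sketch that assumes the sample data from the question; the variable names tags and result are mine, not the answerer's.

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# One row per word; the first index level remembers the source row.
u = df["Text"].str.split(expand=True).stack()

# Lookup from each entity word to its tag, e.g. {"apple": "fruit"}.
tags = dict(zip(df["value"], df["entity"]))

result = pd.DataFrame({
    "Sentence": u.index.get_level_values(0).to_numpy() + 1,
    "Word": u.to_numpy(),
    # Words missing from the lookup come back as NaN and become "Object".
    "Entity": u.map(tags).fillna("Object").to_numpy(),
})
print(result)
```

Note that the lookup is keyed on the word alone, so an entity word from one sentence is tagged the same way wherever it appears, including in other sentences.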

Here I use unnesting (a custom helper function, not a pandas built-in) after str.split:

df.Text=df.Text.str.split(' ')
yourdf=unnesting(df,['Text'])
yourdf.loc[yourdf.Text.values!=yourdf.value.values,'entity']='object'
yourdf
     Text  start  end     entity   value
0       I      7   11     object   apple
0    love      7   11     object   apple
0   apple      7   11      fruit   apple
1       I      6   11     object  potato
1     ate      6   11     object  potato
1  potato      6   11  vegetable  potato
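The unnesting helper is not defined in this answer. As a stand-in, and not the answerer's own function, DataFrame.explode (available from pandas 0.25) should give the same shape of result. The sketch below assumes the sample data from the question; the name out is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# Turn each sentence into a list of words, then give every word its own row.
# explode() is swapped in here for the undefined `unnesting` helper.
out = df.assign(Text=df["Text"].str.split(" ")).explode("Text")

# Words that differ from the row's entity value get the generic tag.
mask = (out["Text"] != out["value"]).to_numpy()
out.loc[mask, "entity"] = "object"
print(out)
```

The exploded frame keeps the original row labels (0, 0, 0, 1, 1, 1), so adding 1 to the index would give the sentence number.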

Using the expand function I posted in this thread (defined below), you can do:

df = expand(df, 'Text', sep=' ')

Then simply:

df['Tag'] = np.where(df.Text.ne(df.value), 'Object', df.entity)


>>> df[['Text', 'Tag']]

    Text    Tag
0   I       Object
1   love    Object
2   apple   fruit
3   I       Object
4   ate     Object
5   potato  vegetable

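import pandas as pd

def expand(df, col, sep=','):
    # Split each string in `col` into a list of pieces.
    r = df[col].str.split(sep)
    # Repeat every column's values once per piece in that row.
    d = {c: df[c].values.repeat(r.str.len(), axis=0) for c in df.columns}
    # Overwrite `col` with the flattened pieces, one per output row.
    d[col] = [i for sub in r for i in sub]
    return pd.DataFrame(d)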
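For an end-to-end run, the pieces of this answer can be combined as in the sketch below. The expand here is a lightly adapted copy (it uses .to_numpy() and map(len) rather than .values and .str.len()). The Sentence column is my own addition to carry the sentence number asked for in the question; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

def expand(df, col, sep=","):
    # Adapted copy of the answer's helper: one output row per piece of `col`.
    r = df[col].str.split(sep)
    counts = r.map(len).to_numpy()
    d = {c: df[c].to_numpy().repeat(counts) for c in df.columns}
    d[col] = [piece for pieces in r for piece in pieces]
    return pd.DataFrame(d)

df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# Assumption: number the sentences before expanding so each word inherits it.
numbered = df.assign(Sentence=np.arange(1, len(df) + 1))

words = expand(numbered, "Text", sep=" ")
words["Tag"] = np.where(words["Text"].ne(words["value"]), "Object", words["entity"])
print(words[["Sentence", "Text", "Tag"]])
```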

