Split sentences in pandas into sentence number and words
I have a pandas DataFrame like this:
Text           start  end  entity     value
I love apple   7      11   fruit      apple
I ate potato   6      11   vegetable  potato
I have tried a for loop, but it runs slowly, and I don't think that is how pandas is meant to be used.
I want to create another pandas DataFrame based on this one, like:
Sentence# Word Tag
1 I Object
1 love Object
1 apple fruit
2 I Object
2 ate Object
2 potato vegetable
Split the Text column into words and sentence numbers. Apart from the entity word, every other word should be tagged as Object.
Use split, stack and map:
u = df.Text.str.split(expand=True).stack()
pd.DataFrame({
'Sentence': u.index.get_level_values(0) + 1,
'Word': u.values,
'Entity': u.map(dict(zip(df.value, df.entity))).fillna('Object').values
})
Sentence Word Entity
0 1 I Object
1 1 love Object
2 1 apple fruit
3 2 I Object
4 2 ate Object
5 2 potato vegetable
Side note: if you are running pandas v0.24 or later, use .to_numpy() instead of .values.
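For anyone who wants to run the split, stack and map approach end to end, here is a minimal self-contained sketch. The sample frame is copied from the question; the variable names tag_by_word and result are mine, and .to_numpy() is used as suggested in the side note.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# One row per word; the first index level is the original row position.
u = df["Text"].str.split(expand=True).stack()

# Lookup table from entity word to its tag (name chosen for illustration).
tag_by_word = dict(zip(df["value"], df["entity"]))

result = pd.DataFrame({
    "Sentence": u.index.get_level_values(0).to_numpy() + 1,
    "Word": u.to_numpy(),
    # Words missing from the lookup become NaN and are then tagged "Object".
    "Entity": u.map(tag_by_word).fillna("Object").to_numpy(),
})
print(result)
```

One thing to keep in mind: the lookup is built from the whole value column, so a word that is the entity value of any row will get that tag in every sentence where it appears, not only in its own row.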
I am using unnesting here after str.split. Note that unnesting is a user-defined helper function, not part of pandas, and its definition is not shown here.
df.Text=df.Text.str.split(' ')
yourdf=unnesting(df,['Text'])
yourdf.loc[yourdf.Text.values!=yourdf.value.values,'entity']='object'
yourdf
Text start end entity value
0 I 7 11 object apple
0 love 7 11 object apple
0 apple 7 11 fruit apple
1 I 6 11 object potato
1 ate 6 11 object potato
1 potato 6 11 vegetable potato
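Since unnesting is not defined in this answer, here is a sketch of the same idea with the built-in DataFrame.explode, assuming pandas 0.25 or later. This is a substitute technique, not the unnesting helper itself; the sample frame is copied from the question and the name words is mine.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# Split each sentence into a list of words, then emit one row per word.
# explode keeps the original row label, so the index repeats (0, 0, 0, 1, 1, 1).
words = df.assign(Text=df["Text"].str.split(" ")).explode("Text")

# Overwrite the entity wherever the word differs from that row's entity value.
not_entity = words["Text"].to_numpy() != words["value"].to_numpy()
words.loc[not_entity, "entity"] = "object"
print(words)
```

Because the comparison is made row by row against value, a word is only tagged when it matches the entity of its own sentence. Call reset_index(drop=True) afterwards if you want a fresh 0..n-1 index.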
Using the expand function I posted in this thread (reproduced at the end of this answer), you can do:
df = expand(df, 'Text', sep=' ')
Then simply:
df['Tag'] = np.where(df.Text.ne(df.value), ['Object'], df.entity)
>>> df[['Text', 'Tag']]
Text Tag
0 I Object
1 love Object
2 apple fruit
3 I Object
4 ate Object
5 potato vegetable
def expand(df, col, sep=','):
    r = df[col].str.split(sep)
    d = {c: df[c].values.repeat(r.str.len(), axis=0) for c in df.columns}
    d[col] = [i for sub in r for i in sub]
    return pd.DataFrame(d)
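To try this approach without piecing the snippets together, here is a self-contained sketch. expand_rows is a hypothetical variant of expand that I wrote for illustration: it calls np.repeat on .to_numpy() arrays instead of .values.repeat, which is my own substitution. The sample frame is copied from the question.

```python
import numpy as np
import pandas as pd


def expand_rows(df, col, sep=","):
    """Hypothetical variant of expand: one output row per piece of df[col]."""
    parts = df[col].str.split(sep)
    counts = parts.str.len().to_numpy()  # number of pieces in each row
    # Repeat every column so each piece gets a copy of its row's values.
    data = {c: np.repeat(df[c].to_numpy(), counts) for c in df.columns}
    # Replace the split column with the flattened pieces.
    data[col] = [piece for pieces in parts for piece in pieces]
    return pd.DataFrame(data)


# Sample data copied from the question.
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

out = expand_rows(df, "Text", sep=" ")
# Tag a word with its entity only when it equals that row's value.
out["Tag"] = np.where(out["Text"].ne(out["value"]), "Object", out["entity"])
print(out[["Text", "Tag"]])
```

The result has a fresh 0..5 index, so the two columns can be compared directly with ne.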