Pandas groupby on text: get sentence numbering for multiple sentences per group

My dataframe looks like this:

    id      sentence                                            ind
    747     A simple and convenient colorimetric method is...   NaN
    747     A simple and convenient colorimetric method is...   NaN
    747     A simple and convenient colorimetric method is...   ulcerative 
    749     Of special significance was the increased acti...   NaN
    749     Of special significance was the increased acti...   NaN
    749     Of special significance was the increased acti...   head injuries
    749     Of special significance was the increased acti...   NaN
    858     Some patients with acute viral hepatitis or pr...   acute viral 
    858     Some patients with acute viral hepatitis or pr...   NaN
    858     Some patients with acute viral hepatitis or pr...   NaN
    948     The other ALP isozyme of FL cells had properti...   NaN
    948     The other ALP isozyme of FL cells had properti...   NaN
    948     The other ALP isozyme of FL cells had properti...   NaN
    948     It was found that a human hepatoma-associated ...   NaN
    948     It was found that a human hepatoma-associated ...   hepatoma
    948     It was found that a human hepatoma-associated ...   NaN
    948     It was more heat stable and more sensitive to ...   virus
    948     It was more heat stable and more sensitive to ...   NaN
    948     It was more heat stable and more sensitive to ...   NaN

I'm using df.groupby(['id', 'sentence']).first().head(20) and I get this:

pmid    sentence                                            ind
747     A simple and convenient colorimetric method is...   NaN
749     Of special significance was the increased acti...   NaN
858     Some patients with acute viral hepatitis or pr...   acute viral 
948      It was found that a human hepatoma-associated...   hepatoma
         It was more heat stable and more sensitive to...   virus

As we can see, for id=948 there is more than one (id, sentence) pair.

My question is: since one id can have more than one (id, sentence) pair, is there a way to number the sentences within each id in my dataframe?

For example, I would like to get something like this:

id   sentence_nr   sentence                                           ind
747  01            A simple and convenient colorimetric method is...  NaN
749  01            Of special significance was the increased acti...  NaN
858  01            Some patients with acute viral hepatitis or pr...  acute viral 
948  01            It was found that a human hepatoma-associated ...  hepatoma 
948  02            It was more heat stable and more sensitive to ...  virus

You could use GroupBy.cumcount:

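# Collapse to one row per (id, sentence); first() keeps the first non-null ind in each group
df_grouped = df.groupby(['id', 'sentence'], as_index=False).first()
# cumcount() numbers the rows within each id starting at 0, so add 1
df_grouped['sentence_nr'] = df_grouped.groupby('id').cumcount() + 1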

print(df_grouped)
    id                                           sentence            ind  sentence_nr
0  747  A simple and convenient colorimetric method is...     ulcerative            1
1  749  Of special significance was the increased acti...  head injuries            1
2  858  Some patients with acute viral hepatitis or pr...    acute viral            1
3  948  It was found that a human hepatoma-associated ...       hepatoma            1
4  948  It was more heat stable and more sensitive to ...          virus            2
5  948  The other ALP isozyme of FL cells had properti...           None            3
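
Note that groupby sorts the group keys by default, so in the output above the sentences for id 948 are numbered alphabetically rather than in the order they appear in the dataframe. If the original order matters, and if you want the zero-padded numbers ("01", "02") from the question, a minimal sketch along the following lines may help. The sample data is a shortened stand-in for the question's dataframe, not the real data, and passing sort=False and using str.zfill are suggestions that go beyond the answer above.

```python
import pandas as pd

# Shortened stand-in for the question's dataframe (illustrative values only).
df = pd.DataFrame({
    "id": [747, 747, 948, 948, 948, 948],
    "sentence": [
        "A simple and convenient colorimetric method",
        "A simple and convenient colorimetric method",
        "The other ALP isozyme of FL cells",
        "The other ALP isozyme of FL cells",
        "It was found that a human hepatoma-associated",
        "It was more heat stable",
    ],
    "ind": [None, "ulcerative", None, None, "hepatoma", "virus"],
})

# sort=False keeps the groups in order of first appearance
# instead of sorting the (id, sentence) keys.
df_grouped = df.groupby(["id", "sentence"], as_index=False, sort=False).first()

# Number the sentences within each id, then zero-pad to two digits.
df_grouped["sentence_nr"] = (
    (df_grouped.groupby("id").cumcount() + 1).astype(str).str.zfill(2)
)

print(df_grouped)
```

With this sample data, the "The other ALP isozyme..." sentence comes first for id 948 and gets number 01, because it appears first in the input. The zero-padded sentence_nr column holds strings, so keep the integer version if you need to do arithmetic on it.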
