
Pandas `groupby` seems to only apply function to first group

I have a pandas DataFrame with part-of-speech tags that I am trying to build a part-of-speech tagger with. It looks something like this.

import pandas as pd

silly_df = pd.DataFrame.from_dict({"INDEX": [1, 1, 1, 2, 2, 2, 2, 2],
                                   "LABEL": ['X', 'Y', 'Z', 'Z', 'Z', 'X', 'X', 'Y']})

which looks like:

   INDEX LABEL
0      1     X
1      1     Y
2      1     Z
3      2     Z
4      2     Z
5      2     X
6      2     X
7      2     Y

The INDEX column groups tokens together, and each token has a label.

However, I would like to modify the labels to improve the performance of my model. I would like to convert each "Z" into either "B-Z" or "I-Z", where "B-Z" indicates that we are at the beginning of a (possibly length-1) run of "Z"s, and "I-Z" indicates that we are on the inside (or possibly at the end) of a run of "Z"s of length greater than 1. All of this transformation should take place within each INDEX group, so that the desired output is

   INDEX LABEL  NEW_LABEL
0      1     X          X
1      1     Y          Y
2      1     Z        B-Z
3      2     Z        B-Z
4      2     Z        I-Z
5      2     X          X
6      2     X          X
7      2     Y          Y

I have written some code which performs this relabeling on a single list of labels within one index level:

from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    # Duplicate the iterator and advance the second copy by one element.
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def add_b_i(beg, inside, match, labels):
    for i, (s, t) in enumerate(pairwise(labels)):
        if t == match:
            if s != match:
                labels[i+1] = beg
            else:
                labels[i+1] = inside
    return labels

Now I would like to apply this function groupwise, but when I try, I get:

silly_df.groupby('INDEX')['LABEL'].transform(lambda x: add_b_i('B-Z', 'I-Z', 'Z', x))

Output:

0      X
1      Y
2    B-Z
3      Z
4      Z
5      X
6      X
7      Y

It seems to only be applying the function to the first group. How come?
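
A likely cause is indexing rather than `groupby` itself. Inside `add_b_i`, `labels` is a pandas Series that keeps the original row index, and with an integer index `labels[i+1] = ...` is resolved by label, not by position. In the first group the labels 0, 1, 2 happen to equal the positions, so the assignment lands where intended; in the second group the labels are 3 to 7, so `i+1` does not point at the rows you meant. Here is a minimal sketch of the effect, using a hypothetical Series `group` that mimics the start of the second group (the names `group`, `by_label` and `by_position` are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for part of the second group: row labels start at 3, not 0.
group = pd.Series(["Z", "Z", "X"], index=[3, 4, 5])

# Integer key on an integer index is treated as a label.
# Label 1 does not exist, so none of the rows 3, 4, 5 is modified.
by_label = group.copy()
by_label[1] = "I-Z"

# .iloc always addresses by position, so the second row is changed.
by_position = group.copy()
by_position.iloc[1] = "I-Z"

print(by_label.loc[[3, 4, 5]].tolist())   # rows 3, 4, 5 as seen after the label-based write
print(by_position.tolist())               # rows after the positional write
```

Assuming this is what happens in your case, writing through `labels.iloc[i+1]` on a copy of the group should make the loop version behave. Separately, `pairwise` never yields the first element as `t`, so a group that begins with "Z" would not get its first row relabeled by that loop.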

You can try this vectorized approach (you usually don't need to enumerate a Series, since it already carries its index):

import pandas as pd
import numpy as np

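def add_b_i(beg, inside, match, labels):
    # Boolean mask: True where the label is the one to relabel.
    match_logic = labels == match
    # Running count of matches in this group; it equals 1 up to the second match.
    match_count = match_logic.cumsum()
    # Keep non-matching labels; the first match becomes `beg`, all later ones `inside`.
    # Note: a later match is labeled `inside` even if it follows a gap.
    return labels.where(~match_logic,
                        np.where(match_logic & (match_count == 1), beg, inside))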

silly_df.groupby('INDEX')['LABEL'].transform(lambda x: add_b_i('B-Z', 'I-Z', 'Z', x))

#0      X
#1      Y
#2    B-Z
#3    B-Z
#4    I-Z
#5      X
#6      X
#7      Y
#Name: LABEL, dtype: object
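
One caveat: the `cumsum` check marks only the first "Z" in each group as the beginning, so a second, separate run of "Z"s in the same group would come out as "I-Z". If your groups can contain several runs, one option is to compare each row with the previous row of the same group. The sketch below assumes that reading of the requirement; `add_b_i_runs` and the sample frame `demo_df` are hypothetical names, not part of the original code.

```python
import pandas as pd

def add_b_i_runs(beg, inside, match, labels):
    # True where the label is the one to relabel.
    is_match = labels == match
    # True where the previous row in this group matched (False for the first row).
    prev_match = is_match.shift(fill_value=False)
    # A run starts at a match whose predecessor is not a match.
    starts = is_match & ~prev_match
    # Relabel run starts first, then the remaining matches.
    return labels.mask(starts, beg).mask(is_match & ~starts, inside)

# Hypothetical data: group 2 contains two separate runs of "Z".
demo_df = pd.DataFrame({"INDEX": [1, 1, 1, 2, 2, 2, 2, 2, 2],
                        "LABEL": ["X", "Y", "Z", "Z", "Z", "X", "Z", "Z", "Y"]})

result = demo_df.groupby("INDEX")["LABEL"].transform(
    lambda x: add_b_i_runs("B-Z", "I-Z", "Z", x))
print(result.tolist())
# ['X', 'Y', 'B-Z', 'B-Z', 'I-Z', 'X', 'B-Z', 'I-Z', 'Y']
```

Because `shift` runs inside each group, the first row of a group never sees the last row of the previous group, so runs do not leak across INDEX values.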
