I have a pandas DataFrame with part-of-speech tags that I am trying to build a part-of-speech tagger with. It looks something like this.
import pandas as pd

silly_df = pd.DataFrame.from_dict({"INDEX": [1, 1, 1, 2, 2, 2, 2, 2],
                                   "LABEL": ['X', 'Y', 'Z', 'Z', 'Z', 'X', 'X', 'Y']})
which looks like:
INDEX LABEL
0 1 X
1 1 Y
2 1 Z
3 2 Z
4 2 Z
5 2 X
6 2 X
7 2 Y
The INDEX column groups tokens together, and each token has a label.
However, I would like to modify the labels to improve the performance of my model. I would like to convert each "Z" into either "B-Z" or "I-Z", where "B-Z" indicates that we are at the beginning of a (possibly length-1) run of "Z"s, and "I-Z" indicates that we are inside (or possibly at the end of) a run of "Z"s of length greater than 1. All of this transformation should take place within each INDEX group, so that the desired output is
   INDEX LABEL NEW_LABEL
0      1     X         X
1      1     Y         Y
2      1     Z       B-Z
3      2     Z       B-Z
4      2     Z       I-Z
5      2     X         X
6      2     X         X
7      2     Y         Y
I have written some code which performs this relabeling on a single list of labels within one index level:
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def add_b_i(beg, inside, match, labels):
    # Compare each label with its predecessor and relabel matches in place.
    for i, (s, t) in enumerate(pairwise(labels)):
        if t == match:
            if s != match:
                labels[i+1] = beg
            else:
                labels[i+1] = inside
    return labels
Now I would like to apply this function groupwise, but when I try, I get:
silly_df.groupby('INDEX')['LABEL'].transform(lambda x: add_b_i('B-Z', 'I-Z', 'Z', x))
Output:
0 X
1 Y
2 B-Z
3 Z
4 Z
5 X
6 X
7 Y
It seems to only be applying the function to the first group. How come?
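A likely explanation, offered as an assumption rather than a verified trace of pandas internals: each group is passed to the function as a Series that keeps its original row labels (0 to 2 for the first group, 3 to 7 for the second). With an integer index, `labels[i+1]` is resolved by label rather than by position, so the assignment only lands on the intended rows when the labels happen to start at 0. The loop also never inspects the first element of a group, since it only relabels the second item of each pair. The sketch below shows the per-group row labels and a hypothetical helper, `add_b_i_by_position`, that walks the values by position instead.

```python
import pandas as pd

silly_df = pd.DataFrame({"INDEX": [1, 1, 1, 2, 2, 2, 2, 2],
                         "LABEL": ["X", "Y", "Z", "Z", "Z", "X", "X", "Y"]})

# Each group keeps the row labels it had in the original frame.
group_index = {int(key): list(group.index)
               for key, group in silly_df.groupby("INDEX")["LABEL"]}

def add_b_i_by_position(beg, inside, match, labels):
    """Hypothetical variant: iterate over values, ignoring index labels."""
    out = []
    prev = None  # previous original label; None at the start of a group
    for value in labels.tolist():
        if value == match:
            # Start of a run unless the previous label also matched.
            out.append(inside if prev == match else beg)
        else:
            out.append(value)
        prev = value
    # Reattach the group's own index so transform can align the result.
    return pd.Series(out, index=labels.index)

new_labels = silly_df.groupby("INDEX")["LABEL"].transform(
    lambda x: add_b_i_by_position("B-Z", "I-Z", "Z", x))
```

Returning a new Series instead of mutating the input also avoids modifying the group that pandas hands to the function.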
You can try this vectorized approach (you usually don't need to enumerate over a Series, since it already carries an index):
import pandas as pd
import numpy as np
def add_b_i(beg, inside, match, labels):
    match_logic = labels == match
    match_count = match_logic.cumsum()
    return labels.where(~match_logic,
                        np.where(match_logic & (match_count == 1), beg, inside))
silly_df.groupby('INDEX')['LABEL'].transform(lambda x: add_b_i('B-Z', 'I-Z', 'Z', x))
#0 X
#1 Y
#2 B-Z
#3 B-Z
#4 I-Z
#5 X
#6 X
#7 Y
#Name: LABEL, dtype: object
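Note that the `cumsum` test marks only the first "Z" in each group as the beginning, so a second, separate run of "Z"s in the same group would be labelled "I-Z" throughout. If groups can contain several runs, one possible adjustment is to compare each label with its predecessor using `shift`. This is a minimal sketch, not part of the original answer; the function name `add_b_i_runs` and the `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def add_b_i_runs(beg, inside, match, labels):
    is_match = labels == match
    # A run starts where this label matches but the previous one does not.
    # fill_value=False treats the first row of a group as having no match before it.
    starts_run = is_match & ~is_match.shift(fill_value=False)
    return labels.where(~is_match, np.where(starts_run, beg, inside))

# Illustrative data: one group containing two separate runs of "Z".
demo = pd.DataFrame({"INDEX": [1, 1, 1, 1, 1],
                     "LABEL": ["Z", "X", "Z", "Z", "Y"]})

relabeled = demo.groupby("INDEX")["LABEL"].transform(
    lambda x: add_b_i_runs("B-Z", "I-Z", "Z", x))
```

Here the "Z" at position 2 starts a new run and so receives "B-Z", whereas the `cumsum` version would label it "I-Z".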