Splitting the string values of a column across rows and filling the rest with NaN in a Pandas DataFrame
I have a DataFrame like this:
ROW_A ROW_B
1 tata+toto
2 tata+toto
3 tata+toto
4 ti+tu+te
5 ti+tu+te
6 ti+tu+te
7 ti+tu+te
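For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is a minimal sketch that assumes a default integer index and plain string values, neither of which is stated in the post.

```python
import pandas as pd

# Sample data copied from the table above; the default RangeIndex (0..6)
# is an assumption, the post does not show how the frame was created.
df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})
print(df)
```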
I want to split the ROW_B values into a new column, ROW_C, one part per row. I know that the number of split values does not match the length of the index, but I just want to split the values and fill the remaining rows with NaN, like this:
ROW_A ROW_B ROW_C
1 tata+toto tata
2 tata+toto toto
3 tata+toto NaN
4 ti+tu+te ti
5 ti+tu+te tu
6 ti+tu+te te
7 ti+tu+te NaN
I tried this code:
df_columns = df.columns
row_b = df_columns[1]
df['ROW_C'] = df.groupby('ROW_A')[row_b].transform(lambda x: x.head(1).str.split('+').explode().values)
Here is the error message:
ValueError: Length of values (2) does not match length of index (3)
One option is to use drop_duplicates + str.split + explode to create a temporary Series, then reindex it with df.index to get the NaNs:
tmp = df['ROW_B'].drop_duplicates().str.split('+').explode()
df['ROW_C'] = tmp.set_axis(tmp.groupby(level=0).cumcount().pipe(lambda x: x+x.index), axis=0).reindex(df.index)
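The one-liner above is dense, so here is the same idea unrolled into named steps. It is only a sketch: the names `offset` and `target` are made up for illustration, and it assumes a default RangeIndex, that equal strings sit in one contiguous block, and that no string has more parts than its block has rows.

```python
import pandas as pd

df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})

# One row per distinct string, split into parts. After explode the index
# repeats the label of the first row of each block: 0, 0, 3, 3, 3
tmp = df["ROW_B"].drop_duplicates().str.split("+").explode()

# Position of each part within its block: 0, 1, 0, 1, 2
offset = tmp.groupby(level=0).cumcount()

# First-row label plus position gives the target row label: 0, 1, 3, 4, 5
target = offset.index.to_numpy() + offset.to_numpy()

# Relabel the parts, then reindex so rows without a part become NaN
df["ROW_C"] = tmp.set_axis(target).reindex(df.index)
print(df)
```

If a string had more parts than rows in its block, the computed labels would run into the next block and the reindex would fail on duplicate labels, so this only suits data shaped like the example.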
Another option is to use groupby + cumcount to number the rows within each group, then index the split list in each row with that number. Since the number can exceed the list length, wrap the lookup in try-except:
out = []
for i, lst in zip(df.groupby('ROW_B').cumcount(), df['ROW_B'].str.split('+')):
    try:
        out.append(lst[i])
    except IndexError:
        # fewer parts than rows in the group: pad with NaN
        out.append(float('nan'))
df['ROW_C'] = out
Output:
ROW_A ROW_B ROW_C
0 1 tata+toto tata
1 2 tata+toto toto
2 3 tata+toto NaN
3 4 ti+tu+te ti
4 5 ti+tu+te tu
5 6 ti+tu+te te
6 7 ti+tu+te NaN
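The transform attempt from the question can probably be rescued as well. It fails because transform expects the function to return one value per row of the group, while the split yields fewer. A sketch that pads the parts to the group length is shown below; `pad_parts` is a hypothetical helper, not a pandas function, and it assumes the frame is grouped by ROW_B and that extra parts may simply be dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})

def pad_parts(group):
    """Split the group's shared string and pad with NaN to the group length."""
    # Every row in the group holds the same string, so the first one is enough.
    # Parts that do not fit into the group are dropped (an assumption).
    parts = group.iloc[0].split("+")[: len(group)]
    padding = [np.nan] * (len(group) - len(parts))
    return np.array(parts + padding, dtype=object)

# The returned array has exactly one entry per row, so transform accepts it
df["ROW_C"] = df.groupby("ROW_B")["ROW_B"].transform(pad_parts)
print(df)
```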
You could group by column ROW_B and then create a new column on each of the groups:
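from itertools import zip_longest

import numpy as np
import pandas as pd

recons_df = []
for k, g in df.groupby('ROW_B'):
    g = g.copy()  # work on a copy to avoid SettingWithCopyWarning
    # pair each split part with a NaN placeholder, one pair per row of the group
    g['ROW_C'] = [x if x else y for x, y in zip_longest(k.split('+'), [np.nan] * g.index.size)]
    recons_df.append(g)
recons_df = pd.concat(recons_df)
print(recons_df)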
# ROW_A ROW_B ROW_C
#0 1 tata+toto tata
#1 2 tata+toto toto
#2 3 tata+toto NaN
#3 4 ti+tu+te ti
#4 5 ti+tu+te tu
#5 6 ti+tu+te te
#6 7 ti+tu+te NaN
If you don't need the NaN padding and instead want one row per split value, use:
df.merge(df['ROW_B'].str.split('+', expand=True).stack().reset_index(), left_on=[df.index], right_on=['level_0']).drop(['level_0', 'level_1'], axis=1).rename({0: 'ROW_C'}, axis=1)
Output:
ROW_A ROW_B ROW_C
0 1 tata+toto tata
1 1 tata+toto toto
2 2 tata+toto tata
3 2 tata+toto toto
4 3 tata+toto tata
5 3 tata+toto toto
6 4 ti+tu+te ti
7 4 ti+tu+te tu
8 4 ti+tu+te te
9 5 ti+tu+te ti
10 5 ti+tu+te tu
11 5 ti+tu+te te
12 6 ti+tu+te ti
13 6 ti+tu+te tu
14 6 ti+tu+te te
15 7 ti+tu+te ti
16 7 ti+tu+te tu
17 7 ti+tu+te te
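A shorter route to the same long format is probably `DataFrame.explode`. This is a sketch rather than the answer's own method; it assumes pandas 1.1 or later for the `ignore_index` argument, and `long_df` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})

# Put the list of parts in ROW_C, then emit one row per list element
long_df = (
    df.assign(ROW_C=df["ROW_B"].str.split("+"))
      .explode("ROW_C", ignore_index=True)
)
print(long_df)
```

Each original row is repeated once per part, so the seven sample rows become eighteen.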