Splitting string values of a column across rows and filling with NaN in a Pandas DataFrame

I have a DataFrame like this:

ROW_A    ROW_B
1        tata+toto
2        tata+toto
3        tata+toto
4        ti+tu+te
5        ti+tu+te
6        ti+tu+te
7        ti+tu+te

I want to split the ROW_B values into a new column, one part per row. I know that the number of parts does not match the number of rows in each group, but I just want to split the values and fill the remaining rows with NaN, like this:

ROW_A    ROW_B       ROW_C
1        tata+toto   tata
2        tata+toto   toto
3        tata+toto   NaN
4        ti+tu+te    ti
5        ti+tu+te    tu
6        ti+tu+te    te
7        ti+tu+te    NaN
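For anyone who wants to reproduce the problem, the input frame can be rebuilt roughly as follows. This is a minimal sketch that assumes the default integer index (0 to 6), which the printed outputs further down also show.

```python
import pandas as pd

# Rebuild the example frame from the question (default RangeIndex 0..6)
df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})
print(df)
```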

I tried this code:

df_columns = df.columns
row_b = df_columns[1]

df['ROW_C'] = df.groupby('ROW_A')[row_b].transform(lambda x:x.head(1).str.split('+').explode().values)).fillna

Here is the error message:

ValueError: Length of values (2) does not match length of index (3)
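The message is consistent with how transform works: whatever the function returns for a group must have the same length as that group. The sketch below does not call transform; it only compares, for each distinct ROW_B value, the number of split parts with the number of rows. Grouping by ROW_B here is an assumption made for illustration, since the posted snippet groups by ROW_A.

```python
import pandas as pd

df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})

# For each distinct ROW_B value: (number of split parts, number of rows in the group)
mismatch = {key: (len(key.split("+")), len(group)) for key, group in df.groupby("ROW_B")}
print(mismatch)
# {'tata+toto': (2, 3), 'ti+tu+te': (3, 4)}
```

In both groups the parts are one short of the rows, which is why the missing positions have to be padded with NaN.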

One option is to chain drop_duplicates + str.split + explode to create a temporary Series, then relabel it and reindex it with df.index so that the rows without a part become NaN:

tmp = df['ROW_B'].drop_duplicates().str.split('+').explode()
df['ROW_C'] = tmp.set_axis(tmp.groupby(level=0).cumcount().pipe(lambda x: x+x.index), axis=0).reindex(df.index)
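The one-liner is dense, so here is the same idea unpacked step by step. It is a sketch under a few assumptions that hold for the example data: the frame has the default integer index, equal ROW_B values sit in consecutive rows, and no group has fewer rows than parts. The names offset and target are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})

# One entry per part, taken from the first row of each distinct ROW_B value.
# The exploded Series keeps the original labels: 0, 0, 3, 3, 3
tmp = df["ROW_B"].drop_duplicates().str.split("+").explode()

# Position of each part inside its own list: 0, 1, 0, 1, 2
offset = tmp.groupby(level=0).cumcount()

# Target row label = label of the group's first row + position: 0, 1, 3, 4, 5
target = tmp.index.to_numpy() + offset.to_numpy()

# Relabel the parts, then reindex so that rows without a part become NaN
df["ROW_C"] = pd.Series(tmp.to_numpy(), index=target).reindex(df.index)
print(df)
```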

Another option is to use groupby + cumcount to number the rows within each group, then use that number to index into the split list of each row. Since the number can exceed the last valid list position, wrap the lookup in try-except:

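out = []
# cumcount numbers the rows inside each ROW_B group: 0, 1, 2, ...
for i, lst in zip(df.groupby('ROW_B').cumcount(), df['ROW_B'].str.split('+')):
    try:
        out.append(lst[i])
    except IndexError:
        # the group has more rows than parts: pad with NaN
        out.append(float('nan'))
df['ROW_C'] = out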

Output:

   ROW_A      ROW_B ROW_C
0      1  tata+toto  tata
1      2  tata+toto  toto
2      3  tata+toto   NaN
3      4   ti+tu+te    ti
4      5   ti+tu+te    tu
5      6   ti+tu+te    te
6      7   ti+tu+te   NaN
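If the loop is needed in more than one place, it could be wrapped in a small helper that replaces try-except with an explicit bounds check. The function name spread_parts and its parameters are hypothetical, chosen only for this sketch.

```python
import numpy as np
import pandas as pd


def spread_parts(df, col="ROW_B", sep="+"):
    """Return one split part per row, NaN once a group's parts are used up."""
    positions = df.groupby(col).cumcount()  # 0, 1, 2, ... within each group
    parts = df[col].str.split(sep)          # list of parts for every row
    return [lst[i] if i < len(lst) else np.nan for i, lst in zip(positions, parts)]


df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})
df["ROW_C"] = spread_parts(df)
print(df)
```

Unlike the reindex-based option, this version does not depend on the index labels or on equal values being adjacent, because cumcount counts occurrences per value wherever they appear.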

You could group by column ROW_B and then create the new column on each of the groups:

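from itertools import zip_longest

import numpy as np
import pandas as pd

recons_df = []
for k, g in df.groupby('ROW_B'):
    g = g.copy()  # work on a copy of the group to avoid SettingWithCopyWarning
    # zip_longest yields None for x once the split parts run out, so NaN (y) is used instead
    g['ROW_C'] = [x if x else y for x, y in zip_longest(k.split('+'), [np.nan] * len(g))]
    recons_df.append(g)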

recons_df = pd.concat(recons_df)
print(recons_df)
#   ROW_A      ROW_B ROW_C
#0      1  tata+toto  tata
#1      2  tata+toto  toto
#2      3  tata+toto   NaN
#3      4   ti+tu+te    ti
#4      5   ti+tu+te    tu
#5      6   ti+tu+te    te
#6      7   ti+tu+te   NaN

If you don't need the NaN padding and instead want one row for every part of every original row, use:

df.merge(df['ROW_B'].str.split('+', expand=True).stack().reset_index(), left_on=[df.index], right_on=['level_0']).drop(['level_0', 'level_1'], axis=1).rename({0: 'ROW_C'}, axis=1)

Output:

    ROW_A      ROW_B ROW_C
0       1  tata+toto  tata
1       1  tata+toto  toto
2       2  tata+toto  tata
3       2  tata+toto  toto
4       3  tata+toto  tata
5       3  tata+toto  toto
6       4   ti+tu+te    ti
7       4   ti+tu+te    tu
8       4   ti+tu+te    te
9       5   ti+tu+te    ti
10      5   ti+tu+te    tu
11      5   ti+tu+te    te
12      6   ti+tu+te    ti
13      6   ti+tu+te    tu
14      6   ti+tu+te    te
15      7   ti+tu+te    ti
16      7   ti+tu+te    tu
17      7   ti+tu+te    te
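The same long format can probably be reached more directly with DataFrame.explode, which avoids the merge and the helper columns. This is a sketch of an alternative, not the code from the answer above; the name long_df is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ROW_A": range(1, 8),
    "ROW_B": ["tata+toto"] * 3 + ["ti+tu+te"] * 4,
})

long_df = (
    df.assign(ROW_C=df["ROW_B"].str.split("+"))  # ROW_C holds a list per row
      .explode("ROW_C")                           # one output row per list element
      .reset_index(drop=True)                     # renumber rows 0..17
)
print(long_df)
```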
