Split pandas column into new columns in presence of NaN

I have a pandas DataFrame containing a string column which needs splitting into two separate columns. The tolist-based answer that I found in another SO question works like a charm, except when my column contains NaNs. The excerpt below shows the problem:

import pandas as pd
import numpy as np

# Example DataFrame
df = pd.DataFrame([[25.0, '34.2/ 18.1', 'one'],
                   [32.6, '28.6/ 17.9', 'two'],
                   [12.5, '30.1/ 17.6', 'three']], columns=['A', 'B', 'C'])
df2 = df.copy()

# This method works when all data are present
df['D'] = pd.DataFrame(df['B'].str.split('/').tolist())[1]

# However, when there are NaNs:
df2.loc[0, 'B'] = np.nan

# This line fails
df2['D'] = pd.DataFrame(df2['B'].str.split('/').tolist())[1]

It gives me a KeyError, because the intermediate DataFrame has only one column, so the round trip through a list no longer accomplishes anything:

               0
0            NaN
1  [28.6,  17.9]
2  [30.1,  17.6]

I've tried dropping the NaN first via pd.DataFrame(df2['B'].str.split('/').dropna().tolist()), but then I lose my index, and I need to keep the NaN at index 0. I've also thought of somehow duplicating the NaN when creating the intermediate DataFrame to force the two columns, but am having no luck.

This is what I need df2 to look like:

      A           B      C     D
0  25.0         NaN    one   NaN
1  32.6  28.6/ 17.9    two  17.9
2  12.5  30.1/ 17.6  three  17.6

Is there a way to do this without using a list as an intermediary? Or somehow deal with the NaN?

You can continue to use your method if you use the str accessor again after the split (instead of using tolist() and making another DataFrame):

>>> df2['D'] = df2['B'].str.split('/').str[-1]
>>> df2
      A           B      C      D
0  25.0         NaN    one    NaN
1  32.6  28.6/ 17.9    two   17.9
2  12.5  30.1/ 17.6  three   17.6

Indexing through the str accessor returns NaN for any row where the value is missing or the requested position doesn't exist, instead of raising an error.
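If both halves are wanted as numeric columns, the same accessor-based idea can be extended as in the minimal sketch below. The column name B_first is made up for illustration, and the whitespace stripping is an assumption based on the sample data, where the second piece carries a leading space.

```python
import numpy as np
import pandas as pd

# Same data as df2 in the question, with the NaN already in place.
df2 = pd.DataFrame({'A': [25.0, 32.6, 12.5],
                    'B': [np.nan, '28.6/ 17.9', '30.1/ 17.6'],
                    'C': ['one', 'two', 'three']})

# Strings become lists of pieces; the NaN row stays NaN.
parts = df2['B'].str.split('/')

# Element-wise indexing through the str accessor is NaN-safe.
first = parts.str[0]
second = parts.str[1]

# Strip stray whitespace (' 17.9'), then convert to floats.
# 'B_first' is a hypothetical column name used only for this sketch.
df2['B_first'] = pd.to_numeric(first.str.strip())
df2['D'] = pd.to_numeric(second.str.strip())

print(df2)
```

Because every step works on the original Series, the index is preserved and the NaN stays at index 0.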

The str.extract method allows you to provide a regex pattern. Each group in the pattern is returned as a separate column. NaN is used when no match is found:

# expand=False returns a Series for a single capture group
df2['D'] = df2['B'].str.extract(r'/(.*)', expand=False)
print(df2)

yields

      A           B      C      D
0  25.0         NaN    one    NaN
1  32.6  28.6/ 17.9    two   17.9
2  12.5  30.1/ 17.6  three   17.6

Note that if you want the D column to be treated as floats, then you'll also need to call astype:

df2['D'] = df2['D'].astype('float')
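Since each group becomes its own column, a pattern with two groups can produce both halves in one pass. The sketch below is one possible way to do that; the group names first and second, the column name B_first, and the whitespace-tolerant pattern are assumptions chosen to fit the sample data, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same data as df2 in the question, with the NaN already in place.
df2 = pd.DataFrame({'A': [25.0, 32.6, 12.5],
                    'B': [np.nan, '28.6/ 17.9', '30.1/ 17.6'],
                    'C': ['one', 'two', 'three']})

# Two named groups -> a DataFrame with columns 'first' and 'second'.
# The \s* parts absorb whitespace around the slash, so no strip is needed.
pattern = r'^\s*(?P<first>[^/\s]+)\s*/\s*(?P<second>[^/\s]+)\s*$'
pieces = df2['B'].str.extract(pattern)

# Convert each extracted column to floats; unmatched rows remain NaN.
pieces = pieces.apply(pd.to_numeric)

# 'B_first' is a hypothetical column name used only for this sketch.
df2['B_first'] = pieces['first']
df2['D'] = pieces['second']

print(df2)
```

Rows that do not match the pattern, including the NaN row, come back as NaN in every extracted column, so the result stays aligned with the original index.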
