Pandas: how to split and merge columns with similar names?

I have a dirty dataframe whose columns need cleaning. Basically, there are a LOT of columns that hold combined data which should be kept separate, as well as column names with slight spelling differences. For example, I want to go from this:

         1    1/2    2c     2 c     
row
1       B     nan    C       nan 
2       B     nan    C       nan
3       nan   Rb     nan     nan
4       c     nan    nan     C

to something like this:

         1    2c    
row
1       B     C       
2       B     C       
3       Rb    Rb   
4       c     C

So the issue is twofold: how do you merge columns that were split apart by slight (fuzzy) differences in their names, and how do you split and then merge columns that hold combined values?

The only way I know to do this is to create a new column that uses .apply with if statements, but with hundreds of columns that would be painful. Any ideas for a less manual solution?

Try:

import numpy as np
import pandas as pd

d0 = df.filter(regex='/')          # Grab the columns with "/" in the name
d1 = df.drop(columns=d0.columns)   # Drop those columns (keyword form; positional axis is gone in pandas 2.x)

a = d0.to_numpy()
m = d0.columns.str.count('/')      # Count the number of "/" in each name

d2 = pd.DataFrame(
    a.repeat(m + 1, axis=1),       # Repeat each column one more time than its number of "/"
    d0.index,
    np.concatenate(d0.columns.str.split('/'))
)

d3 = pd.concat([d1, d2], axis=1)   # Smash them back together

# Grab the first run of digits in each column name,
# group by that key and take the first non-null value.
# groupby(..., axis=1) is deprecated in recent pandas, so group the transpose instead.
keys = d3.columns.str.extract(r'(\d+)', expand=False).to_numpy()
d3.T.groupby(keys).first().T

    1   2
1   B   C
2   B   C
3  Rb  Rb
4   c   C
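
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The helper name `split_and_merge` and its `sep` parameter are made up for illustration, and the sample frame is rebuilt from the question on the assumption that every `nan` there is a real missing value. It uses an explicit loop instead of `a.repeat` to copy the combo columns, which is slower on very wide frames but easier to follow.

```python
import numpy as np
import pandas as pd


def split_and_merge(df: pd.DataFrame, sep: str = "/") -> pd.DataFrame:
    """Copy each combo column (e.g. "1/2") to each of its parts, then merge
    columns that share the same first run of digits, keeping the first
    non-null value in each row. (Hypothetical helper, not a pandas API.)"""
    pieces = []
    for name in df.columns:
        # "1/2" -> ["1", "2"]; a plain name such as "2 c" -> ["2 c"]
        for part in str(name).split(sep):
            pieces.append(df[name].rename(part.strip()))
    wide = pd.concat(pieces, axis=1)  # duplicate column names are fine here

    # Group key: first run of digits in each column name.
    # Columns without any digit get a NaN key and are dropped by groupby.
    keys = wide.columns.str.extract(r"(\d+)", expand=False).to_numpy()

    # Transpose so columns become rows, group, take first non-null, transpose back
    return wide.T.groupby(keys).first().T


# Sample data rebuilt from the question (assumed: "nan" means missing)
df = pd.DataFrame(
    {
        "1": ["B", "B", np.nan, "c"],
        "1/2": [np.nan, np.nan, "Rb", np.nan],
        "2c": ["C", "C", np.nan, np.nan],
        "2 c": [np.nan, np.nan, np.nan, "C"],
    },
    index=pd.Index([1, 2, 3, 4], name="row"),
)

result = split_and_merge(df)
print(result)
```

Note that grouping on digits alone treats `2c` and `2 c` as the same column only because both start with `2`; if your real names differ in other ways, the key extraction is the part to adapt.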
