Replacing dataframe values after removing/replacing characters in rows using Pandas

I have a dataframe df_in like so:

import pandas as pd
import numpy as np
dic_in = {'A':['aa','bb','cc','dd','ee','ff','gg','uu','xx','yy','zz'],
       'B':['200','200','AA200','AA040',np.nan,'500',np.nan,'0700','900','UKK','200'],
       'C':['UNN','400',np.nan,'AA080','AA800','B',np.nan,'400',np.nan,'500','UKK']}
df_in = pd.DataFrame(dic_in)

My goal is to process columns B and C so that:

  • If an item contains the substring 'AA', that part of the string must be removed, leaving only the numeric part ( AA123 ---> 123 ). If zeros are present before the first non-zero digit, they must be removed ( AA001234 ---> 1234 ).
  • If the value is not a number, it must be set to 0.0 ( NaN ---> 0.0 , UNN ---> 0.0 , UKK ---> 0.0 and so on).
  • If an item has leading zeros, they must be removed ( 0700 ---> 700 , 00007000 ---> 7000 ).
  • If an item has been modified by removing 'AA' and is non-zero, it must be multiplied by 100 ( AA200 ---> 20000 ).

The final result should look like this:

   # BEFORE #                     # AFTER #
     A      B      C               A      B      C
0   aa    200    UNN          0   aa    200    0.0
1   bb    200    400          1   bb    200    400
2   cc  AA200    NaN          2   cc  20000    0.0
3   dd  AA040  AA080          3   dd   4000   8000
4   ee    NaN  AA800          4   ee    0.0  80000
5   ff    500      B          5   ff    500    0.0
6   gg    NaN    NaN          6   gg    0.0    0.0
7   uu   0700    400          7   uu    700    400
8   xx    900    NaN          8   xx    900    0.0
9   yy    UKK    500          9   yy    0.0    500
10  zz    200    UKK          10  zz    200    0.0

Do you know a smart and efficient way to achieve this?

Note: all the numbers are actually strings, and they should remain strings.

You can use to_numeric with errors='coerce' to convert the non-numeric values to NaN.

Then extract the digits from the strings, remove leading zeros with lstrip, and append 00 (which multiplies the value by 100).

Finally, use combine_first with fillna and assign the results back to the columns:

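# Plain digit strings become numbers; everything else (AA..., UNN, NaN) becomes NaN
b = pd.to_numeric(df_in.B, errors='coerce')
c = pd.to_numeric(df_in.C, errors='coerce')

# Extract the digits, strip leading zeros, and append '00' (i.e. multiply by 100)
b1 = df_in.B.str.extract(r'(\d+)', expand=False).str.lstrip('0') + '00'
c1 = df_in.C.str.extract(r'(\d+)', expand=False).str.lstrip('0') + '00'

# Prefer the numeric value, fall back to the rebuilt string, then fill what is left with 0
df_in.B = b.combine_first(b1).fillna(0)
df_in.C = c.combine_first(c1).fillna(0)
print(df_in)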
     A      B      C
0   aa    200      0
1   bb    200    400
2   cc  20000      0
3   dd   4000   8000
4   ee      0  80000
5   ff    500      0
6   gg      0      0
7   uu    700    400
8   xx    900      0
9   yy      0    500
10  zz    200      0
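To see what each step contributes, here is a minimal sketch on a small Series. The sample values are taken from column B of the question; the variable names (s, numeric, rebuilt, combined) are only illustrative.

```python
import numpy as np
import pandas as pd

# A few values from column B of the question
s = pd.Series(['200', 'AA040', np.nan, '0700', 'UKK'])

# Step 1: plain digit strings parse to floats; everything else is coerced to NaN
numeric = pd.to_numeric(s, errors='coerce')

# Step 2: pull out the first run of digits, drop leading zeros, append '00'.
# Values without any digits (NaN, 'UKK') stay missing.
rebuilt = s.str.extract(r'(\d+)', expand=False).str.lstrip('0') + '00'

# Step 3: keep the parsed number where there is one, otherwise take the
# rebuilt string, and fill whatever is still missing with 0
combined = numeric.combine_first(rebuilt).fillna(0)
print(combined)
```

Because combine_first gives priority to the parsed numbers, the '00' suffix only takes effect on values that to_numeric could not parse, such as 'AA040'. The result holds a mix of floats and strings, which is what the modified solution below addresses.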

A slightly modified solution: a final fillna with the string '0.0', followed by astype(str), converts all values to strings (this avoids a mix of strings and numeric values):

b = pd.to_numeric(df_in.B, errors='coerce')
c = pd.to_numeric(df_in.C, errors='coerce')

b1 = df_in.B.str.extract(r'(\d+)', expand=False).str.lstrip('0') + '00'
c1 = df_in.C.str.extract(r'(\d+)', expand=False).str.lstrip('0') + '00'

df_in.B = b.combine_first(b1)
df_in.C = c.combine_first(c1)

# Fill the remaining NaN with the string '0.0' and convert every value to str
df_in = df_in.fillna('0.0').astype(str)
print(df_in)
     A      B      C
0   aa  200.0    0.0
1   bb  200.0  400.0
2   cc  20000    0.0
3   dd   4000   8000
4   ee    0.0  80000
5   ff  500.0    0.0
6   gg    0.0    0.0
7   uu  700.0  400.0
8   xx  900.0    0.0
9   yy    0.0  500.0
10  zz  200.0    0.0
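The output above contains '200.0' where the question's expected table has '200', because the parsed values pass through floats. If the exact string form matters, one option is to stay in strings throughout. The following is only a sketch under that assumption, using pandas string methods; clean_column is a hypothetical helper, not part of the solution above, and it assumes pandas 1.1 or later for str.fullmatch. It also assumes an all-zero item such as '000' should map to '0.0'.

```python
import numpy as np
import pandas as pd

def clean_column(col: pd.Series) -> pd.Series:
    """Hypothetical helper: apply the question's rules without leaving str."""
    text = col.fillna("").astype(str)                 # NaN -> '' (not a number)
    has_aa = text.str.contains("AA", regex=False)     # items that carry 'AA'
    digits = text.str.replace("AA", "", regex=False)  # drop the 'AA' part
    is_number = digits.str.fullmatch(r"\d+")          # only digits left?
    stripped = digits.str.lstrip("0")                 # remove leading zeros
    # Appending '00' multiplies by 100, only for items that had 'AA'
    scaled = stripped.where(~has_aa, stripped + "00")
    # Non-numeric items (and all-zero items, by assumption) become '0.0'
    valid = is_number & (stripped != "")
    return scaled.where(valid, "0.0")

df = pd.DataFrame({
    "B": ["200", "AA200", "AA040", np.nan, "0700", "UKK"],
    "C": ["UNN", np.nan, "AA080", "AA800", "400", "500"],
})
for name in ["B", "C"]:
    df[name] = clean_column(df[name])
print(df)
```

Every value stays a string, so no float formatting such as '200.0' can appear.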

Assuming that all the values in your dataframe are strings (including the NaNs; otherwise you can convert them to an appropriate string with fillna), you can use the following converter function with applymap on the two columns you want to convert.

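df = pd.DataFrame(dic_in, dtype=str).fillna('NAN')

# 'AA' items: drop 'AA' and multiply by 100; plain digits: drop leading zeros; anything else: '0.0'
converter = lambda x: str(int(x.replace('AA', ''))*100) if 'AA' in x else str(int(x)) if x.isdigit() else '0.0'

# Note: applymap is deprecated since pandas 2.1 in favour of DataFrame.map
df[['B','C']] = df[['B','C']].applymap(converter)

Contents of df: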

     A      B      C
0   aa    200    0.0
1   bb    200    400
2   cc  20000    0.0
3   dd   4000   8000
4   ee    0.0  80000
5   ff    500    0.0
6   gg    0.0    0.0
7   uu    700    400
8   xx    900    0.0
9   yy    0.0    500
10  zz    200    0.0
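The nested conditional in converter can be hard to read, so here is the same idea as a named function. This is a sketch rather than a drop-in replacement: convert is a hypothetical name, and two small guards are added that the lambda does not have (non-string input such as a real NaN, and an 'AA' item with no digits after it). It is applied with Series.map, column by column, as one way to avoid depending on applymap.

```python
import numpy as np
import pandas as pd

def convert(value):
    """Hypothetical helper mirroring the converter lambda, with extra guards."""
    if not isinstance(value, str):
        return "0.0"                      # real NaN or other non-string input
    if "AA" in value:
        digits = value.replace("AA", "")
        # int() drops leading zeros; * 100 applies the scaling rule
        return str(int(digits) * 100) if digits.isdigit() else "0.0"
    if value.isdigit():
        return str(int(value))            # plain number: only drop leading zeros
    return "0.0"                          # anything else is not a number

df = pd.DataFrame({
    "A": ["cc", "dd", "ee", "uu", "yy"],
    "B": ["AA200", "AA040", np.nan, "0700", "UKK"],
    "C": [np.nan, "AA080", "AA800", "400", "500"],
})
for name in ["B", "C"]:
    df[name] = df[name].map(convert)
print(df)
```

Since convert handles missing values itself, the fillna('NAN') step is not needed in this variant.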
