Cleaning and Manipulating a column using pandas

I have the following column in my dataset; the data arrives as-is from my data source:

Salary
~£2000
~£2000.15 per week
~£2000.50 per month
~£2000 - ~£5000 range
100000INR
INR

Now I want to create a new column that should look like this:

Salary_clean
2000
104007.8
24006
3500
964
0

The following logic applies (all salaries should be annual once cleaned):

  1. When the column has a standalone number, the salary is already annual and requires no action
  2. When the salary has "per week" written next to it, multiply that salary by 52
  3. When the salary has "per month" written next to it, multiply that salary by 12
  4. When the salary has "x - y range" written next to it, take the midpoint of the range (the median of the two endpoints) as the salary
  5. When the salary has a currency code such as INR written next to it, convert the salary to GBP (pounds) using the current conversion rate for that currency
  6. When the salary is just a currency code like "XXX" with no number, set the salary to 0

How can I achieve this?

Disclaimer: this code can be dangerous, because it passes strings to eval without any validation, so only use it on data you trust. It is also far from optimized, but it has the advantage of being compact.

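import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Salary': ['~£2000', '~£2000.15 per week', '~£2000.50 per month',
                              '~£2000 - ~£5000 range', '100000INR', 'INR']})

# The patterns are applied one after another; each salary becomes an arithmetic expression
d = {r'~[^\d]+': r'',                         # strip the "~£" prefix
     r'per week': r'* 52',                    # weekly -> annual
     r'per month': r'* 12',                   # monthly -> annual
     r'(.*) - (.*) range': r'(\1 + \2) / 2',  # midpoint of the range
     r'(\d)INR': r'\1 * 0.0096',              # INR -> GBP, keeping the last digit
     r'^[^\W\d]*$': r'0'}                     # bare currency code -> 0

df['Salary_clean'] = df['Salary'].replace(d, regex=True).apply(eval)
>>> df
                  Salary  Salary_clean
0                 ~£2000        2000.0
1     ~£2000.15 per week      104007.8
2    ~£2000.50 per month       24006.0
3  ~£2000 - ~£5000 range        3500.0
4              100000INR         960.0
5                    INR           0.0

Result of the replace method, before eval is applied:

>>> df['Salary'].replace(d, regex=True)

0                 2000
1         2000.15 * 52
2         2000.50 * 12
3    (2000 + 5000) / 2
4      100000 * 0.0096
5                    0
Name: Salary, dtype: object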
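
If eval is a concern, the same six rules can probably be written as a plain function and applied row by row. The following is only a rough sketch, not a tested drop-in: the function name clean_salary and the RATES_TO_GBP table are made up for illustration, and the 0.0096 rate is simply the one used above, not a current exchange rate.

```python
import re

# Hypothetical rate table; 0.0096 is the INR -> GBP figure used in the answer above.
RATES_TO_GBP = {"INR": 0.0096}

NUMBER = re.compile(r"\d+(?:\.\d+)?")
# A three-letter currency code directly after a number, at the end of the string
TRAILING_CODE = re.compile(r"\d\s*([A-Z]{3})$")


def clean_salary(text):
    """Convert one raw salary string to an annual amount in GBP."""
    text = text.strip()
    numbers = [float(n) for n in NUMBER.findall(text)]
    if not numbers:
        return 0.0  # rule 6: no number at all, e.g. a bare "INR"

    if "range" in text:
        amount = sum(numbers) / len(numbers)  # rule 4: midpoint of the endpoints
    else:
        amount = numbers[0]  # rule 1: standalone number

    if "per week" in text:
        amount *= 52  # rule 2
    elif "per month" in text:
        amount *= 12  # rule 3

    code = TRAILING_CODE.search(text)
    if code:
        # rule 5: raises KeyError for a currency missing from the table
        amount *= RATES_TO_GBP[code.group(1)]
    return amount
```

With the DataFrame from the question, df['Salary_clean'] = df['Salary'].apply(clean_salary) should give the same column as the eval version. An unknown currency code fails loudly with a KeyError instead of silently producing a wrong salary, which seemed the safer default here.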

