
How to modify a DataFrame to keep only the values between two substrings in a column in Python Pandas?

I have a DataFrame in Python Pandas like the one below:

COL_1 | COL_2               | COL_3
------|---------------------|---------
111   | CV_COUNT_ABC_XM_BF  | CV_SUM_ABC_XM_BF
222   | CV_COUNT_DEF_XM_BF  | CV_SUM_CC_XM_BF
333   | CV_COUNT_CC_XM_BF   | LACK
444   | LACK                | CV_SUM_DEF_XM_BF
...   | ...                 |  ...

And I need to modify the above DataFrame so that COL_2 and COL_3 hold values as follows:

  • if the value in COL_2 or COL_3 is "LACK", keep it

  • if the value is anything other than "LACK", take the substring:

     between "CV_COUNT_" and "_XM_BF" or between "CV_SUM_" and "_XM_BF"

So, as a result I need something like below:

COL_1 | COL_2             | COL_3
------|-------------------|---------
111   | ABC               | ABC
222   | DEF               | CC
333   | CC                | LACK
444   | LACK              | DEF
...   | ...               | ...

We can use np.where along with str.extract here:

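import numpy as np

# expand=False makes str.extract return a Series instead of a one-column DataFrame
df["COL_2"] = np.where(df["COL_2"] == 'LACK', df["COL_2"],
                       df["COL_2"].str.extract(r'CV_COUNT_(\w+?)_XM_BF', expand=False))
df["COL_3"] = np.where(df["COL_3"] == 'LACK', df["COL_3"],
                       df["COL_3"].str.extract(r'CV_SUM_(\w+?)_XM_BF', expand=False))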
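To try the np.where approach end to end, here is a minimal self-contained sketch. It rebuilds the four sample rows from the question; the helper name extract_middle is hypothetical and only used for illustration. It assumes str.extract is called with expand=False so that it returns a Series that lines up with the condition.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "COL_1": [111, 222, 333, 444],
    "COL_2": ["CV_COUNT_ABC_XM_BF", "CV_COUNT_DEF_XM_BF", "CV_COUNT_CC_XM_BF", "LACK"],
    "COL_3": ["CV_SUM_ABC_XM_BF", "CV_SUM_CC_XM_BF", "LACK", "CV_SUM_DEF_XM_BF"],
})


def extract_middle(col: pd.Series, prefix: str) -> np.ndarray:
    """Hypothetical helper: keep 'LACK' as-is, otherwise take the text
    between `prefix` and '_XM_BF'."""
    pattern = rf"{prefix}(\w+?)_XM_BF"
    # expand=False returns a Series (NaN where the pattern does not match).
    extracted = col.str.extract(pattern, expand=False)
    return np.where(col == "LACK", col, extracted)


df["COL_2"] = extract_middle(df["COL_2"], "CV_COUNT_")
df["COL_3"] = extract_middle(df["COL_3"], "CV_SUM_")
print(df)
```

Note that a value that is neither "LACK" nor a match for the pattern would come out as NaN with this approach.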

You can use replace:

df = df.replace({'COL_2': {r'CV_COUNT_(\w+)_XM_BF': r'\1'},
                 'COL_3': {r'CV_SUM_(\w+)_XM_BF': r'\1'}}, regex=True)
print(df)

# Output
   COL_1 COL_2 COL_3
0    111   ABC   ABC
1    222   DEF    CC
2    333    CC  LACK
3    444  LACK   DEF

A shorter version: df.replace({r'CV_(?:COUNT|SUM)_(\w+)_XM_BF': r'\1'}, regex=True)
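The one-pattern version is applied to every column of the frame. If the DataFrame has other text columns that should stay untouched, one option is to restrict the replacement to the two target columns. The sketch below assumes the sample data from the question; the variable name cols and the ^ and $ anchors are additions for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "COL_1": [111, 222, 333, 444],
    "COL_2": ["CV_COUNT_ABC_XM_BF", "CV_COUNT_DEF_XM_BF", "CV_COUNT_CC_XM_BF", "LACK"],
    "COL_3": ["CV_SUM_ABC_XM_BF", "CV_SUM_CC_XM_BF", "LACK", "CV_SUM_DEF_XM_BF"],
})

# Only these columns are rewritten (assumed column selection).
cols = ["COL_2", "COL_3"]

# One regex covers both prefixes; \1 keeps only the captured middle part.
# Values that do not match (such as "LACK") are left unchanged.
df[cols] = df[cols].replace(
    {r"^CV_(?:COUNT|SUM)_(\w+)_XM_BF$": r"\1"}, regex=True
)
print(df)
```

Unlike the np.where variant, values that do not match the pattern are kept as they are rather than becoming NaN.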


 