简体   繁体   中英

Python Pandas Extracting specific data from column

I have one column as an object contains multiple data separated by ( | ) I would like to extract only the customer order number which is start with ( 44 ) sometimes the order number in the beginning, sometimes in the middle, sometimes in the end And sometimes is duplicated

44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 

87186978110202440 | 44019119315 | 87186978110202440 | 44019119315

87186978110326832 | 44019453624 | 87186978110326832 | 44019453624

44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|

44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 

Wishing result

44019541285

44019119315

44019453624

44019406029
 
44019480564
import pandas as pd
df = pd.DataFrame({
    'order_number':[
      '44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032',
      '87186978110202440 | 44019119315 | 87186978110202440 | 44019119315',
      '87186978110326832 | 44019453624 | 87186978110326832 | 44019453624',
      '44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|',
      '44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498'
    ]
})

def extract_customer_order(order_number):
  order_number = order_number.replace(' ','') # remove all space to make it easy to process e.g. '44019541285_P_002 | 0317209757 ' -> '44019541285_P_002|0317209757'
  order_number_list = order_number.split('|') # split the string at every | to multiple string in list '44019541285_P_002|0317209757' -> ['44019541285_P_002', '0317209757']
  result = []
  for order in order_number_list:
    if order.startswith('44'): # select only order number starting with '44'
      if order not in result: # to prevent duplicate order number
        result += [order]
  # if you want the result as string separated by '|', uncomment line below
  # result = '|'.join(result)
  return result


df['customer_order'] = df['order_number'].apply(extract_customer_order)

My code:

import pandas as pd
from io import StringIO

data = '''
Order_Numbers
44019541285_P_002 | 0317209757 | 87186978110350851 | 387186978103840544 | 0652569032
87186978110202440 | 44019119315 | 87186978110202440 | 44019119315
87186978110326832 | 44019453624 | 87186978110326832 | 44019453624
44019406029 | 0317196878 | 87186978110313085 | 387186978120481881|
44019480564 | 0317202711 | 87186978110335810 | 387186978103844160 | 630552498
'''

df = pd.read_csv(StringIO(data.replace(' ','')))
df
'''
                                       Order_Numbers
0  44019541285_P_002|0317209757|87186978110350851...
1  87186978110202440|44019119315|8718697811020244...
2  87186978110326832|44019453624|8718697811032683...
3  44019406029|0317196878|87186978110313085|38718...
4  44019480564|0317202711|87186978110335810|38718...
'''

Final code:

(
    df.Order_Numbers.str.split('|', expand=True)
    .astype(str)
    .where(lambda x: x.applymap(lambda y: y[:2] == '44'))
    .bfill(axis=1)
    [0]
    .str.split('_').str.get(0)
)
0    44019541285
1    44019119315
2    44019453624
3    44019406029
4    44019480564
Name: 0, dtype: object

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM