
Advanced string split - How to separate a product order string from its price?

I have a CSV file with a field called 'basket_items', which is a single string of items separated by commas. Each item consists of the product name, the flavour (if applicable), and the price.

Example of first two rows of CSV file:

timestamp,store,customer_name,basket_items,total_price,cash_or_card
06/06/2022 09:00,Chesterfield,Stephanie Neyhart,"Large Flat white - 2.45, Large Flavoured iced latte - Vanilla - 3.25, Large Flavoured iced latte - Hazelnut - 3.25",8.95,CASH
06/06/2022 09:02,Chesterfield,Donna Marley,"Large Flavoured iced latte - Hazelnut - 3.25, Regular Latte - 2.15, Large Flavoured iced latte - Vanilla - 3.25",8.65,CARD

A single row of the basket_items field would be:

Large Flavoured iced latte - Hazelnut - 3.25, Regular Latte - 2.15, Large Flavoured iced latte - Vanilla - 3.25

I want to iterate through each row in this CSV file and obtain the product names and the prices separately, and then later match each product name to its price. I am struggling to figure out how to do this.

Maybe I could store it in a dictionary, or as a list of products; I'm really not sure how to do it. I tried to mess around with:

import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed
df = pd.read_csv("team1-project/example_transactions.csv")

#Drop null values
df = df.dropna()

basket_items_list = []

for row in df.basket_items:
    order = row.split(',')
    basket_items_list.append(order)

But this took me further from, rather than closer to, what I'm trying to do. I would appreciate any help. Thank you.

What about using a regex?

regex = r'(?P<designation>(?!\s)[^,]*[^\s,]+)\s*-\s*(?P<price>\d+(?:\.\d+)?)'
df['basket_items'].str.extractall(regex)

output:

                                   designation price
  match                                             
0 0                           Large Flat white  2.45
  1       Large Flavoured iced latte - Vanilla  3.25
  2      Large Flavoured iced latte - Hazelnut  3.25
1 0      Large Flavoured iced latte - Hazelnut  3.25
  1                              Regular Latte  2.15
  2       Large Flavoured iced latte - Vanilla  3.25
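Note that extractall returns the captured groups as strings, and level 0 of the resulting index is the label of the original row. A minimal sketch of how that could be used to get numeric prices and tie each item back to its order follows; the inline DataFrame is just a stand-in for the two sample rows in the question, and the `items` and `totals` names are made up for illustration:

```python
import pandas as pd

# Stand-in for the CSV: the two sample rows from the question,
# reduced to the columns needed here.
df = pd.DataFrame({
    "customer_name": ["Stephanie Neyhart", "Donna Marley"],
    "basket_items": [
        "Large Flat white - 2.45, Large Flavoured iced latte - Vanilla - 3.25, "
        "Large Flavoured iced latte - Hazelnut - 3.25",
        "Large Flavoured iced latte - Hazelnut - 3.25, Regular Latte - 2.15, "
        "Large Flavoured iced latte - Vanilla - 3.25",
    ],
})

regex = r'(?P<designation>(?!\s)[^,]*[^\s,]+)\s*-\s*(?P<price>\d+(?:\.\d+)?)'
items = df["basket_items"].str.extractall(regex)

# The captured prices are strings; convert them before doing arithmetic.
items["price"] = items["price"].astype(float)

# Move the per-row match counter into a column, then join on the remaining
# index (the original row label) to attach row-level columns.
items = items.reset_index(level="match").join(df[["customer_name"]])

# Sum the extracted prices per original row.
totals = items.groupby(level=0)["price"].sum().round(2)
print(items)
print(totals)
```

With the sample rows, the per-row sums should agree with the total_price column of the CSV, which is a quick sanity check that no item was missed by the pattern.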

For the unique values:

regex = r'(?P<designation>(?!\s)[^,]*[^\s,]+)\s*-\s*(?P<price>\d+(?:\.\d+)?)'
(df['basket_items'].str.extractall(regex)
 .drop_duplicates(['designation'])
 .reset_index(drop=True)
)

output:

                             designation price
0                       Large Flat white  2.45
1   Large Flavoured iced latte - Vanilla  3.25
2  Large Flavoured iced latte - Hazelnut  3.25
3                          Regular Latte  2.15
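Since the question mentions a dictionary, the unique values could also be turned into a product-to-price lookup. This is only a sketch under a couple of assumptions: the `baskets` Series stands in for df['basket_items'], `price_by_product` is an illustrative name, and each product is assumed to have one price (drop_duplicates keeps the first one seen):

```python
import pandas as pd

# Stand-in for df['basket_items']: the two sample basket strings from the question.
baskets = pd.Series([
    "Large Flat white - 2.45, Large Flavoured iced latte - Vanilla - 3.25, "
    "Large Flavoured iced latte - Hazelnut - 3.25",
    "Large Flavoured iced latte - Hazelnut - 3.25, Regular Latte - 2.15, "
    "Large Flavoured iced latte - Vanilla - 3.25",
])

regex = r'(?P<designation>(?!\s)[^,]*[^\s,]+)\s*-\s*(?P<price>\d+(?:\.\d+)?)'
unique_items = baskets.str.extractall(regex).drop_duplicates(["designation"])

# Index by product name, convert the price strings to floats, and export as a dict.
price_by_product = (
    unique_items.set_index("designation")["price"].astype(float).to_dict()
)
print(price_by_product["Regular Latte"])
```

If the same product can appear with different prices across rows, a dictionary like this would silently keep only the first price, so it may be worth checking for that before relying on it.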

