There are strings (which are rows of a pandas
data frame):
2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,05 4.7197189.46 09/2020 172,0 5 4.7861901.25 9/2020 8 9,16
2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0 4.7861901.25 10/2020 94,32
What I need is to extract the following information from these lines (comma is decimal separator here):
order_id date sum
2.5807003.49 09/2020 24,54
4.7103181.69 09/2020 172,05
4.7197189.46 09/2020 172,05
4.7861901.25 09/2020 89,16
2.5807003.49 10/2020 35,65
4.7103181.69 10/2020 185,50
4.7197189.46 10/2020 185,50
4.7861901.25 10/2020 94,32
order_id
and they always have the same structure;date
can have one or two digits; 9 /2020
or 9/2020
, 10/2020
or 1 0/2020
;sum
always has two digits after ,
but, unfortunately, may (or not) also contain spaces: 24,54
, 172,0 5
, 185,5 0
, 94,32
. This is how I thought about it: First step would be to split orders. Obviously, the information related to a given order_id
is anchored by ,
-- there are 4 orders and 4 commas in each line. So, find comma, then go forward two digits (regardless of spaces) after it, and finally look back till the beginning of line or the end of previous order. But even here I'm stuck since I don't understand how to catch two digits after comma regardless of spaces (if any).
Input df
vals
0 2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5
1 4.7861901.25 9/2020 8 9,16
2 2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0
3 4.7861901.25 10/2020 94 ,32
Now, as multiple rows in the expected df is combined in a single row in original df, it is better to first convert the whole vals
column to a single string
str1 = "\n".join(df['vals'].values)
str1
2.5807003.49 9/2020 24,54 4.7103181.69 9 /2020 172,0 5 4.7197189.46 09/2020 172,0 5
4.7861901.25 9/2020 8 9,16
2.5807003.49 10/2020 35,65 4.7103181.69 10/2020 185,50 4.7197189.46 1 0/2020 185,5 0
4.7861901.25 10/2020 94 ,32
Now using findall
get all the final records. All the three required columns are in separate capture groups. order_id
is ([\d\.]+)
. As it has no space, it is straight forward. date
is (\d\s?\d?\s?\/\s?(?:\d\s?){3}\d)
where space can be anywhere in the date. sum
is [\d\s]+,\s?\d\s?\d)
which has two digits after the comma.
req_vals = re.findall(r'([\d\.]+)\s*(\d\s?\d?\s?\/\s?(?:\d\s?){3}\d)\s*([\d\s]+,\s?\d\s?\d)',str1)
req_vals
[('2.5807003.49', '9/2020', '24,54'),
('4.7103181.69', '9 /2020', '172,0 5'),
('4.7197189.46', '09/2020', '172,0 5'),
('4.7861901.25', '9/2020', '8 9,16'),
('2.5807003.49', '10/2020', '35,65'),
('4.7103181.69', '10/2020', '185,50'),
('4.7197189.46', '1 0/2020', '185,5 0'),
('4.7861901.25', '10/2020', '94 ,32')]
Lastly, In Output dataframe, space can be removed.
final_df = (pd.DataFrame(req_vals, columns=['order_id', 'date', 'sum'])
.replace(r'\s', '', regex=True))
final_df
order_id date sum
0 2.5807003.49 9/2020 24,54
1 4.7103181.69 9/2020 172,05
2 4.7197189.46 09/2020 172,05
3 4.7861901.25 9/2020 89,16
4 2.5807003.49 10/2020 35,65
5 4.7103181.69 10/2020 185,50
6 4.7197189.46 10/2020 185,50
7 4.7861901.25 10/2020 94,32
A regex that could work for the provided example:
(2\.5807003\.49|4\.7103181\.69|4\.7197189\.46|4\.7861901\.25)\s+([\d\s]+\/\d{4})\s+([\d\s]+,[\d\s]+)(?:\s|$)
Demo: https://regex101.com/r/VLc53D/1/
Or if there can be a space in the year:
(2\.5807003\.49|4\.7103181\.69|4\.7197189\.46|4\.7861901\.25)\s+([\d\s]+\/[\d\s]+)\s+([\d\s]+,[\d\s]+)(?:\s|$)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.