I'm fairly new to regex and I can't get my code working. I have a set of data stored in a CSV file. Some of the values might be "dirty", i.e. not in the format I expect. Typically, the data looks like this: 123.4 unit
So for example: it can be
So, it is a:
[unknown number of digits] + . + [unknown number of digits] + \s + [unit = a run of letters from a to z with a "/" between them]
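As a rough sketch, this is what the description above becomes when it is translated literally into a regex and checked against a few values that appear in this post. The names STRICT, samples and results are illustrative only, and fullmatch is used here so that the whole cell has to fit the pattern.

```python
import re

# Literal translation of the format described above:
# digits, a dot, digits, one whitespace character, letters "/" letters.
STRICT = re.compile(r"\d+\.\d+\s[a-z]+/[a-z]+")

# Values taken from the examples in this post.
samples = ["1.5 mg/kg", "350 mg/kg", "0.07 %", "2.2%", "0,4 mg/kg"]

# Map each sample to True (accepted) or False (rejected).
results = {s: bool(STRICT.fullmatch(s)) for s in samples}

for value, ok in results.items():
    print(repr(value), "accepted" if ok else "rejected")
```

With this literal translation only 1.5 mg/kg is accepted. A value with no decimal part, with % as its unit, or with no space before the unit is rejected, which would be consistent with well-formed cells being reported as errors.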
My code is the following:
import re

def parse_csv(content, delimiter=';'):  # ";" is used because of the European Excel CSV convention
    csv_data = []
    for line in content.split('\n'):
        csv_data.append([x.strip() for x in line.split(delimiter)])  # also strips surrounding spaces
    return csv_data

# 'r' instead of 'rU': universal newlines are the default in Python 3
Sans_ND = parse_csv(open('Sans_ND.csv', 'r', encoding="ISO-8859-1").read())

for row in Sans_ND:
    for i in range(1, len(row)):
        if re.search(r"\d+\.\d+\s\b[a-z]+/[a-z]+\b", row[i]):  # check the current cell
            continue
        else:
            print("Formating Error", row[i], "in", row[0], "Col=", i)
Since the output is the entire array, and not all of my array is badly formatted, I'm pretty sure my regex translation of what I wanted is wrong. I've also tried replacing [a-z] with \w, but it didn't improve the output.
How can I fix this? What didn't I understand about Regex here?
EDIT: What I mean by "dirty" is something like 0.4-32-0 % or 0,4 mg/kg, for example.
EDIT: With the current code and the one suggested by @sln in the comments, I get for example:
Formating Error 0.1 % en Arachidonic acid col 25
Formating Error 0.07 % en Arachidonic acid col 26
Formating Error 0.07 % en Arachidonic acid col 27
Formating Error 0.08 % en Arachidonic acid col 39
Formating Error 0.08 % en Arachidonic acid col 40
EDIT2: With sln's answer I get the same type of error. Here is some additional output:
Formatting error 350 mg/kg in Angelica root col 2
Formatting error 350 mg/kg in Angelica root col 3
Formatting error 350 mg/kg en Angelica root col 4
EDIT3: Here are some input lines from Sans_ND.csv, for the commenter who requested them (b3000):
Arachidonic acid;Arachidonic Acid;0.07 %;0.08 %;0.07 %;0.06 %
Arginine;;2.2%;2.2%;2.2%;2.2%;1.8%
Beta carotene,Beta-carotene;;1.5 mg/kg;1.5 mg/kg;0.4 mg/kg
Branched-chain amino acids,Branched-chain amino acids;;1.54 %;1.65 %;2%
These lines are only examples. They don't contain any "dirty" values such as the badly formatted examples given above.
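For reference, here is a small sketch of how the sample lines above split into cells, using the standard csv module with ";" as the delimiter instead of a manual split. The names sample_text and rows are illustrative, and the text is read from a string rather than from Sans_ND.csv so that the snippet runs on its own.

```python
import csv
import io

# The sample lines quoted above, as one block of text.
sample_text = (
    "Arachidonic acid;Arachidonic Acid;0.07 %;0.08 %;0.07 %;0.06 %\n"
    "Arginine;;2.2%;2.2%;2.2%;2.2%;1.8%\n"
    "Beta carotene,Beta-carotene;;1.5 mg/kg;1.5 mg/kg;0.4 mg/kg\n"
    "Branched-chain amino acids,Branched-chain amino acids;;1.54 %;1.65 %;2%\n"
)

# csv.reader splits on ";" only, so commas inside a name stay in one cell.
rows = list(csv.reader(io.StringIO(sample_text), delimiter=";"))

for row in rows:
    # Columns 0 and 1 hold names (column 1 may be empty); values start at index 2.
    print(row[0], "->", row[2:])
```

In these lines the first two columns are names and the measured values begin at index 2, so a check that starts at index 1 also tests a name or an empty cell.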
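Thanks to sln, here is the answer. The decimal part and the space before the unit are now optional, % is accepted as a unit, and the loop starts at column 2 because the first two columns hold names rather than values: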
for row in Sans_ND:
    for i in range(2, len(row)):
        # raw string so that \d, \. and \s reach the regex engine unchanged
        if re.match(r"\d+(?:\.\d+)?\s*(?:%|[a-zA-Z]+/[a-zA-Z]+)", row[i]):
            continue
        else:
            print("Formatting error", row[i], "in", row[0], "col", i)
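A quick way to sanity-check that pattern is to run it over the clean and dirty values mentioned in this post. The sketch below is only an illustration: VALUE_RE, is_clean, clean and dirty are made-up names, and "1.5 mg/kg extra" is an invented value added to show the effect of trailing text. It uses re.fullmatch rather than re.match, on the assumption that a cell should contain nothing besides the number and its unit.

```python
import re

# Pattern from the answer above, compiled once.
VALUE_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:%|[a-zA-Z]+/[a-zA-Z]+)")

def is_clean(cell):
    """Return True if the whole cell is a number followed by % or a unit like mg/kg."""
    return VALUE_RE.fullmatch(cell) is not None

# Well-formed values quoted in this post.
clean = ["0.07 %", "2.2%", "2%", "350 mg/kg", "1.5 mg/kg"]
# The two dirty examples from the post, plus one invented value with trailing text.
dirty = ["0.4-32-0 %", "0,4 mg/kg", "1.5 mg/kg extra"]

for cell in clean + dirty:
    print(cell, "->", "ok" if is_clean(cell) else "Formatting error")
```

The difference between the two calls only matters for trailing text: re.match anchors at the start of the string, so it would still accept "1.5 mg/kg extra", while re.fullmatch rejects it.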