CSV cleaning with Regex

I'm fairly new to regex and can't get my code working. I have a set of data stored in a CSV file. Some of the values might be "dirty", i.e. not in the format I expect. Typically, a value looks like this: 123.4 unit

For example, it can be:

  • 0.4 %
  • 1234.45 kcal/kg
  • 23.245 UI/kg

So a value is:

[unknown number of digits] + . + [unknown number of digits] + \s + [unit = a run of letters from a to z with a "/" between them]

My code is the following:

def parse_csv(content, delimiter=';'):  # ";" is the delimiter used by European Excel CSV exports
  csv_data = []
  for line in content.split('\n'):
    csv_data.append( [x.strip() for x in line.split( delimiter )] ) # strips spaces also
  return csv_data


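import re

# 'rU' mode is gone in recent Python 3 releases; plain 'r' already handles universal newlines
with open('Sans_ND.csv', 'r', encoding="ISO-8859-1") as f:
    Sans_ND = parse_csv(f.read())

for row in Sans_ND:
    for i in range(1, len(row)):
        # test the current cell (row[i]) against the expected "number unit" format
        if re.search(r"\d+\.\d+\s\b[a-z]+/[a-z]+\b", row[i]):
            continue
        else:
            print("Formating Error", row[i], "in", row[0], "Col=", i)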

Since the output is the entire array, and not all of my data is badly formatted, I'm fairly sure my regex doesn't express what I intended. I also tried replacing [a-z] with \w, but that didn't improve the output.

How can I fix this? What didn't I understand about Regex here?
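
One way to see where the pattern goes wrong is to run it against a few of the values quoted in this question. The sketch below is only a diagnostic: the sample list is assembled by hand from the examples and error output shown here, not read from the real file. It suggests three separate problems: the pattern makes the decimal part mandatory (so 350 mg/kg fails), it only accepts a letters/letters unit (so % fails), and [a-z] is lowercase only (so UI/kg fails).

```python
import re

# Pattern from the question: needs a decimal point, one whitespace character,
# and a lowercase "letters/letters" unit.
original = re.compile(r"\d+\.\d+\s\b[a-z]+/[a-z]+\b")

# Values copied by hand from the question (assumed representative, not exhaustive).
samples = ["0.4 %", "1234.45 kcal/kg", "23.245 UI/kg", "350 mg/kg", "2.2%"]

results = {s: bool(original.search(s)) for s in samples}
for value, ok in results.items():
    print(f"{value!r}: {'match' if ok else 'no match'}")
```

Of these five values, only 1234.45 kcal/kg matches, which is consistent with nearly every cell being reported as an error.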

EDIT: What I mean by "dirty" is something like 0.4-32-0 % or 0,4 mg/kg, for example.

EDIT: With the current code and the one suggested by @sln in the comments, I get for example:

 Formating Error 0.1 % en Arachidonic acid  col 25
 Formating Error 0.07 % en Arachidonic acid col 26
 Formating Error 0.07 % en Arachidonic acid  col 27
 Formating Error 0.08 % en Arachidonic acid  col 39
 Formating Error 0.08 % en Arachidonic acid  col 40

EDIT2: With sln's answer I get the same type of error. Here is some additional output:

Formatting error 350 mg/kg in Angelica root col 2
Formatting error 350 mg/kg in Angelica root  col 3
Formatting error 350 mg/kg en Angelica root col 4

EDIT3: Here are some rows from Sans_ND.csv, for the commenter who requested them (b3000):

Arachidonic acid;Arachidonic Acid;0.07 %;0.08 %;0.07 %;0.06 %
Arginine;;2.2%;2.2%;2.2%;2.2%;1.8%
Beta carotene,Beta-carotene;;1.5 mg/kg;1.5 mg/kg;0.4 mg/kg
Branched-chain amino acids,Branched-chain amino acids;;1.54 %;1.65 %;2%

These are just examples; they don't contain any "dirty" values like the badly formatted ones given earlier.

Thanks to sln, here is the answer:

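for row in Sans_ND:
    for i in range(2, len(row)):
        # raw string so that \d, \. and \s reach the regex engine unchanged;
        # the decimal part is optional and the unit is either % or letters/letters
        if re.match(r"\d+(?:\.\d+)?\s*(?:%|[a-zA-Z]+/[a-zA-Z]+)", row[i]):
            continue
        else:
            print("Formatting error", row[i], "in", row[0], "col", i)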
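
For anyone reusing this, here is a slightly stricter sketch of the same check. It keeps sln's pattern unchanged but swaps re.match for fullmatch, since match only anchors the start of the string and would accept a cell with trailing junk after the unit. The names VALUE_RE and find_format_errors and the sample rows are made up for illustration (the rows are built from values quoted in the question), so treat this as a starting point rather than the accepted solution.

```python
import re

# Same pattern as in the answer; VALUE_RE is a hypothetical name.
VALUE_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:%|[a-zA-Z]+/[a-zA-Z]+)")

def find_format_errors(rows, first_col=2):
    """Return (row_name, column_index, value) for every cell that fails the pattern."""
    errors = []
    for row in rows:
        for i in range(first_col, len(row)):
            # fullmatch requires the whole cell to fit the pattern,
            # so trailing characters after the unit are rejected too
            if not VALUE_RE.fullmatch(row[i]):
                errors.append((row[0], i, row[i]))
    return errors

# Illustrative rows only, not the contents of the real Sans_ND.csv.
rows = [
    ["Arachidonic acid", "Arachidonic Acid", "0.07 %", "0.08 %"],
    ["Angelica root", "", "350 mg/kg", "0,4 mg/kg"],
    ["Arginine", "", "2.2%", "0.4-32-0 %"],
]

errors = find_format_errors(rows)
for name, col, value in errors:
    print("Formatting error", value, "in", name, "col", col)
```

With these rows, only the two "dirty" cells (0,4 mg/kg and 0.4-32-0 %) are reported. Note that an empty cell would also be flagged, so skip empty strings first if blanks are legitimate in your data.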


 