This is for a data mining task where we are automating the scoring of extraction quality. There is a gold-standard CSV whose fields might look like this:
golden_standard.csv
| id | description | amount | date |
|----|-------------------------|---------|------------|
| 1 | Some description. | $150.54 | 12/12/2012 |
| 2 | Some other description. | $200 | 10/10/2015 |
| 3 | Other description. | $25 | 11/11/2014 |
| 4 | My description | $11.35 | 01/01/2015 |
| 5 | Your description. | $20 | 03/03/2013 |
and then there are three possible extraction result files:
extract1.csv
| id | description | date |
|----|-------------------------|------------|
| 1 | Some description. | 12/12/2012 |
| 2 | Some other description. | 10/10/2015 |
| 3 | Other description. | 11/11/2014 |
| 4 | 122333222233332221 | 11/11/2014 |
| 5 | Your description. | 03/03/2013 |
extract2.csv
| id | description | amount | date |
|----|-------------------------|---------|------------|
| 1 | Some description. | $150.54 | 12/12/2012 |
| 2 | Some other description. | $200 | 10/10/2015 |
| - | ----------------------- | ----- | ---------- |
| 5 | Your description. | $20 | 03/03/2013 |
extract3.csv
| Garbage | More Garbage |
| Garbage | More Garbage |
I would like my program to report that extract 1 is missing a column and that the values in column 2 are not properly matched.
In the second case, entries are missing and some rows are entirely mismatched.
In the last case, the resulting CSV is completely garbled, but I still want the program to detect some meaningful aberration.
Does anyone have a quick and clever way to do this kind of comparison in Python?
I have my regular, longish row-by-row and column-by-column iterative approach that I could post here, but I suspect there is a quicker, more elegant Pythonic way to do this.
Any help is greatly appreciated.
Disclaimer: My approach uses the pandas library.
First, data set-up.
gold_std.csv
id,description,amount,date
1,Some description.,$150.54,12/12/2012
2,Some other description.,$200,10/10/2015
3,Other description.,$25,11/11/2014
4,My description,$11.35,01/01/2015
5,Your description.,$20,03/03/2013
extract1.csv
id,description,date
1,Some description.,12/12/2012
2,Some other description.,10/10/2015
3,Other description.,11/11/2014
4,122333222233332221,11/11/2014
5,Your description.,03/03/2013
extract2.csv
id,description,amount,date
1,Some description.,$150.54,12/12/2012
2,Some other description.,$200,10/10/2015
3,Other description.,$25,11/11/2014
5,Your description.,$20,03/03/2013
Second, code.
import pandas as pd


def compare_extract(extract_name, reference='gold_std.csv'):
    gold = pd.read_csv(reference)
    ext = pd.read_csv(extract_name)

    gc = set(gold.columns)
    extc = set(ext.columns)

    if gc != extc:
        missing = ", ".join(sorted(gc - extc))
        print("Extract has the following missing columns: {}".format(missing))
    else:
        print("Extract has the same columns as standard. Checking for aberrant rows...")
        gold_list = gold.values.tolist()
        # Put the extract's columns in the standard's order so rows compare positionally.
        ext_list = ext[gold.columns].values.tolist()
        # Somewhat non-pandaic approach: the IDs may not line up, so we rely
        # on set operations instead. A bit hackish, actually.
        diff = list(set(map(tuple, gold_list)) - set(map(tuple, ext_list)))
        df = pd.DataFrame(diff, columns=gold.columns)
        print("The following rows are not in the extract: ")
        print(df)

Third, test runs.

e1 = 'extract1.csv'
compare_extract(e1)
# Extract has the following missing columns: amount

e2 = 'extract2.csv'
compare_extract(e2)
# Extract has the same columns as standard. Checking for aberrant rows...
# The following rows are not in the extract:
#    id     description  amount        date
# 0   4  My description  $11.35  01/01/2015
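The function above stops as soon as a column is missing, so it never reports the mismatched values in extract 1. If the id column can be trusted as a key, one possible variation is an outer merge on id that compares whatever columns the two files still share. This is only a sketch: `compare_by_id` and `load` are names I made up for illustration, the CSV text is inlined so the snippet runs without files, and every cell is read as a string so that amounts and dates are compared literally.

```python
import io

import pandas as pd

GOLD_CSV = """id,description,amount,date
1,Some description.,$150.54,12/12/2012
2,Some other description.,$200,10/10/2015
3,Other description.,$25,11/11/2014
4,My description,$11.35,01/01/2015
5,Your description.,$20,03/03/2013
"""

EXTRACT1_CSV = """id,description,date
1,Some description.,12/12/2012
2,Some other description.,10/10/2015
3,Other description.,11/11/2014
4,122333222233332221,11/11/2014
5,Your description.,03/03/2013
"""

EXTRACT2_CSV = """id,description,amount,date
1,Some description.,$150.54,12/12/2012
2,Some other description.,$200,10/10/2015
3,Other description.,$25,11/11/2014
5,Your description.,$20,03/03/2013
"""


def load(text):
    # Hypothetical helper: read every cell as a string, keep blanks as "".
    return pd.read_csv(io.StringIO(text), dtype=str, keep_default_na=False)


def compare_by_id(gold, ext, key="id"):
    """Return (missing_columns, missing_ids, mismatches).

    Assumes `key` exists in both frames; a garbled file without it
    will raise a KeyError in the merge.
    """
    missing_columns = [c for c in gold.columns if c not in ext.columns]
    shared = [c for c in gold.columns if c in ext.columns and c != key]

    # Outer join on the key; `_merge` records which side each row came from.
    merged = gold.merge(ext, on=key, how="outer",
                        suffixes=("_gold", "_ext"), indicator=True)
    missing_ids = merged.loc[merged["_merge"] == "left_only", key].tolist()

    # For rows present on both sides, collect every cell that differs.
    both = merged[merged["_merge"] == "both"]
    mismatches = []
    for col in shared:
        bad = both[both[col + "_gold"] != both[col + "_ext"]]
        for _, row in bad.iterrows():
            mismatches.append(
                (row[key], col, row[col + "_gold"], row[col + "_ext"]))
    return missing_columns, missing_ids, mismatches


for name, text in [("extract1", EXTRACT1_CSV), ("extract2", EXTRACT2_CSV)]:
    cols, ids, cells = compare_by_id(load(GOLD_CSV), load(text))
    print(name, "missing columns:", cols)
    print(name, "missing ids:", ids)
    print(name, "mismatched cells:", cells)
```

On the sample data this should flag the missing amount column plus the description and date of id 4 for extract 1, and the missing id 4 for extract 2. Rows that appear only in the extract are not reported here; they show up in the merge as `right_only` if you need them.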
Finally, the last extract is a bit arbitrary. I think for that one you're better off writing a non-pandas algorithm.
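As a rough idea of what such a non-pandas check could look like, here is a sketch using only the standard library `csv` module. It looks at the header and the field counts before any row-level comparison is attempted. The function name `sanity_check`, the message wording, and the comma-separated form of the garbage file are all my assumptions, since the question only shows that file as a table.

```python
import csv
import io

# Column names taken from the gold standard file.
EXPECTED_HEADER = ["id", "description", "amount", "date"]


def sanity_check(lines, expected=EXPECTED_HEADER):
    """Return a list of human-readable problems found in a CSV stream."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    header = [h.strip() for h in header]
    if not set(header) & set(expected):
        problems.append(
            "header shares no columns with the standard: {}".format(header))

    # Count non-blank data rows whose field count differs from the header.
    ragged = sum(1 for row in reader if row and len(row) != len(header))
    if ragged:
        problems.append(
            "{} row(s) have a different field count than the header".format(ragged))
    return problems


# Assumed CSV form of the garbage extract from the question.
garbage = io.StringIO("Garbage,More Garbage\nGarbage,More Garbage\n")
for problem in sanity_check(garbage):
    print(problem)
```

If this returns an empty list, the file is at least shaped like the standard and can be handed on to the pandas comparison. Otherwise the messages themselves are the aberration report.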