Data validation with pandas_schema
I would like to read a CSV data file using the Python pandas library and create visualizations.
First, I decided to validate the data.
I want to use the pandas_schema module to validate the data in each column.
The initial data file has 26 columns.
My code:
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation, CanConvertValidation, MatchesPatternValidation, InRangeValidation, InListValidation
schema = Schema ([
Column('Symboling', [InRangeValidation(-3,3)] ) , #integer from -3 to 3
Column('Normalized Loss', [InRangeValidation(65,256)] ) , # integer from 65 to 256
Column('Make',[LeadingWhitespaceValidation(), TrailingWhitespaceValidation()] ) , # text
Column('Fuel Type', [InListValidation(['diesel', 'gas'])]), # diesel, gas
Column('Aspiration'), # text
Column('Num of Doors' , [InListValidation(['two', 'four'])]), # text (two, four)
Column('Body Style' , [InListValidation(['hardtop', 'wagon','sedan','hatchback', 'convertible'])] ), # text: hardtop, wagon, sedan, hatchback, convertible
Column('Drive Wheels' , [InListValidation(['4wd', 'fwd' , 'rwd'])]), # text: 4wd, fwd, rwd
Column('Engine Location' , [InListValidation(['front', 'rear'])]), # text: front, rear
Column('Wheel Base' , [InRangeValidation([86.6,120.9])] ) , # decimal from 86.6 to 120.9
Column('Length' , [InRangeValidation(65,256)] ) , # decimal from 141.1 to 208.1
Column('Width' , [InRangeValidation(60.3,72.3)] ) , # decimal from 60.3 to 72.3
Column('Height' , [InRangeValidation(47.8,59.8)] ) , # decimal from 47.8 to 59.8
Column('Curb Weight' , [InRangeValidation(1488,4066)] ) , # integer from 1488 to 4066
Column('Engine Type'),[InListValidation(['ohc', 'dohcv', 'l', 'ohc', 'ohcf', 'ohcv', 'rotor'])] , # text
Column('Num of Cylinders' , [InListValidation(['two','four','three','five','six','eight','twelve'])]) , # text: eight, five, four, six, three, twelve, two
Column('Engine Size' , [InRangeValidation(61,326)]) , # integer from 61 to 326
Column('Fuel System' , [InListValidation(['1bbl', '2bbl', '4bbl', 'idi','mfi','mpfi','spdi','spfi'])]), #string: 1bbl, 2bbl, 4bbl, idi,mfi,mpfi,spdi,spfi
Column('Bore' , [InRangeValidation(2.54,3.94)] ) , # decimal from 2.54 to 3.94
Column('Stroke', [InRangeValidation(2.07,4.17)] ) , #decimal from 2.07 to 4.17
Column('Compression Ratio' , [InRangeValidation(7,23)] ), # integer: from 7 to 23
Column('Horsepower' , [InRangeValidation(48,288)] ), # integer:from 48 to 288
Column('Peak rmp'), [InRangeValidation(4150,6600)] , # integer: from 4150 to 6600
Column('City mpg'), [InRangeValidation(13,49)] , #integer: from 13 to 49
Column('Highway mpg'), [InRangeValidation(16,54)] , # integer: 16 to 54
Column('Price'), [InRangeValidation(5118,45400)] # integer from 5118 to 45400
])
# Raw string so the backslashes in the Windows path are not treated as escape sequences
test_file = pd.read_csv(r'E:\_Python_Projects_Data\Data_Visualization\Autos_Data_Set\Autos_Import_1985.csv')
errors = schema.validate(test_file)
for error in errors:
    print(error)
Running it reports this error:
Invalid number of columns. The schema specifies 31, but the data frame has 26
For the record: you have a few typos in your Schema definition. Five of the Column(...) calls (Engine Type, Peak rmp, City mpg, Highway mpg and Price) close their parentheses too early, so the validation list lands outside the Column call and becomes a separate element of the outer list. That gives a list with 26 + 5 = 31 elements, all of which are interpreted as columns.
The correct definition should be:
schema = Schema([
Column('Symboling', [InRangeValidation(-3,3)]), #integer from -3 to 3
Column('Normalized Loss', [InRangeValidation(65,256)]), # integer from 65 to 256
Column('Make', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()] ), # text
Column('Fuel Type', [InListValidation(['diesel', 'gas'])]), # diesel, gas
Column('Aspiration'), # text
Column('Num of Doors', [InListValidation(['two', 'four'])]), # text (two, four)
Column('Body Style', [InListValidation(['hardtop', 'wagon','sedan','hatchback', 'convertible'])]), # text: hardtop, wagon, sedan, hatchback, convertible
Column('Drive Wheels', [InListValidation(['4wd', 'fwd', 'rwd'])]), # text: 4wd, fwd, rwd
Column('Engine Location', [InListValidation(['front', 'rear'])]), # text: front, rear
Column('Wheel Base', [InRangeValidation(86.6,120.9)]), # decimal from 86.6 to 120.9
Column('Length', [InRangeValidation(141.1,208.1)]), # decimal from 141.1 to 208.1
Column('Width', [InRangeValidation(60.3,72.3)]), # decimal from 60.3 to 72.3
Column('Height', [InRangeValidation(47.8,59.8)]), # decimal from 47.8 to 59.8
Column('Curb Weight', [InRangeValidation(1488,4066)]), # integer from 1488 to 4066
Column('Engine Type', [InListValidation(['dohc', 'dohcv', 'l', 'ohc', 'ohcf', 'ohcv', 'rotor'])]), # text; assumes the duplicated 'ohc' was meant to be 'dohc'
Column('Num of Cylinders', [InListValidation(['two','four','three','five','six','eight','twelve'])]), # text: eight, five, four, six, three, twelve, two
Column('Engine Size', [InRangeValidation(61,326)]), # integer from 61 to 326
Column('Fuel System', [InListValidation(['1bbl', '2bbl', '4bbl', 'idi','mfi','mpfi','spdi','spfi'])]), #string: 1bbl, 2bbl, 4bbl, idi,mfi,mpfi,spdi,spfi
Column('Bore', [InRangeValidation(2.54,3.94)]), # decimal from 2.54 to 3.94
Column('Stroke', [InRangeValidation(2.07,4.17)]), #decimal from 2.07 to 4.17
Column('Compression Ratio', [InRangeValidation(7,23)]), # integer: from 7 to 23
Column('Horsepower', [InRangeValidation(48,288)]), # integer:from 48 to 288
Column('Peak rmp', [InRangeValidation(4150,6600)]), # integer: from 4150 to 6600
Column('City mpg', [InRangeValidation(13,49)]), #integer: from 13 to 49
Column('Highway mpg', [InRangeValidation(16,54)]), # integer: 16 to 54
Column('Price', [InRangeValidation(5118,45400)]), # integer from 5118 to 45400
])
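To see why the misplaced parenthesis changes the column count, here is a minimal sketch in plain Python. The Column class below is a hypothetical stand-in written for this illustration, not the pandas_schema class, and the validation lists are placeholder strings. It only shows how Python counts the list elements.

```python
class Column:
    """Stand-in for pandas_schema.Column, used only to count list elements."""

    def __init__(self, name, validations=None):
        self.name = name
        self.validations = validations or []


# Parenthesis closed too early: each validation list becomes its own element.
broken = [
    Column('Peak rmp'), ['placeholder validation'],
    Column('Price'), ['placeholder validation'],
]

# Validation list passed as the second argument to Column.
fixed = [
    Column('Peak rmp', ['placeholder validation']),
    Column('Price', ['placeholder validation']),
]

print(len(broken))  # 4
print(len(fixed))   # 2
```

Every misplaced parenthesis adds one extra element, which is how 26 intended columns turned into 31.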
The number of columns in the dataframe must match the number of columns in the validation schema. An alternative is to define a new dataframe that contains only the columns you want to check and to validate that one instead.
(I'm not sure this is the most efficient way, but it serves the purpose.)
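A minimal sketch of that approach, using pandas only. The helper name select_columns and the sample rows are made up for this illustration; with the real data you would pass the frame loaded from the CSV and the column names used in your schema.

```python
import pandas as pd


def select_columns(df, wanted):
    """Return a copy of df restricted to `wanted`, failing loudly on missing names."""
    missing = [name for name in wanted if name not in df.columns]
    if missing:
        raise KeyError(f"columns not found in data frame: {missing}")
    return df[wanted].copy()


# Tiny made-up frame standing in for the real CSV.
df = pd.DataFrame({
    'Make': ['audi', 'bmw'],
    'Fuel Type': ['gas', 'diesel'],
    'Price': [13950, 16430],
})

subset = select_columns(df, ['Make', 'Price'])
print(list(subset.columns))  # ['Make', 'Price']
```

The resulting subset can then be passed to schema.validate in place of the full frame, provided the schema defines exactly those columns in the same order.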