
Data validation with pandas_schema

I would like to read a CSV data file with the Python pandas library and create visualizations.
First, I decided to validate the data.
I want to use the pandas_schema module to validate the data in each column.
The initial data file has 26 columns.
My code:

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation, CanConvertValidation, MatchesPatternValidation, InRangeValidation, InListValidation



schema = Schema ([
    Column('Symboling', [InRangeValidation(-3,3)] ) ,  #integer from -3 to 3 
    Column('Normalized Loss', [InRangeValidation(65,256)] )  , # integer from 65 to 256
    Column('Make',[LeadingWhitespaceValidation(), TrailingWhitespaceValidation()] )  , # text 
    Column('Fuel Type', [InListValidation(['diesel', 'gas'])]), # diesel, gas
    Column('Aspiration'), # text 
    Column('Num of Doors' , [InListValidation(['two', 'four'])]), # text (two, four)
    Column('Body Style' , [InListValidation(['hardtop', 'wagon','sedan','hatchback', 'convertible'])] ), # text: hardtop, wagon, sedan, hatchback, convertible 
    Column('Drive Wheels' , [InListValidation(['4wd', 'fwd' , 'rwd'])]), # text: 4wd, fwd, rwd 
    Column('Engine Location' , [InListValidation(['front', 'rear'])]), # text: front, rear
    Column('Wheel Base' , [InRangeValidation([86.6,120.9])] ) ,  # decimal from 86.6 to 120.9 
    Column('Length' , [InRangeValidation(65,256)] )  ,  # decimal from 141.1 to 208.1
    Column('Width' , [InRangeValidation(60.3,72.3)] ) ,  # decimal from 60.3 to 72.3 
    Column('Height' , [InRangeValidation(47.8,59.8)] ) ,   # decimal from 47.8 to 59.8
    Column('Curb Weight' , [InRangeValidation(1488,4066)] ) ,   # integer from 1488 to 4066
    Column('Engine Type'),[InListValidation(['ohc', 'dohcv', 'l', 'ohc', 'ohcf', 'ohcv', 'rotor'])] , # text 
    Column('Num of Cylinders' , [InListValidation(['two','four','three','five','six','eight','twelve'])]) , # text: eight, five, four, six, three, twelve, two 
    Column('Engine Size' , [InRangeValidation(61,326)]) ,  # integer from 61 to 326 
    Column('Fuel System' , [InListValidation(['1bbl', '2bbl', '4bbl', 'idi','mfi','mpfi','spdi','spfi'])]), #string: 1bbl, 2bbl, 4bbl, idi,mfi,mpfi,spdi,spfi 
    Column('Bore' , [InRangeValidation(2.54,3.94)] ) , # decimal from 2.54 to 3.94 
    Column('Stroke', [InRangeValidation(2.07,4.17)] ) , #decimal from 2.07 to 4.17 
    Column('Compression Ratio' , [InRangeValidation(7,23)] ), #  integer: from 7 to 23 
    Column('Horsepower' , [InRangeValidation(48,288)] ),  # integer:from 48 to 288 
    Column('Peak rmp'), [InRangeValidation(4150,6600)]  , # integer: from 4150 to 6600 
    Column('City mpg'), [InRangeValidation(13,49)]  , #integer: from 13 to 49 
    Column('Highway mpg'), [InRangeValidation(16,54)] ,  # integer: 16 to 54 
    Column('Price'), [InRangeValidation(5118,45400)]  # integer from 5118 to 45400 
])

test_file = pd.read_csv(r'E:\_Python_Projects_Data\Data_Visualization\Autos_Data_Set\Autos_Import_1985.csv')  # raw string keeps the backslashes literal
errors = schema.validate(test_file) 
for error in errors: 
    print(error)

After running the code, I get this message:
 The invalid number of columns. The schema specifies 31, but the data frame has 26

I don't understand how this happened: my schema has 26 columns, and the data file has 26 columns. Any suggestions?
Thank you.

For the record: your Schema definition has a few typos, where the parenthesis of a Column definition is closed too early. Each validator list that ends up outside its Column call becomes a separate list element, so the list has 31 elements, all of which are interpreted as columns.
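A minimal sketch of the effect, using a hypothetical stand-in Column class (not the real pandas_schema one) and a placeholder validator list, so it runs without the library. In the question, five entries ('Engine Type', 'Peak rmp', 'City mpg', 'Highway mpg' and 'Price') have this problem, which is how 26 columns turn into 31 elements.

```python
class Column:
    """Hypothetical stand-in for pandas_schema.Column, used only to count list elements."""

    def __init__(self, name, validations=None):
        self.name = name
        self.validations = validations or []


in_range = ["InRangeValidation(13, 49)"]  # placeholder for a real validator list

# Parenthesis closed too early: each validator list becomes its own list element.
broken = [
    Column('City mpg'), in_range,
    Column('Price'), in_range,
]

# Validator list passed as the second argument to Column.
fixed = [
    Column('City mpg', in_range),
    Column('Price', in_range),
]

print(len(broken), len(fixed))
```

The broken list has twice as many elements as intended, and its Column objects carry no validators at all.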

The correct definition should be:

schema = Schema([
    Column('Symboling', [InRangeValidation(-3,3)]),  #integer from -3 to 3 
    Column('Normalized Loss', [InRangeValidation(65,256)]), # integer from 65 to 256
    Column('Make', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()] ), # text 
    Column('Fuel Type', [InListValidation(['diesel', 'gas'])]), # diesel, gas
    Column('Aspiration'), # text 
    Column('Num of Doors', [InListValidation(['two', 'four'])]), # text (two, four)
    Column('Body Style', [InListValidation(['hardtop', 'wagon','sedan','hatchback', 'convertible'])]), # text: hardtop, wagon, sedan, hatchback, convertible 
    Column('Drive Wheels', [InListValidation(['4wd', 'fwd', 'rwd'])]), # text: 4wd, fwd, rwd 
    Column('Engine Location', [InListValidation(['front', 'rear'])]), # text: front, rear
    Column('Wheel Base', [InRangeValidation(86.6, 120.9)]),  # decimal from 86.6 to 120.9 (min and max are separate arguments, not a list)
    Column('Length', [InRangeValidation(141.1, 208.1)]),  # decimal from 141.1 to 208.1
    Column('Width', [InRangeValidation(60.3,72.3)]),  # decimal from 60.3 to 72.3 
    Column('Height', [InRangeValidation(47.8,59.8)]),   # decimal from 47.8 to 59.8
    Column('Curb Weight', [InRangeValidation(1488,4066)]),   # integer from 1488 to 4066
    Column('Engine Type', [InListValidation(['dohc', 'dohcv', 'l', 'ohc', 'ohcf', 'ohcv', 'rotor'])]), # text 
    Column('Num of Cylinders', [InListValidation(['two','four','three','five','six','eight','twelve'])]), # text: eight, five, four, six, three, twelve, two 
    Column('Engine Size', [InRangeValidation(61,326)]),  # integer from 61 to 326 
    Column('Fuel System', [InListValidation(['1bbl', '2bbl', '4bbl', 'idi','mfi','mpfi','spdi','spfi'])]), #string: 1bbl, 2bbl, 4bbl, idi,mfi,mpfi,spdi,spfi 
    Column('Bore', [InRangeValidation(2.54,3.94)]), # decimal from 2.54 to 3.94 
    Column('Stroke', [InRangeValidation(2.07,4.17)]), #decimal from 2.07 to 4.17 
    Column('Compression Ratio', [InRangeValidation(7,23)]), #  integer: from 7 to 23 
    Column('Horsepower', [InRangeValidation(48,288)]),  # integer:from 48 to 288 
    Column('Peak rmp', [InRangeValidation(4150,6600)]), # integer: from 4150 to 6600 
    Column('City mpg', [InRangeValidation(13,49)]), #integer: from 13 to 49 
    Column('Highway mpg', [InRangeValidation(16,54)]),  # integer: 16 to 54 
    Column('Price', [InRangeValidation(5118,45400)]),  # integer from 5118 to 45400 
])

The number of DataFrame columns must match the number of columns in the defined validation schema. An alternative is to define a new DataFrame that contains only the columns you want to check and use that for validation. (I am not sure this is the most efficient way, but it serves the purpose.)
