How to skip the lines of an Excel file loaded to a Pandas dataframe if data types are wrong (checking types)

I have just written this code:

import os
import pandas as pd

files = os.listdir(path)

#AllData = pd.DataFrame() 

for f in files:
    info = pd.read_excel(f, "File")
    info.fillna(0)
    try:
        info['Country'] = info['Country'].astype('str')
    except ValueError:
        continue
    try:
        info['Name'] = info['Name'].astype('str')
    except ValueError:
        continue
    try:
        info['Age'] = info['Age'].astype('int')
    except ValueError as error:
        continue
        
    writer = pd.ExcelWriter("Output.xlsx")
    info.to_excel(writer, "Sheet 1")
    writer.save()

It reads some Excel files, selects the sheet named "File" and puts all its data in a dataframe. Once it is done, it returns all the records.

What I want is to check the types of all the values in each column, and to skip a line from the source if a value's type is not the one I want for that column. Finally, I want to write to the output only the data that fits the types I want.

I tried to use astype, but it is not working as expected.

In short: read the source, check with astype, and if a value does not match, skip the line and keep running the code.

I first have to say that type checking and type casting are two different things.

Pandas' astype is used for type casting: it converts a value from one type to another, but it does not check whether a value is of a certain type.

But if what you want is to drop the rows that can't be cast to a numeric type, you can do it like this:

info['Age'] = pd.to_numeric(info['Age'], errors='coerce')
info = info.dropna()

Note that you don't have to use a try-except block here. We use to_numeric because it accepts errors='coerce', so any value that can't be cast becomes NaN, and then we use dropna() to remove the rows containing NaN values.
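
To make the row skipping concrete, here is a minimal sketch on a small in-memory frame rather than an Excel file. The sample data and the helper name keep_rows_with_numeric_age are assumptions made for illustration; only the column names come from the question. It passes subset=['Age'] to dropna() so that only rows whose Age failed to convert are removed, whereas a bare dropna() also removes rows with a missing value in any other column.

```python
import pandas as pd


def keep_rows_with_numeric_age(frame):
    """Return a copy of frame without the rows whose 'Age' is not numeric.

    Hypothetical helper written for this example.
    """
    out = frame.copy()
    # Values that cannot be parsed as numbers become NaN
    out["Age"] = pd.to_numeric(out["Age"], errors="coerce")
    # Drop only the rows where the Age conversion failed
    out = out.dropna(subset=["Age"]).copy()
    # No NaN is left, so the cast to integer cannot fail
    # (note that it truncates fractional values such as 30.5)
    out["Age"] = out["Age"].astype(int)
    return out


# Made-up sample data standing in for one Excel sheet
sample = pd.DataFrame(
    {
        "Country": ["France", "Spain", "Italy", "Peru"],
        "Name": ["Ann", "Luis", "Gia", "Eva"],
        "Age": [31, "unknown", "42", None],
    }
)

clean = keep_rows_with_numeric_age(sample)
print(clean)
```

In this sample the rows for Luis and Eva are skipped, and the string "42" is accepted because it can be parsed as a number.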

Update about type checking:

Here is some of the information you asked for in the comments about how to check types in pandas dataframes:

  • How to get the types inferred by pandas for each column?
  • How to check the types of all values of the whole dataframe?
  • Some useful functions for type checking
  • Ways to check types in Python

How to get the types inferred by pandas for each column?

columns_dtypes = df.dtypes

It will output something like this:

Country     object
Name        object
Age          int64
dtype: object

Note that if your column "Age" contains some NaN values, the dtype could be float64.

And when a column contains strings, its dtype will be object when you load your Excel file into a dataframe as in your example. See below for how to check whether an object is a Python string (type str).
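
As a quick check of the remark about NaN, the sketch below builds two small Series from made-up values and compares their dtypes. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Same ages, with and without a missing value (made-up data)
ages_complete = pd.Series([31, 42, 27])
ages_with_gap = pd.Series([31, np.nan, 27])

print(ages_complete.dtype)  # an integer dtype
print(ages_with_gap.dtype)  # float64, because NaN is a float
```

This is why an integer column read from a sheet with empty cells can show up as float64 in df.dtypes.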

Pandas documentation listing all dtypes: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html?highlight=basics#dtypes

Other useful information about Pandas dtypes: "What are all the dtypes that pandas recognizes?"

How to check the types of all values of the whole dataframe?

There are numerous ways of doing this.

Here is one way. I chose this code because it's clear and simple:

# Iterate over all the columns
# (items() replaces iteritems(), which was removed in pandas 2.0)
for (column_name, column_data) in info.items():
    print("column_name: ", column_name)
    print("column_data: ", column_data.values)

    # Iterate over all the values of this column
    for column_value in column_data.values:
        # print the value and its type
        print(column_value, type(column_value))
        # So here you can check the type and do something with that
        # For example, log the error to a log file
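
The loop above only prints the types. As a sketch of how the same per-value check could be used to skip rows, the example below builds a boolean mask with isinstance and keeps only the rows where every column matches. The expected_types mapping and the sample data are assumptions made for illustration, and the check assumes the cells hold plain Python objects, as they do in the object-dtype columns built here.

```python
import pandas as pd

# Hypothetical mapping from column name to the accepted Python type
expected_types = {"Country": str, "Name": str, "Age": int}

# Made-up sample data with one bad Country and one bad Age
info = pd.DataFrame(
    {
        "Country": ["France", 12, "Italy"],
        "Name": ["Ann", "Luis", "Gia"],
        "Age": [31, 27, "forty"],
    }
)

# Assume every row is valid, then rule rows out column by column
row_is_valid = pd.Series(True, index=info.index)
for column_name, expected in expected_types.items():
    matches = info[column_name].map(lambda value: isinstance(value, expected))
    row_is_valid &= matches

valid_rows = info[row_is_valid]
print(valid_rows)
```

Unlike the to_numeric approach, this rejects a string such as "42" in the Age column, because it checks the type of the value rather than whether it can be converted.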

Some useful functions for type checking:

How to test if an object (a value from a column that df.dtypes reports as object, as in the output above) is a string? Use isinstance(object_to_test, str). See: "How to find out if a Python object is a string?"

Now, if you have a column that contains strings (like "hello", "world", etc.), some of which represent integers, and you want to check whether these strings represent a number or an int, you can use these functions:

How to check if a string is an int?

def str_is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

How to check if a string is a number?

def str_is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

Python's strings have a method isdigit(), but it can't be used to check for an int or a number, because it returns False for signed strings such as one = "+1" or minus_one = "-1".

And finally, here are two common ways to check "types" in Python:
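
The difference is easy to see side by side. This sketch repeats str_is_int from above so that it runs on its own; the list of test strings is made up for the comparison.

```python
def str_is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False


# Made-up inputs: plain, signed, padded, decimal, text, superscript two
candidates = ["42", "+1", "-1", " 7 ", "4.2", "abc", "²"]

# Map each string to (isdigit result, str_is_int result)
comparison = {text: (text.isdigit(), str_is_int(text)) for text in candidates}

for text, (by_isdigit, by_int) in comparison.items():
    print(repr(text), "isdigit:", by_isdigit, "str_is_int:", by_int)
```

The two checks disagree on the signed and padded strings, which int() accepts, and on the superscript "²", which isdigit() accepts but int() rejects.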

object_to_test = 1

print( type(object_to_test) is int)
print( type(object_to_test) in (int, float) ) # Check if it is one of those types

print( isinstance(object_to_test, int) )

isinstance(object_to_test, str) will return True if object_to_test is of type str OR any subclass of str.

type(object_to_test) is str will return True only if object_to_test is exactly of type str (excluding any subclass of str).
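
A short sketch of that difference, using a hypothetical str subclass named Label and the built-in bool, which is a subclass of int:

```python
class Label(str):
    """Hypothetical str subclass, defined only for this demonstration."""


label = Label("France")

print(isinstance(label, str))  # True: a Label is also a str
print(type(label) is str)      # False: its exact type is Label

# bool is a subclass of int, so the two checks disagree here as well
print(isinstance(True, int))   # True
print(type(True) is int)       # False
```

The bool case matters when validating a column such as Age: an isinstance(value, int) check also accepts True and False.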

There is also a library called pandas-stubs that could be useful for type safety: https://github.com/VirtusLab/pandas-stubs
