How to skip the rows of an Excel file loaded into a pandas dataframe if data types are wrong (checking types)
I have just written this:
import os
import pandas as pd

files = os.listdir(path)

#AllData = pd.DataFrame()

for f in files:
    info = pd.read_excel(f, "File")
    info.fillna(0)
    try:
        info['Country'] = info['Country'].astype('str')
    except ValueError:
        continue
    try:
        info['Name'] = info['Name'].astype('str')
    except ValueError:
        continue
    try:
        info['Age'] = info['Age'].astype('int')
    except ValueError as error:
        continue
    writer = pd.ExcelWriter("Output.xlsx")
    info.to_excel(writer, "Sheet 1")
    writer.save()
It reads some Excel files, selects the sheet named "File" in each one, and puts all of its data in a dataframe. Once it is done, it returns all the records.
What I want is to check the types of all the values in each column, and to skip a row of the source if a value is not of the type I want for its column. Finally, I want to write to the output only the data that fits the types I want.
I tried to use astype, but that's not working as expected.
In short: read the source, check the type with astype, and if a value is not of that type, skip the row and keep running the code.
First, I have to say that type checking and type casting are two different things.
Pandas' astype is used for type casting: it converts values from one type to another; it does not check whether a value is of a certain type.
But if what you want is to drop the rows that can't be cast to a numeric type, you can do it like this:
info['Age'] = pd.to_numeric(info['Age'], errors='coerce')
info = info.dropna()
Note that you don't have to use a try-except block here. We use to_numeric because we can pass errors='coerce', so that any value that can't be cast becomes NaN, and then we use dropna() to remove the rows containing NaNs. Be aware that dropna() with no arguments removes rows that have a NaN in any column, not only in "Age".
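As a minimal sketch of the approach above, here is the same idea on a small made-up frame standing in for the Excel data (the sample rows and the name clean are hypothetical). It uses dropna(subset=...) so that only rows with an unconvertible "Age" are dropped, which may or may not be what you want for the other columns:

```python
import pandas as pd

# Hypothetical sample standing in for the sheet read with pd.read_excel
info = pd.DataFrame({
    "Country": ["France", "Spain", "Italy"],
    "Name": ["Ann", "Bob", None],
    "Age": [31, "unknown", 45],
})

# Values that cannot be parsed as numbers become NaN
info["Age"] = pd.to_numeric(info["Age"], errors="coerce")

# Drop only the rows whose Age could not be converted,
# then cast the remaining ages back to integers
clean = info.dropna(subset=["Age"]).copy()
clean["Age"] = clean["Age"].astype("int64")

print(clean)
```

The row with "unknown" is removed, while the row with a missing Name is kept because only the "Age" column is checked.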
Here is some additional information, as you asked in the comments, about how to check types in pandas dataframes:
How to get the types inferred by pandas for each column?
columns_dtypes = df.dtypes
It will output something like this:
Country object
Name object
Age int64
dtype: object
Note that if your column "Age" contains some NaN values, the dtype could be float64.
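A small illustration of that note, assuming two made-up Series (the names ages_complete and ages_with_gap are hypothetical): a missing value in an otherwise integer column makes pandas fall back to a floating-point dtype, because NaN is a float.

```python
import pandas as pd

# All values present: pandas infers an integer dtype
ages_complete = pd.Series([25, 30, 41])

# One missing value: the column is stored as floats so it can hold NaN
ages_with_gap = pd.Series([25, None, 41])

print(ages_complete.dtype)
print(ages_with_gap.dtype)
```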
And when a column contains strings, the dtype will be object when you load your Excel file into a dataframe as in your example. See below for how to check whether an object is a Python string (type str).
Pandas documentation listing all dtypes: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html?highlight=basics#dtypes
Other useful information about pandas dtypes: what are all the dtypes that pandas recognizes?
How to check the types of all values of the whole dataframe?
There are numerous ways of doing this.
Here is one way. I chose this code because it's clear and simple:
# Iterate over all the columns
# (DataFrame.items() replaces iteritems(), which was removed in pandas 2.0)
for (column_name, column_data) in info.items():
    print("column_name: ", column_name)
    print("column_data: ", column_data.values)
    # Iterate over all the values of this column
    for column_value in column_data.values:
        # Print the value and its type
        print(column_value, type(column_value))
        # So here you can check the type and do something with that
        # For example, log the error to a log file
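If the goal is to skip rows rather than just print types, the same per-value check can be turned into a row filter. The following is only a sketch under stated assumptions: the expected_types mapping, the helper row_has_expected_types, and the sample data are all hypothetical, and the frame is built with dtype=object so that its values stay plain Python objects.

```python
import pandas as pd

# Hypothetical mapping from column name to the Python type expected in it
expected_types = {"Country": str, "Name": str, "Age": int}

# Made-up data; dtype=object keeps the values as plain Python objects
info = pd.DataFrame({
    "Country": ["France", 12, "Italy"],
    "Name": ["Ann", "Bob", "Cleo"],
    "Age": [31, 27, "n/a"],
}, dtype=object)

def row_has_expected_types(row):
    """Return True if every checked column holds a value of the expected type."""
    return all(isinstance(row[col], typ) for col, typ in expected_types.items())

# Keep only the rows that pass the check for every column
valid_rows = info[info.apply(row_has_expected_types, axis=1)]
print(valid_rows)
```

One caveat: in a column that pandas has already inferred as int64, the values are NumPy integers rather than Python int, so an isinstance(value, int) check would need adjusting for that case.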
Some useful functions for type checking:
How to test whether an object (such as a value from a column whose dtype is reported as object by df.dtypes, as in the output above) is a string?
isinstance(object_to_test, str)
See: How to find out if a Python object is a string?
Now, if you have a column that contains strings (like "hello", "world", etc.), some of which represent integers, and you want to check whether a string represents a number or an int, you can use these functions:
How to check if a string is an int?
def str_is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False
How to check if a string is a number?
def str_is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
Python strings have a method isdigit(), but it can't be used to check for an int or a number, because it returns False for signed values such as one = "+1" or minus_one = "-1".
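To make the difference concrete, this short sketch compares isdigit() with the str_is_int helper from above on a few made-up sample strings (the samples list and results dict are only for illustration):

```python
def str_is_int(s):
    """Return True if the string can be parsed by int()."""
    try:
        int(s)
        return True
    except ValueError:
        return False

samples = ["12", "+1", "-1", "1.5", "abc"]

# Map each sample to (result of isdigit, result of str_is_int)
results = {s: (s.isdigit(), str_is_int(s)) for s in samples}

for s, (digit, as_int) in results.items():
    print(f"{s!r}: isdigit={digit}, str_is_int={as_int}")
```

The two checks agree on "12", "1.5" and "abc", but only str_is_int accepts the signed strings "+1" and "-1".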
And finally, here are two common ways to check "types" in Python:
object_to_test = 1
print(type(object_to_test) is int)
print(type(object_to_test) in (int, float))  # Check if it is one of those types
print(isinstance(object_to_test, int))
isinstance(object_to_test, str) will return True if object_to_test is of type str OR any subclass of str.
type(object_to_test) is str will return True ONLY if object_to_test is exactly of type str (excluding any subclass of str).
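The distinction only shows up with subclasses, so here is a brief sketch using a made-up subclass of str (the class Label is hypothetical and exists only for this illustration):

```python
class Label(str):
    """Hypothetical subclass of str, used only for illustration."""

plain = "hello"
label = Label("hello")

# A plain str passes both checks
print(isinstance(plain, str), type(plain) is str)

# A subclass instance passes isinstance but not the exact type check
print(isinstance(label, str), type(label) is str)
```

In most data-validation code isinstance is the more forgiving choice; the exact type check is useful when subclasses must be rejected.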
There is also a library called pandas-stubs that could be useful for type safety: https://github.com/VirtusLab/pandas-stubs .