How to force pandas read_csv to use float32 for all float columns?
Because
Note that not all columns in the raw CSV file have float types. I only need to set float32 as the default for the float columns.
Try:
import numpy as np
import pandas as pd
# Sample 100 rows of data to determine dtypes.
df_test = pd.read_csv(filename, nrows=100)
float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}
df = pd.read_csv(filename, engine='c', dtype=float32_cols)
This first reads a sample of 100 rows of data (modify as required) to determine the type of each column.
It then creates a list of the columns that are 'float64', and uses a dictionary comprehension to build a dictionary with these columns as the keys and 'np.float32' as the value for each key.
Finally, it reads the whole file using the 'c' engine (required for assigning dtypes to columns) and passes the float32_cols dictionary as the dtype parameter.
df = pd.read_csv(filename, nrows=100)
>>> df
   int_col  float1 string_col  float2
0        1     1.2          a     2.2
1        2     1.3          b     3.3
2        3     1.4          c     4.4
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float64
string_col    3 non-null object
float2        3 non-null float64
dtypes: float64(2), int64(1), object(1)
df32 = pd.read_csv(filename, engine='c', dtype={c: np.float32 for c in float_cols})
>>> df32.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float32
string_col    3 non-null object
float2        3 non-null float32
dtypes: float32(2), int64(1), object(1)
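The two-pass approach above can be wrapped in a small helper. This is only a sketch: read_csv_float32, make_buffer and CSV_TEXT are hypothetical names, and an in-memory buffer stands in for the real file so the example runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
CSV_TEXT = (
    "int_col,float1,string_col,float2\n"
    "1,1.2,a,2.2\n"
    "2,1.3,b,3.3\n"
    "3,1.4,c,4.4\n"
)


def read_csv_float32(make_buffer, sample_rows=100):
    """Read a CSV twice: once to sniff dtypes, once with float32 overrides.

    make_buffer is a zero-argument callable returning a fresh file-like
    object (or a path), because the source has to be read from the start twice.
    """
    # Pass 1: infer dtypes from a small sample.
    sample = pd.read_csv(make_buffer(), nrows=sample_rows)
    float_cols = sample.select_dtypes(include="float64").columns

    # Pass 2: read everything, overriding only the float columns.
    return pd.read_csv(make_buffer(), dtype={c: np.float32 for c in float_cols})


df32 = read_csv_float32(lambda: io.StringIO(CSV_TEXT))
print(df32.dtypes)
```

One caveat with sampling: a column that holds only integers in the first sample_rows rows but contains decimals further down is not in the override dictionary, so the full read will still infer float64 for it. Increase sample_rows if that is a concern.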
Here's a solution which does not depend on .join and does not require reading the file twice:
float64_cols = df.select_dtypes(include='float64').columns
mapper = {col_name: np.float32 for col_name in float64_cols}
df = df.astype(mapper)
Or for kicks as a one-liner:
df = df.astype({c: np.float32 for c in df.select_dtypes(include='float64').columns})
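To see what the astype conversion buys you, here is a minimal self-contained sketch. The DataFrame is made-up data standing in for the result of read_csv; memory_usage is used only to compare column sizes before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the result of pd.read_csv(...).
df = pd.DataFrame({
    "int_col": [1, 2, 3],
    "float1": [1.2, 1.3, 1.4],
    "string_col": ["a", "b", "c"],
    "float2": [2.2, 3.3, 4.4],
})

# Bytes used per column while the float columns are still float64.
before = df.memory_usage(index=False)

# Downcast only the float64 columns; other columns are left untouched.
float64_cols = df.select_dtypes(include="float64").columns
df = df.astype({c: np.float32 for c in float64_cols})

# Bytes used per column after the downcast.
after = df.memory_usage(index=False)

print(before[["float1", "float2"]])
print(after[["float1", "float2"]])
```

Keep in mind that this converts after loading, so the float64 version of the data exists in memory first. If peak memory while parsing is the bottleneck, passing dtype to read_csv is likely the better fit.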
@Alexander's is a great answer. Some columns may need to be precise. If so, you may need to add more conditionals to your list comprehension to exclude some columns; the any and all built-ins are handy:
float_cols = [c for c in df_test if all([df_test[c].dtype == "float64",
not df_test[c].name == 'Latitude', not df_test[c].name =='Longitude'])]
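If the list of columns to protect grows, a set membership test may read more easily than chaining conditions. A small sketch of the same filter, where keep_float64 and the sample data are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample standing in for pd.read_csv(filename, nrows=100).
df_test = pd.DataFrame({
    "Latitude": [51.5074, 40.7128],
    "Longitude": [-0.1278, -74.0060],
    "price": [1.5, 2.5],
    "count": [1, 2],
})

# Columns that should stay at full float64 precision.
keep_float64 = {"Latitude", "Longitude"}

# Select float64 columns, skipping the ones that must stay precise.
float_cols = [
    c for c in df_test.columns
    if df_test[c].dtype == "float64" and c not in keep_float64
]
print(float_cols)  # ['price']
```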
If you don't care about column order, there's also df.select_dtypes, which avoids having to call read_csv twice:
import pandas as pd
df = pd.read_csv("file.csv")
df_float = df.select_dtypes(include=float).astype("float32")
df_not_float = df.select_dtypes(exclude=float)
df = df_float.join(df_not_float)
Or, if you want to convert all non-string columns (e.g. integer columns) to float:
import pandas as pd
df = pd.read_csv("file.csv")
df_not_str = df.select_dtypes(exclude=object).astype("float32")
df_str = df.select_dtypes(include=object)
df = df_not_str.join(df_str)
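If column order does matter, the join-based snippets above can be followed by a reindex back to the original order. A sketch on made-up data (the frame and the names original_order, joined and restored are assumptions):

```python
import pandas as pd

# Hypothetical frame standing in for pd.read_csv("file.csv").
df = pd.DataFrame({
    "int_col": [1, 2, 3],
    "float1": [1.2, 1.3, 1.4],
    "string_col": ["a", "b", "c"],
    "float2": [2.2, 3.3, 4.4],
})
original_order = df.columns

# Split, downcast the float part, and join as in the snippet above.
df_float = df.select_dtypes(include=float).astype("float32")
df_not_float = df.select_dtypes(exclude=float)
joined = df_float.join(df_not_float)
print(list(joined.columns))  # float columns now come first

# Selecting by the saved column index restores the original order.
restored = joined[original_order]
print(list(restored.columns))
```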
I think it's slightly more efficient to call the dtypes, as opposed to jorijnsmit's solution...
jorijnsmit's:
%%timeit
df.astype({c: 'float32' for c in df.select_dtypes(include='float64').columns})
754 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
calling dtypes:
%%timeit
df.astype({c: 'float32' for c in df.dtypes.index[df.dtypes == 'float64']})
538 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
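Before swapping one expression for the other on speed grounds, it may be worth confirming that both pick the same columns on your data. A quick check on a made-up frame (timings will vary with frame shape and pandas version, so none are claimed here):

```python
import pandas as pd

# Hypothetical frame; substitute your own DataFrame.
df = pd.DataFrame({
    "int_col": [1, 2, 3],
    "float1": [1.2, 1.3, 1.4],
    "string_col": ["a", "b", "c"],
    "float2": [2.2, 3.3, 4.4],
})

# Column selection via select_dtypes.
via_select = list(df.select_dtypes(include="float64").columns)

# Column selection via a boolean mask on df.dtypes.
via_dtypes = list(df.dtypes.index[df.dtypes == "float64"])

print(via_select == via_dtypes)

# Both produce the same float32 result.
converted = df.astype({c: "float32" for c in via_dtypes})
print(converted.dtypes)
```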