Python: Check if a dataframe column contains string type
I want to check whether the columns in a dataframe consist of strings, so I can label them with numbers for machine learning purposes. Some columns consist of numbers, and I don't want to change them. Example columns can be seen below:
TRAIN FEATURES
Age Level
32.0 Silver
61.0 Silver
66.0 Silver
36.0 Gold
20.0 Silver
29.0 Silver
46.0 Silver
27.0 Silver
Thank you =)
Notice that the above answers will include DateTime, TimeStamp, Category and other datatypes. Using object is more restrictive (although I am not sure whether other dtypes would also be of object dtype):
Create the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd'],
    'b': [1, 'b', 'c', 2],
    'c': [np.nan, 2, 3, 4],
    'd': ['A', 'B', 'B', 'A'],
    'e': pd.to_datetime('today'),
})
df['d'] = df['d'].astype('category')
That will look like this:
a b c d e
0 a 1 NaN A 2018-05-17
1 b b 2.0 B 2018-05-17
2 c c 3.0 B 2018-05-17
3 d 2 4.0 A 2018-05-17
You can check the types by calling dtypes:
df.dtypes
a            object
b            object
c           float64
d          category
e    datetime64[ns]
dtype: object
You can list the string columns using the items() method and filtering by object:
> [col for col, dt in df.dtypes.items() if dt == object]
['a', 'b']
Or you can use select_dtypes to display a dataframe with only the strings:
df.select_dtypes(include=[object])
   a  b
0  a  1
1  b  b
2  c  c
3  d  2
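Yes, it's possible. You can use dtype:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd']})
# is_numeric_dtype covers integer and float dtypes; comparing a dtype
# with np.number via != is not a reliable numeric check
if not pd.api.types.is_numeric_dtype(df['a']):
    print('yes')
else:
    print('no')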
You can also select your columns by dtype using select_dtypes:
df_subset = df.select_dtypes(exclude=[np.number])
# Now you can label encode df_subset
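As a rough illustration of that last step, the sketch below label encodes only the non-numeric columns. The sample data mirrors the question's Age/Level columns, and the choice of pd.factorize is an assumption here, not something the answer prescribes; scikit-learn's LabelEncoder would be another option.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question (values are illustrative)
df = pd.DataFrame({
    'Age': [32.0, 61.0, 36.0, 20.0],
    'Level': ['Silver', 'Silver', 'Gold', 'Silver'],
})

# Names of the columns whose dtype is not numeric
non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns

encoded = df.copy()
for col in non_numeric_cols:
    # factorize assigns integer codes in order of first appearance
    codes, uniques = pd.factorize(encoded[col])
    encoded[col] = codes

print(encoded)
```

The numeric Age column is left untouched, and only Level is replaced by integer codes.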
Four years after this question was asked, I believe there is still no definitive answer.
I don't think strings were ever considered a first-class citizen in Pandas (even >= 1.0.0). As an example:
import pandas as pd
import datetime
df = pd.DataFrame({
    'str': ['a', 'b', 'c', None],
    'hete': [1, 2.0, datetime.datetime.utcnow(), None]
})
string_series = df['str']
print(string_series.dtype)
print(pd.api.types.is_string_dtype(string_series.dtype))
heterogenous_series = df['hete']
print(heterogenous_series.dtype)
print(pd.api.types.is_string_dtype(heterogenous_series.dtype))
prints:
object
True
object
True
So although hete does not contain any explicit strings, it is considered a string series.
After reading the documentation, I think the only way to make sure a series contains only strings is:
def is_string_series(s: pd.Series):
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False
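A similar value-level check can probably be written with pd.api.types.infer_dtype, which inspects the values instead of the declared dtype. This is only a sketch: the helper name looks_like_string_series is made up for illustration, and skipna=True is assumed to be the desired treatment of missing values.

```python
import pandas as pd

def looks_like_string_series(s: pd.Series) -> bool:
    """Hypothetical helper: True if all non-missing values are str."""
    # infer_dtype looks at the values; skipna=True ignores None/NaN
    return pd.api.types.infer_dtype(s, skipna=True) == 'string'

strings = pd.Series(['a', 'b', None])
mixed = pd.Series([1, 'b', 'c', 2])
numbers = pd.Series([1.0, 2.0])

print(looks_like_string_series(strings))
print(looks_like_string_series(mixed))
print(looks_like_string_series(numbers))
```

Unlike is_string_dtype on the dtype alone, this rejects object columns that mix strings with other values.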
I use a 2-step approach: first determine whether dtype == object, and then, if so, get the first row of data to see whether that column's data is a string.
c = 'my_column_name'
if df[c].dtype == object and isinstance(df[c].iloc[0], str):
    # do something
    pass
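One caveat with a first-row check is that the first value may be missing, and a single value says nothing about the rest of the column. The sketch below skips missing values before looking at the first one; first_value_is_str is a hypothetical name, and the series are built with dtype=object explicitly so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

def first_value_is_str(s: pd.Series) -> bool:
    """Hypothetical helper: inspect only the first non-missing value."""
    if s.dtype != object:
        return False
    non_missing = s.dropna()
    return len(non_missing) > 0 and isinstance(non_missing.iloc[0], str)

leading_none = pd.Series([None, 'a', 'b'], dtype=object)
starts_with_int = pd.Series([1, 'b'], dtype=object)
starts_with_str = pd.Series(['a', 1], dtype=object)

print(first_value_is_str(leading_none))
print(first_value_is_str(starts_with_int))
# Still a heuristic: this mixed column passes because only one value is checked
print(first_value_is_str(starts_with_str))
```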
Expanding on Scratch'N'Purr's answer:
>>> df = pd.DataFrame({'a': ['a','b','c','d'], 'b': [1, 'b', 'c', 2], 'c': [np.nan, 2, 3, 4]})
>>> df
a b c
0 a 1 NaN
1 b b 2.0
2 c c 3.0
3 d 2 4.0
>>> dict(filter(lambda x: not pd.api.types.is_numeric_dtype(x[1]), zip(df.columns, df.dtypes)))
{'a': dtype('O'), 'b': dtype('O')}
So I've added some columns with mixed types. You can see that the filter + dict approach yields a key: value mapping of the columns whose dtypes are not numeric. This ought to work well at scale. You could also try coercing each column to a specific type (e.g. int) and then catch the ValueError exception raised when a string column can't be converted to int. There are lots of ways to do this.
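The coercion idea might look roughly like the sketch below. The helper name can_cast_to_int is hypothetical, and catching TypeError alongside ValueError is an added assumption to cover values such as None. Note that a column of digit strings like '1' would convert successfully, so this detects non-convertible columns, not string columns as such.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd'],
    'c': [1.0, 2.0, 3.0, 4.0],
})

def can_cast_to_int(s: pd.Series) -> bool:
    """Hypothetical helper: True if every value converts to int."""
    try:
        s.astype(int)
    except (ValueError, TypeError):
        # ValueError: e.g. 'a' is not a valid integer literal
        return False
    return True

non_convertible = [col for col in df.columns if not can_cast_to_int(df[col])]
print(non_convertible)
```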
convert_dtypes was introduced in Pandas 1.0. When a column was not explicitly created as StringDtype, it can be easily converted. pd.StringDtype.is_dtype will then return True for string columns, even when they contain NA values.
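A minimal sketch of that conversion, assuming Pandas >= 1.0 and illustrative column names, could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', None],   # only strings plus a missing value
    'mixed': [1, 'b', 2.5],     # mixed Python types
})

# Let pandas pick the best available extension dtype for each column
converted = df.convert_dtypes()
print(converted.dtypes)

string_cols = [
    col for col in converted.columns
    if pd.StringDtype.is_dtype(converted[col].dtype)
]
print(string_cols)
```

The mixed column is not converted to StringDtype, so it is not reported as a string column.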
For old and new style strings, the complete series of checks could be something like this:
def has_string_type(s: pd.Series) -> bool:
    if pd.StringDtype.is_dtype(s.dtype):
        # StringDtype extension type
        return True
    if s.dtype != "object":
        # No object column - definitely no string
        return False
    try:
        s.str
    except AttributeError:
        return False
    # The str accessor exists, this must be a String column
    return True
As far as I can tell, the only sure-fire way to know what types are there is to check the values; then you can do an assertion to see if it's what you expect.
The function below gets the type of each value in a column, drops duplicates and then casts the result to a list so you can view or interact with it. This lets you deal with mixed types, objects and NAs the way you wish (of course np.nan is of type float, but I leave such things to the interested reader).
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4],
                   "col2": ["a", "b", "c", "d"],
                   "col3": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
                   })
print(df.dtypes.to_dict())
# {'col1': dtype('int64'), 'col2': dtype('O'), 'col3': dtype('O')}
def true_dtype(df):  # You could add a column filter here too
    return {col: df[col].apply(lambda x: type(x)).unique().tolist() for col in df.columns}
true_types = true_dtype(df)
print(true_types)
# {'col1': [<class 'int'>], 'col2': [<class 'str'>], 'col3': [<class 'list'>]}
print(true_types['col2'] == [str])
# True
This will return a list of the column names whose dtype is string (object in this case):
# let df be your dataframe
df.columns[df.dtypes == object].tolist()