Python: check whether a dataframe column contains strings

Question

I want to check whether the columns in a dataframe consist of strings, so that I can label them with numbers for machine learning purposes. Some columns consist of numbers, and I don't want to change those. An example of the columns is shown below:

TRAIN FEATURES
  Age              Level  
  32.0              Silver      
  61.0              Silver  
  66.0              Silver      
  36.0              Gold      
  20.0              Silver     
  29.0              Silver     
  46.0              Silver  
  27.0              Silver

Thank you =)

Answer 1

Note that the other answers (those that exclude numeric dtypes) will also pick up datetime, timestamp, category and other non-numeric dtypes.

Filtering on object is more restrictive (although I am not sure whether other kinds of data also end up with object dtype):

Create the dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd'],
    'b': [1, 'b', 'c', 2],
    'c': [np.nan, 2, 3, 4],
    'd': ['A', 'B', 'B', 'A'],
    'e': pd.to_datetime('today'),
})
df['d'] = df['d'].astype('category')

That will look like this:

   a  b    c  d          e
0  a  1  NaN  A 2018-05-17
1  b  b  2.0  B 2018-05-17
2  c  c  3.0  B 2018-05-17
3  d  2  4.0  A 2018-05-17

You can check the types by calling dtypes:

df.dtypes
a            object
b            object
c           float64
d          category
e    datetime64[ns]
dtype: object

You can list the string columns by iterating over dtypes with the items() method and filtering by object:
```
>>> [col for col, dt in df.dtypes.items() if dt == object]
['a', 'b']
```
Or you can use select_dtypes to get a dataframe containing only the object columns:
```
df.select_dtypes(include=[object])
   a  b
0  a  1
1  b  b
2  c  c
3  d  2
```

Answer 2

Yes, it's possible. Use dtype:

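import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['a','b','c','d']})
# Comparing a dtype to np.number with != does not reliably test for
# numeric dtypes (an int64 column would also pass), so use is_numeric_dtype.
if not pd.api.types.is_numeric_dtype(df['a']):
    print('yes')
else:
    print('no')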

You can also select your columns by dtype using select_dtypes:

df_subset = df.select_dtypes(exclude=[np.number])
# Now you can label encode df_subset
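
The label-encoding step mentioned above could look roughly like the sketch below. It assumes pd.factorize is an acceptable encoder (the answer does not name one), and the small Age / Level frame is made up to resemble the question's data. Keep in mind that exclude=[np.number] also selects datetime and category columns, not only strings.

```python
import numpy as np
import pandas as pd

# Hypothetical data modelled on the question's Age / Level columns.
df = pd.DataFrame({
    "Age": [32.0, 61.0, 36.0, 20.0],
    "Level": ["Silver", "Silver", "Gold", "Silver"],
})

encoded = df.copy()
for col in df.select_dtypes(exclude=[np.number]).columns:
    # factorize returns integer codes plus the unique labels they stand for
    codes, labels = pd.factorize(df[col])
    encoded[col] = codes

print(encoded)
```

Numeric columns such as Age are left untouched because they never enter the loop.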

Answer 3

Four years after this question was asked, I believe there is still no definitive answer.

I don't think strings were ever treated as first-class citizens in Pandas (even in >= 1.0.0). As an example:

import pandas as pd
import datetime

df = pd.DataFrame({
    'str': ['a', 'b', 'c', None],
    'hete': [1, 2.0, datetime.datetime.utcnow(), None]
})

string_series = df['str']
print(string_series.dtype)
print(pd.api.types.is_string_dtype(string_series.dtype))

heterogenous_series = df['hete']
print(heterogenous_series.dtype)
print(pd.api.types.is_string_dtype(heterogenous_series.dtype))

prints

object
True
object
True

So although hete does not contain any explicit strings, it is treated as a string series.

After reading the documentation, I think the only way to make sure a series contains only strings is:

def is_string_series(s : pd.Series):
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False
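
The function above treats only None as a missing value, so an object column holding np.nan (which is a float) would be rejected. The following is a sketch of a variant that skips every missing value; the name is_string_series_na and the sample frame are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def is_string_series_na(s: pd.Series) -> bool:
    """Hypothetical variant: True if every non-missing value is a str."""
    if isinstance(s.dtype, pd.StringDtype):
        return True
    if s.dtype != object:
        return False
    # Drop None / NaN before checking the remaining values one by one
    non_missing = s[s.notna()]
    return all(isinstance(v, str) for v in non_missing)

df = pd.DataFrame({
    "text": pd.Series(["a", np.nan, "c"], dtype=object),
    "mixed": pd.Series([1, "b", None], dtype=object),
    "num": [1.0, 2.0, 3.0],
})
string_cols = [c for c in df.columns if is_string_series_na(df[c])]
print(string_cols)
```

Note that an object column with no non-missing values at all would count as a string column under this sketch, which may or may not be what you want.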

Answer 4

I use a two-step approach: first check whether dtype == object, and if so, look at the first row of data to see whether that column's value is a string.

c = 'my_column_name'
# Only the first row is inspected, so this assumes that value is representative
if df[c].dtype == object and isinstance(df[c].iloc[0], str):
    pass  # do something
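
One weakness of looking at the first row is that it may be missing (None or NaN). A possible workaround, sketched here with a hypothetical helper named first_value_is_str, is to inspect the first non-missing value instead. It remains a heuristic: later rows may still hold other types.

```python
import numpy as np
import pandas as pd

def first_value_is_str(s: pd.Series) -> bool:
    """Hypothetical helper: check the first non-missing value of a column."""
    if isinstance(s.dtype, pd.StringDtype):
        return True
    if s.dtype != object:
        return False
    non_missing = s.dropna()
    # An entirely missing column gives no evidence of strings
    return len(non_missing) > 0 and isinstance(non_missing.iloc[0], str)

leading_nan = pd.Series([np.nan, "x", "y"], dtype=object)
numbers_as_objects = pd.Series([np.nan, 1, 2], dtype=object)

print(first_value_is_str(leading_nan))
print(first_value_is_str(numbers_as_objects))
```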

Answer 5

Expanding on Scratch'N'Purr's answer:

>>> df = pd.DataFrame({'a': ['a','b','c','d'], 'b': [1, 'b', 'c', 2], 'c': [np.nan, 2, 3, 4]})
>>> df 
   a  b    c
0  a  1  NaN
1  b  b  2.0
2  c  c  3.0
3  d  2  4.0

>>> dict(filter(lambda x: not pd.api.types.is_numeric_dtype(x[1]), zip(df.columns, df.dtypes)))
{'a': dtype('O'), 'b': dtype('O')}

So I've added some columns with mixed types. You can see that the filter + dict approach yields key: value mappings of the columns whose dtypes fall outside np.number. This ought to work well at scale. You could also try coercing each column to a specific type (e.g. int) and catching the ValueError raised when a string column cannot be converted. There are lots of ways to do this.
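
The coercion idea could be sketched as follows. This version swaps in pd.to_numeric instead of a cast to int, because casting a float column that contains NaN to int also raises and would wrongly flag column c. The helper name can_be_numeric is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "b", "c", "d"],
    "b": [1, "b", "c", 2],
    "c": [np.nan, 2, 3, 4],
})

def can_be_numeric(s: pd.Series) -> bool:
    """Hypothetical helper: True if the whole column parses as numbers."""
    try:
        pd.to_numeric(s)  # raises by default when a value cannot be parsed
    except (ValueError, TypeError):
        return False
    return True

non_numeric = [c for c in df.columns if not can_be_numeric(df[c])]
print(non_numeric)
```

A column of numeric-looking strings such as "1", "2" would count as numeric here, so this answers "can it be converted" rather than "is it stored as strings".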

Answer 6

convert_dtypes was introduced in Pandas 1.0. A column that was not explicitly created as StringDtype can easily be converted with it.

pd.StringDtype.is_dtype will then return True for string columns, even when they contain NA values.

For old-style and new-style strings, the complete series of checks could look something like this:

def has_string_type(s: pd.Series) -> bool:
    if pd.StringDtype.is_dtype(s.dtype):
        # StringDtype extension type
        return True

    if s.dtype != "object":
        # No object column - definitely no string
        return False

    try:
        s.str
    except AttributeError:
        return False

    # The str accessor exists, this must be a String column
    return True
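
To illustrate the convert_dtypes route described at the start of this answer, here is a minimal sketch. The two-column frame is invented for the example, and the columns are created with dtype=object on purpose so that they start out as old-style object columns.

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.Series(["a", "b", None], dtype=object),  # strings plus a missing value
    "b": pd.Series([1, "b", "c"], dtype=object),     # mixed int / str
})

# convert_dtypes infers the best extension dtype for each column
converted = df.convert_dtypes()

flags = {
    col: pd.StringDtype.is_dtype(converted[col].dtype)
    for col in converted.columns
}
print(converted.dtypes)
print(flags)
```

The mixed column cannot be inferred as a string column, so it keeps its object dtype and is not flagged.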

Answer 7

As far as I can tell, the only surefire way to know which types are present is to check the values; you can then assert that they are what you expect.

The function below collects the Python type of each value in a column, drops duplicates, and casts the result to a list so that you can inspect or work with it. This lets you handle mixed types, objects and NAs however you wish (np.nan is of type float, of course, but I leave such details to the interested reader).

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4],
                   "col2": ["a", "b", "c", "d"],
                   "col3": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
                   })

print(df.dtypes.to_dict())
# {'col1': dtype('int64'), 'col2': dtype('O'), 'col3': dtype('O')}

def true_dtype(df): # You could add a column filter here too
    return {col: df[col].apply(lambda x: type(x)).unique().tolist() for col in df.columns}

true_types = true_dtype(df)
print(true_types)
# {'col1': [<class 'int'>], 'col2': [<class 'str'>], 'col3': [<class 'list'>]}

print(true_types['col2'] == [str])
# True
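
As a possible alternative to collecting Python types by hand, pandas ships pd.api.types.infer_dtype, which also inspects the values and returns a label describing them. The sketch below assumes that a label of "string" is the criterion you care about; the frame is made up, with a mixed column added for contrast.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": pd.Series(["a", "b", "c", "d"], dtype=object),
    "col3": pd.Series([1, "b", "c", 2], dtype=object),  # mixed int / str
})

# infer_dtype looks at the values, not only at the column's dtype
inferred = {
    col: pd.api.types.infer_dtype(df[col], skipna=True)
    for col in df.columns
}
string_cols = [col for col, kind in inferred.items() if kind == "string"]

print(inferred)
print(string_cols)
```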

Answer 8

This returns a list of the names of the columns whose dtype is string (object, in this case):

#let df be your dataframe     
df.columns[df.dtypes==object].tolist()

8 answers

Answer 1
14 2018-05-17 12:20:09

Answer 2
12 2017-03-27 15:13:10

Answer 3
11 2021-04-08 09:32:45

Answer 4
10 2019-09-19 20:36:55

Answer 5
1 2017-03-27 15:32:35

Answer 6
1 2021-12-08 07:35:13

Answer 7
0 2021-05-03 13:12:08

Answer 8
0 2021-08-24 16:17:11
