
Python: Check if dataframe column contains string type

I want to check whether the columns in a dataframe consist of strings so that I can label them with numbers for machine learning purposes. Some columns consist of numbers, and I don't want to change them. Example columns are shown below:

TRAIN FEATURES
  Age              Level  
  32.0              Silver      
  61.0              Silver  
  66.0              Silver      
  36.0              Gold      
  20.0              Silver     
  29.0              Silver     
  46.0              Silver  
  27.0              Silver      

Thank you =)

Note that the above answers will also include DateTime, TimeStamp, Category and other datatypes.

Using object is more restrictive (although I am not sure whether other dtypes would also be reported as object dtype):

  1. Create the dataframe:

     import numpy as np
     import pandas as pd

     df = pd.DataFrame({
         'a': ['a', 'b', 'c', 'd'],
         'b': [1, 'b', 'c', 2],
         'c': [np.nan, 2, 3, 4],
         'd': ['A', 'B', 'B', 'A'],
         'e': pd.to_datetime('today'),
     })
     df['d'] = df['d'].astype('category')

     That will look like this:

        a  b    c  d          e
     0  a  1  NaN  A 2018-05-17
     1  b  b  2.0  B 2018-05-17
     2  c  c  3.0  B 2018-05-17
     3  d  2  4.0  A 2018-05-17

  2. You can check the types by calling dtypes:

     >>> df.dtypes
     a            object
     b            object
     c           float64
     d          category
     e    datetime64[ns]
     dtype: object

  3. You can list the string columns by using the items() method and filtering by object:

     >>> [col for col, dt in df.dtypes.items() if dt == object]
     ['a', 'b']

  4. Or you can use select_dtypes to display a dataframe with only the strings:

     >>> df.select_dtypes(include=[object])
        a  b
     0  a  1
     1  b  b
     2  c  c
     3  d  2
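The difference between excluding np.number and including object can be seen side by side in the sketch below. The frame is illustrative only, and columns a and b are cast to object explicitly so that the result does not depend on which default string dtype the installed pandas version uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "b", "c", "d"],
    "b": [1, "b", "c", 2],
    "c": [np.nan, 2, 3, 4],
    "d": ["A", "B", "B", "A"],
    "e": pd.to_datetime(["2018-05-17"] * 4),
})
# Explicit casts: object for the string-like columns, category for 'd'
df = df.astype({"a": object, "b": object, "d": "category"})

# Excluding numbers keeps everything that is not numeric,
# which includes the category and datetime columns.
non_numeric = df.select_dtypes(exclude=[np.number]).columns.tolist()

# Including only object keeps just the object-dtype columns.
object_only = df.select_dtypes(include=[object]).columns.tolist()

print(non_numeric)
print(object_only)
```

Note that column b is still reported even though it mixes integers and strings, so an object dtype alone does not guarantee that every value is a string.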

Yes, it's possible. Use dtype:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['a','b','c','d']})
# Comparing a dtype to np.number with != is not a reliable subtype check
if not pd.api.types.is_numeric_dtype(df['a']):
    print('yes')
else:
    print('no')

You can also select columns by dtype using select_dtypes:

df_subset = df.select_dtypes(exclude=[np.number])
# Now you can label-encode df_subset
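As a rough sketch of the label-encoding step, the non-numeric columns can be replaced with integer codes. The answer does not say which encoder to use; pd.factorize is assumed here as one simple option, and the data is a shortened copy of the table from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [32.0, 61.0, 66.0, 36.0],
    "Level": ["Silver", "Silver", "Silver", "Gold"],
})

encoded = df.copy()
for col in df.select_dtypes(exclude=[np.number]).columns:
    # factorize assigns integer codes in order of first appearance
    codes, uniques = pd.factorize(df[col])
    encoded[col] = codes

print(encoded)
```

The numeric Age column is left untouched, while Level becomes 0 for Silver and 1 for Gold. The uniques array can be kept if the codes need to be mapped back to labels later.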

It has been 4 years since this question was asked, and I believe there is still no definitive answer.

I don't think strings were ever considered a first-class citizen in Pandas (even >= 1.0.0). As an example:

import pandas as pd
import datetime

df = pd.DataFrame({
    'str': ['a', 'b', 'c', None],
    'hete': [1, 2.0, datetime.datetime.utcnow(), None]
})

string_series = df['str']
print(string_series.dtype)
print(pd.api.types.is_string_dtype(string_series.dtype))

heterogenous_series = df['hete']
print(heterogenous_series.dtype)
print(pd.api.types.is_string_dtype(heterogenous_series.dtype))

prints

object
True
object
True

So although hete does not contain any explicit strings, it is treated as a string series.

After reading the documentation, I think the only way to make sure a series contains only strings is:

def is_string_series(s : pd.Series):
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False
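A related option, not used in the answer above, is pandas.api.types.infer_dtype, which inspects the values rather than the declared dtype. The sketch below assumes that treating missing values as "don't care" (skipna=True) is acceptable; looks_like_strings is a hypothetical helper name.

```python
import pandas as pd
from pandas.api.types import infer_dtype

strings = pd.Series(["a", "b", "c", None], dtype=object)
mixed = pd.Series([1, "b", "c", 2], dtype=object)

def looks_like_strings(s: pd.Series) -> bool:
    """Hypothetical helper: True when every non-missing value is a str."""
    # skipna=True ignores missing values when inferring the type
    return infer_dtype(s, skipna=True) == "string"

print(looks_like_strings(strings))
print(looks_like_strings(mixed))
```

Like the explicit loop, this still has to scan the values, so it costs roughly the same on large object columns.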

I use a 2-step approach: first check whether dtype == object, and if so, take the first row of data to see whether that column's value is a string.

c = 'my_column_name'
if df[c].dtype == object and isinstance(df.iloc[0][c], str):
    pass  # do something
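One weakness of looking at the first row is that it may hold a missing value. A possible variation, sketched here under the assumption that the column uses object dtype as in the answer, looks at the first non-missing value instead. The helper name first_value_is_str is made up for illustration.

```python
import numpy as np
import pandas as pd

def first_value_is_str(s: pd.Series) -> bool:
    """Hypothetical helper: inspect the first non-missing value only."""
    if s.dtype != object:
        return False
    non_null = s.dropna()
    return len(non_null) > 0 and isinstance(non_null.iloc[0], str)

s = pd.Series([np.nan, "Silver", "Gold"], dtype=object)
print(first_value_is_str(s))
```

This remains a heuristic: a mixed column whose first value happens to be a string would still pass.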

Expanding on Scratch'N'Purr's answer:

>>> df = pd.DataFrame({'a': ['a','b','c','d'], 'b': [1, 'b', 'c', 2], 'c': [np.nan, 2, 3, 4]})
>>> df 
   a  b    c
0  a  1  NaN
1  b  b  2.0
2  c  c  3.0
3  d  2  4.0

>>> dict(filter(lambda x: not pd.api.types.is_numeric_dtype(x[1]), zip(df.columns, df.dtypes)))
{'a': dtype('O'), 'b': dtype('O')}

So I've added some columns with mixed types. You can see that the filter + dict approach yields a key: value mapping of the columns whose dtypes fall outside np.number. This ought to work well at scale. You could also try coercing each column to a specific type (e.g. int) and catch the ValueError raised when a string column can't be converted to int. There are lots of ways to do this.
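The coercion idea could look roughly like the sketch below. It swaps the int conversion mentioned above for pd.to_numeric, which is an assumption on my part, and is_coercible_to_number is a hypothetical helper name.

```python
import pandas as pd

def is_coercible_to_number(s: pd.Series) -> bool:
    """Hypothetical helper: True if every value can be parsed as a number."""
    try:
        pd.to_numeric(s)  # errors="raise" is the default
    except (ValueError, TypeError):
        return False
    return True

df = pd.DataFrame({"a": ["a", "b"], "b": ["1", "2"], "c": [1.5, 2.5]})
string_like = [col for col in df.columns if not is_coercible_to_number(df[col])]
print(string_like)
```

Column b holds strings that parse as numbers, so it is not flagged. Whether that is desirable depends on whether numeric-looking text should be label-encoded or converted.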

convert_dtypes was introduced in Pandas 1.0. A column that was not explicitly created as StringDtype can easily be converted with it.

pd.StringDtype.is_dtype will then return True for string columns, even when they contain NA values.

For old- and new-style strings, the complete series of checks could be something like this:
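A small sketch of that conversion follows. The column names are illustrative, and the columns are created with an explicit object dtype to mimic frames that predate StringDtype.

```python
import pandas as pd

df = pd.DataFrame({
    "name": pd.Series(["a", None, "c"], dtype=object),
    "mixed": pd.Series([1, "b", 2.5], dtype=object),
})

# convert_dtypes picks the best nullable dtype for each column
converted = df.convert_dtypes()
print(converted.dtypes)

is_string = {col: pd.StringDtype.is_dtype(dt) for col, dt in converted.dtypes.items()}
print(is_string)
```

The name column is recognized as a string column despite the missing value, while the mixed column stays object and is reported as not being a string column.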

def has_string_type(s: pd.Series) -> bool:
    if pd.StringDtype.is_dtype(s.dtype):
        # StringDtype extension type
        return True

    if s.dtype != "object":
        # No object column - definitely no string
        return False

    try:
        s.str
    except AttributeError:
        return False

    # The str accessor exists, this must be a String column
    return True

As far as I can tell, the only sure-fire way to know which types are present is to check the values; you can then make an assertion to see whether they are what you expect.

The function below gets the type of each value in a column, drops duplicates, and casts the result to a list so you can view or interact with it. This lets you deal with mixed types, objects and NAs the way you wish (of course np.nan is of type float, but I leave such things to the interested reader).

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4],
                   "col2": ["a", "b", "c", "d"],
                   "col3": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
                   })

print(df.dtypes.to_dict())
# {'col1': dtype('int64'), 'col2': dtype('O'), 'col3': dtype('O')}

def true_dtype(df): # You could add a column filter here too
    return {col: df[col].apply(lambda x: type(x)).unique().tolist() for col in df.columns}

true_types = true_dtype(df)
print(true_types)
# {'col1': [<class 'int'>], 'col2': [<class 'str'>], 'col3': [<class 'list'>]}

print(true_types['col2'] == [str])
# True
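For the missing-value case left open above, one possible variant drops NAs before collecting the types. This is only a sketch: non_null_types is a hypothetical name, and ignoring missing values is an assumption that may not suit every dataset.

```python
import numpy as np
import pandas as pd

def non_null_types(df: pd.DataFrame) -> dict:
    """Hypothetical variant of true_dtype that skips missing values."""
    return {col: df[col].dropna().map(type).unique().tolist() for col in df.columns}

df = pd.DataFrame({
    "col1": [1.0, 2.0, np.nan],
    "col2": pd.Series(["a", None, "c"], dtype=object),
    "col3": pd.Series([1, "b", None], dtype=object),
})

types = non_null_types(df)
# Keep only the columns whose non-missing values are all str
string_cols = [col for col, found in types.items() if found == [str]]
print(string_cols)
```

Here only col2 qualifies, because col3 mixes int and str values.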

This will return a list of the column names whose dtype is string (object in this case):

#let df be your dataframe     
df.columns[df.dtypes==object].tolist()

