Separate numerical and categorical variables in a pandas DataFrame

Question

I have a huge dataset in Spark. I took only its headers and saved them in a pandas DataFrame.

Now I want to build separate lists from it, one for the categorical columns and one for the numerical columns:

df2 = df.dtypes
df3 = pd.DataFrame(df2)
print(df3)

df4= df3.filter(df3[1] = 'String')

This statement gives an error:

SyntaxError: keyword can't be an expression

Answer 1

You don't need pandas. Use the PySpark DataFrame's describe() to find all numeric and string columns (this skips column types such as date, timestamp, array, struct, etc.), and then filter out the StringType() columns using the information from df.dtypes:

from datetime import datetime
df = spark.createDataFrame([ (1, 12.3, 1.5, 'test', 13.23, datetime(2019,9,23)) ], ['i1', 'd2', 'f3', 's4', 'd5', 'dt'])
# DataFrame[i1: bigint, d2: double, f3: double, s4: string, d5: double, dt: timestamp]

# find all numeric and string columns from df (remove the first column which is `summary`)
cols = df.limit(100).describe().columns[1:]
# ['i1', 'd2', 'f3', 's4', 'd5'] 

# get a mapping of column vs dtypes of the df:
dtype_mapping = dict(df.dtypes)
#{'d2': 'double',
# 'd5': 'double',
# 'dt': 'timestamp',
# 'f3': 'double',
# 'i1': 'bigint',
# 's4': 'string'}

# filter out string-type from cols using the above mapping:
numeric_cols = [ c for c in cols if dtype_mapping[c] != 'string' ]
# ['i1', 'd2', 'f3', 'd5']
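
If you also want the string columns as their own list, the same df.dtypes information can be split without calling describe() at all. The following is only a sketch: `split_columns` and `NUMERIC_TYPES` are hypothetical names, and the `dtypes` list is a hand-written stand-in for what `df.dtypes` returns for the example DataFrame above, so it runs without a Spark session.

```python
# Numeric type names as they appear in the strings of PySpark's df.dtypes.
# decimal columns show up as e.g. 'decimal(10,2)', so they are matched by prefix.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def split_columns(dtypes):
    """Split (column, type) pairs into numeric, categorical and other columns."""
    numeric, categorical, other = [], [], []
    for name, dtype in dtypes:
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal"):
            numeric.append(name)
        elif dtype == "string":
            categorical.append(name)
        else:
            # date, timestamp, array, struct, ... end up here
            other.append(name)
    return numeric, categorical, other


# Stand-in for df.dtypes of the example DataFrame (assumption: same schema).
dtypes = [("i1", "bigint"), ("d2", "double"), ("f3", "double"),
          ("s4", "string"), ("d5", "double"), ("dt", "timestamp")]

numeric_cols, categorical_cols, other_cols = split_columns(dtypes)
print(numeric_cols)      # ['i1', 'd2', 'f3', 'd5']
print(categorical_cols)  # ['s4']
print(other_cols)        # ['dt']
```

With a real Spark DataFrame you would pass `df.dtypes` directly. Whether string columns should count as "categorical" is a modelling decision, not something Spark decides for you.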

Answer 2

Besides the missing second '=' (a comparison needs '=='), a few other things are off. The column you are trying to access is 0, not 1. Also, there is no 'String' data type in a pandas DataFrame; string columns have dtype 'object'. You could try something like this:

df2 = df.dtypes
df3 = pd.DataFrame(df2)
print(df3)
# select rows with a boolean mask; DataFrame.filter() selects by label, not by value
df4 = df3[df3.iloc[:, 0] == 'object']
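
Note that the snippet above assumes df is a pandas DataFrame. If df is still the Spark DataFrame from the question, df.dtypes is a list of (column, type) pairs, so the resulting pandas frame has two columns and the type strings are lowercase, for example 'string'. A rough sketch of that case, where `spark_dtypes` is a hand-written stand-in and the column labels "column" and "dtype" are chosen here for illustration:

```python
import pandas as pd

# Hypothetical stand-in for Spark's df.dtypes: a list of (column, type) pairs.
spark_dtypes = [("i1", "bigint"), ("d2", "double"), ("s4", "string")]

# Name the two columns explicitly instead of relying on the default 0 and 1.
df3 = pd.DataFrame(spark_dtypes, columns=["column", "dtype"])

# A boolean mask picks the rows whose type is (or is not) 'string'.
is_string = df3["dtype"] == "string"
categorical = df3.loc[is_string, "column"].tolist()
non_string = df3.loc[~is_string, "column"].tolist()

print(categorical)  # ['s4']
print(non_string)   # ['i1', 'd2']
```

Keep in mind that "not a string" is not the same as "numeric": date or timestamp columns would also land in `non_string`.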

Answer 3

You can get the non-numeric columns from a DataFrame like this:

# np.object was removed in NumPy 1.24; the builtin object works the same way here
df.loc[:, df.dtypes == object]
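
An alternative that avoids comparing dtypes by hand is pandas' `select_dtypes`. This is a minimal sketch on a small made-up DataFrame (the column names and values are illustrative only):

```python
import pandas as pd

# Small illustrative frame; the data is made up.
df = pd.DataFrame({
    "i1": [1, 2],
    "d2": [12.3, 4.5],
    "s4": ["test", "demo"],
})

# 'number' covers integer and float dtypes.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric: strings, but also booleans or datetimes if present.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['i1', 'd2']
print(categorical_cols)  # ['s4']
```

Excluding 'number' rather than including 'object' keeps the split exhaustive, at the cost of lumping any datetime or boolean columns in with the categorical ones.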

3 answers

Answer 1
1 2019-09-24 11:34:51

Answer 2
0 2019-09-24 09:03:56

Answer 3
0 2019-09-24 10:23:11
