Convert NumPy array to Pandas DataFrame with columns

Question

I want to normalize both my categorical and my numeric values.

cols = df.columns.values.tolist()
df_num = df.drop(CAT_COLUMNS, axis=1)
df_num = df_num.as_matrix()
df_num = preprocessing.StandardScaler().fit_transform(df_num)

df.fillna('NA', inplace=True)
df_cat = df.T.to_dict().values()

vec_cat = DictVectorizer( sparse=False )
df_cat = vec_cat.fit_transform(df_cat)

After that I need to combine the two NumPy arrays back into a pandas DataFrame, but the approach below doesn't work for me.

mas = np.hstack((df_num, df_cat))
df = pd.DataFrame(data=mas, columns=cols)

Error message: ValueError: Shape of passed values is (475, 243), indices imply (83, 243)

One more approach:

columns = df.columns.values.tolist()
for col in columns:
    try:
        if col in CAT_COLUMNS:
            df[col] = pd.get_dummies(df[col])
        else:
            df[col] = df[col].apply(preprocessing.StandardScaler().fit)
    except Exception, err:
        print 'Column: %s and msg=%s' % (col, err.message)

Error message:

Column: DATE and msg=Singleton array array(1444424400.0) cannot be considered a valid collection.
Column: QTR_HR_START and msg=Singleton array array(21600000L, dtype=int64) cannot be considered a valid collection.
...

P.S. Is there any way to avoid NumPy et al.? For example, I would like to make use of the pandas_ml library.

Answer 1

What you are looking for is pandas.get_dummies(). It performs one-hot encoding on categorical columns and returns a DataFrame. From there you can use pandas.concat([existing_df, new_df], axis=1) to add the new columns to your existing DataFrame. This avoids the use of a NumPy array.

An example of how it could be used:

for cat_column in CAT_COLUMNS:
    dummy_df = pd.get_dummies(df[cat_column])

    # Optionally rename columns to indicate the categorical feature name
    dummy_df.columns = ["%s_%s" % (cat_column, col) for col in dummy_df.columns]
    df = pd.concat([df, dummy_df], axis=1)
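
As a rough, self-contained sketch of the same idea, the snippet below one-hot encodes the categorical columns in one call and standardizes the numeric ones with plain pandas arithmetic, so no NumPy array is handled explicitly. The toy data and the column names ("color", "size") are made up for illustration, and the z-score with ddof=0 is only an assumed stand-in for StandardScaler, which also divides by the population standard deviation.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 4.0],
})
CAT_COLUMNS = ["color"]

# One-hot encode every categorical column; the prefix keeps the source column name.
dummies = pd.get_dummies(df[CAT_COLUMNS], prefix=CAT_COLUMNS)

# Standardize the numeric columns with pandas only (population std, ddof=0).
num = df.drop(columns=CAT_COLUMNS)
num_scaled = (num - num.mean()) / num.std(ddof=0)

# axis=1 places the two frames side by side, aligned on the shared row index.
result = pd.concat([num_scaled, dummies], axis=1)
print(result.columns.tolist())
```

Because both pieces keep the original row index, pd.concat can align them without any manual bookkeeping of column names.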

Answer 2

What about the following fairly simple approach?

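def normalize_dataframe(df):
    # Assumes pandas (pd), sklearn.preprocessing and CAT_COLUMNS are in scope.
    for col in df.columns.tolist():
        try:
            if col in CAT_COLUMNS:
                # Replace the categorical column with its one-hot columns.
                dummies = pd.get_dummies(df[col], prefix=col)
                df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
            else:
                # StandardScaler expects 2-D input, hence df[[col]].
                scaled = preprocessing.StandardScaler().fit_transform(df[[col]])
                df[col] = scaled.ravel()
        except Exception as err:
            print('Column: %s and msg=%s' % (col, err))
    return df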
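
Regarding the ValueError in the question: it appears to arise because the stacked array has more columns than the original list of names, since each categorical column expands into several indicator columns. A minimal sketch of the round trip, assuming toy data and made-up column names ("city", "temp"), builds the new name list alongside the array. With DictVectorizer the names would instead come from the fitted vectorizer (get_feature_names_out() in recent scikit-learn releases).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"city": ["a", "b", "a"], "temp": [10.0, 20.0, 30.0]})
CAT_COLUMNS = ["city"]

# Numeric part: standardize column-wise (np.std defaults to ddof=0).
num_cols = [c for c in df.columns if c not in CAT_COLUMNS]
num_arr = df[num_cols].to_numpy(dtype=float)
num_arr = (num_arr - num_arr.mean(axis=0)) / num_arr.std(axis=0)

# Categorical part: one-hot encode and remember the generated column names.
dummies = pd.get_dummies(df[CAT_COLUMNS])
cat_arr = dummies.to_numpy(dtype=float)

# The name list must have exactly as many entries as the stacked array has columns.
stacked = np.hstack((num_arr, cat_arr))
new_cols = num_cols + dummies.columns.tolist()
out = pd.DataFrame(stacked, columns=new_cols, index=df.index)
print(out.shape)
```

Passing the original index keeps the rows of the rebuilt DataFrame aligned with the source data.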

2 answers

Answer 1 | 2 | 2015-12-25 16:55:02
Answer 2 | 0 | accepted | 2015-12-28 18:03:14
