
How to map features from the output of a VectorAssembler back to the column names in Spark ML?

I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, P-values and t-values for each column in my dataset. However, in order to train a linear regression model I had to create a feature vector using Spark's VectorAssembler, and now for each row I have a single feature vector and the target column. When I try to access Spark's built-in regression summary statistics, I get a very raw list of numbers for each statistic, and there's no way to know which attribute corresponds to which value, which is really difficult to figure out manually with a large number of columns. How do I map these values back to the column names?
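For reference, these lists come from the model's training summary; a minimal sketch of the access pattern (assuming a fitted LinearRegressionModel named lr_model, an illustrative name):

# lr_model: a fitted pyspark.ml.regression.LinearRegressionModel
print("Coefficients:", lr_model.coefficients)
print("P-Value:", lr_model.summary.pValues)
print("t-statistic:", lr_model.summary.tValues)
print("Coefficient Standard Errors:", lr_model.summary.coefficientStandardErrors)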

For example, my current output looks something like this:

Coefficients: [-187.807832407,-187.058926726,85.1716641376,10595.3352802,-127.258892837,-39.2827730493,-1206.47228704,33.7078197705,99.9956812528]

P-Value: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18589731365614548, 0.275173571416679, 0.0]

t-statistic: [-23.348593508995318, -44.72813283953004, 19.836508234714472, 144.49248881747755, -16.547272230754242, -9.560681351483941, -19.563547400189073, 1.3228378389036228, 1.0912415361190977, 20.383256127350474]

Coefficient Standard Errors: [8.043646497811427, 4.182131353367049, 4.293682291754585, 73.32793120907755, 7.690626652102948, 4.108783841348964, 61.669402913526625, 25.481445101737247, 91.63478289909655, 609.7007361468519]

These numbers mean nothing unless I know which attribute they correspond to. But in my DataFrame I only have one column called "features", which contains rows of sparse Vectors.

This is an even bigger problem when I have one-hot encoded features, because if I have one variable with an encoding of length n, I will get n corresponding coefficients/p-values/t-values etc.

As of today Spark doesn't provide any method that can do this for you, so you have to create your own. Let's say your data looks like this:

import random
random.seed(1)

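# sc is the SparkContext; it is predefined in the PySpark shell and
# notebooks, otherwise obtain it from a SparkSession via spark.sparkContext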
df = sc.parallelize([(
    random.choice([0.0, 1.0]), 
    random.choice(["a", "b", "c"]),
    random.choice(["foo", "bar"]),
    random.randint(0, 100),
    random.random(),
) for _ in range(100)]).toDF(["label", "x1", "x2", "x3", "x4"])

and is processed using the following pipeline:

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

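# index the string columns, one-hot encode the indices, and assemble
# everything (together with the raw numeric columns) into a single
# "features" vector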
indexers = [
    StringIndexer(inputCol=c, outputCol="{}_idx".format(c)) for c in ["x1", "x2"]]
encoders = [
    OneHotEncoder(
        inputCol=idx.getOutputCol(),
        outputCol="{0}_enc".format(idx.getOutputCol())) for idx in indexers]
assembler = VectorAssembler(
    inputCols=[enc.getOutputCol() for enc in encoders] + ["x3", "x4"],
    outputCol="features")

pipeline = Pipeline(
    stages=indexers + encoders + [assembler, LinearRegression()])
model = pipeline.fit(df)

Get the LinearRegressionModel:

lrm = model.stages[-1]

Transform the data:

transformed = model.transform(df)

Extract and flatten ML attributes:

from itertools import chain

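# VectorAssembler stores per-slot metadata under "ml_attr"; flatten the
# attribute groups (e.g. "binary" and "numeric") and sort by vector index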
attrs = sorted(
    (attr["idx"], attr["name"]) for attr in (chain(*transformed
        .schema[lrm.summary.featuresCol]
        .metadata["ml_attr"]["attrs"].values())))
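For the pipeline above this yields sorted (index, name) pairs, consistent with the names in the outputs below:

attrs
# [(0, 'x1_idx_enc_a'), (1, 'x1_idx_enc_c'), (2, 'x2_idx_enc_foo'),
#  (3, 'x3'), (4, 'x4')]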

and map to the output:

[(name, lrm.summary.pValues[idx]) for idx, name in attrs]
[('x1_idx_enc_a', 0.26400012641279824),
 ('x1_idx_enc_c', 0.06320192217171572),
 ('x2_idx_enc_foo', 0.40447778902400433),
 ('x3', 0.1081883594783335),
 ('x4', 0.4545851609776568)]
[(name, lrm.coefficients[idx]) for idx, name in attrs]
[('x1_idx_enc_a', 0.13874401585637453),
 ('x1_idx_enc_c', 0.23498565469334595),
 ('x2_idx_enc_foo', -0.083558932128022873),
 ('x3', 0.0030186112903237442),
 ('x4', -0.12951394186593695)]
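Putting the pieces together gives the summary table the question asks for. A minimal sketch, assuming the lrm and attrs defined above; note that pValues, tValues and coefficientStandardErrors carry one extra trailing entry for the intercept when an intercept is fit, so indexing by the feature idx stays aligned:

import pandas as pd

summary = lrm.summary
stats = pd.DataFrame({
    "feature": [name for _, name in attrs],
    "coefficient": [lrm.coefficients[idx] for idx, _ in attrs],
    "p_value": [summary.pValues[idx] for idx, _ in attrs],
    "t_value": [summary.tValues[idx] for idx, _ in attrs],
    "std_error": [summary.coefficientStandardErrors[idx] for idx, _ in attrs],
})
print(stats)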

You can see the actual order of the columns here:

# note: inspect the transformed DataFrame; the raw input has no "features" column
transformed.schema["features"].metadata["ml_attr"]["attrs"]

There will usually be two attribute groups, "binary" and "numeric":

import pandas as pd

pd.DataFrame(transformed.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
             + transformed.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")

This should give the exact order of all the columns.

Here's the one-line answer:

[x["name"] for x in sorted(train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["binary"]+
   train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["numeric"], 
   key=lambda x: x["idx"])]
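Applied to the pipeline from the first answer (substituting transformed for train_downsampled and "features" for "all_features"), this would read:

[x["name"] for x in sorted(
    transformed.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
    + transformed.schema["features"].metadata["ml_attr"]["attrs"]["numeric"],
    key=lambda x: x["idx"])]
# expected: ['x1_idx_enc_a', 'x1_idx_enc_c', 'x2_idx_enc_foo', 'x3', 'x4']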

Thanks to @pratiklodha for the core of this.

What is sc at the start? I want to replicate the same example in my environment, but it throws an error that sc is not defined.
