
How to map features from the output of a VectorAssembler back to the column names in Spark ML?

I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, P-values and t-values for each column in my dataset. However, in order to train a linear regression model I had to create a feature vector using Spark's VectorAssembler, and now for each row I have a single feature vector and the target column. When I try to access Spark's built-in regression summary statistics, I get a very raw list of numbers for each statistic, and there's no way to know which attribute corresponds to which value, which is really difficult to figure out manually with a large number of columns. How do I map these values back to the column names?
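For reference, these lists come from the model's training summary; a minimal sketch of the access pattern (assuming a fitted LinearRegressionModel named lr_model, an illustrative name):

# lr_model: a fitted pyspark.ml.regression.LinearRegressionModel
print("Coefficients:", lr_model.coefficients)
print("P-Value:", lr_model.summary.pValues)
print("t-statistic:", lr_model.summary.tValues)
print("Coefficient Standard Errors:", lr_model.summary.coefficientStandardErrors)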

For example, my current output looks something like this:

Coefficients: [-187.807832407,-187.058926726,85.1716641376,10595.3352802,-127.258892837,-39.2827730493,-1206.47228704,33.7078197705,99.9956812528]

P-Value: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18589731365614548, 0.275173571416679, 0.0]

t-statistic: [-23.348593508995318, -44.72813283953004, 19.836508234714472, 144.49248881747755, -16.547272230754242, -9.560681351483941, -19.563547400189073, 1.3228378389036228, 1.0912415361190977, 20.383256127350474]

Coefficient Standard Errors: [8.043646497811427, 4.182131353367049, 4.293682291754585, 73.32793120907755, 7.690626652102948, 4.108783841348964, 61.669402913526625, 25.481445101737247, 91.63478289909655, 609.7007361468519]

These numbers mean nothing unless I know which attribute they correspond to. But in my DataFrame I only have one column called "features", which contains rows of sparse Vectors.

This is an even bigger problem when I have one-hot encoded features, because if I have one variable with an encoding of length n, I will get n corresponding coefficients/p-values/t-values etc.

As of today Spark doesn't provide any method that can do this for you, so you have to create your own. Let's say your data looks like this:

import random
random.seed(1)

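# sc is the SparkContext; it is predefined in the PySpark shell and
# notebooks, otherwise obtain it from a SparkSession via spark.sparkContext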
df = sc.parallelize([(
    random.choice([0.0, 1.0]), 
    random.choice(["a", "b", "c"]),
    random.choice(["foo", "bar"]),
    random.randint(0, 100),
    random.random(),
) for _ in range(100)]).toDF(["label", "x1", "x2", "x3", "x4"])

and is processed using the following pipeline:

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

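# index the string columns, one-hot encode the indices, and assemble
# everything (together with the raw numeric columns) into a single
# "features" vector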
indexers = [
    StringIndexer(inputCol=c, outputCol="{}_idx".format(c)) for c in ["x1", "x2"]]
encoders = [
    OneHotEncoder(
        inputCol=idx.getOutputCol(),
        outputCol="{0}_enc".format(idx.getOutputCol())) for idx in indexers]
assembler = VectorAssembler(
    inputCols=[enc.getOutputCol() for enc in encoders] + ["x3", "x4"],
    outputCol="features")

pipeline = Pipeline(
    stages=indexers + encoders + [assembler, LinearRegression()])
model = pipeline.fit(df)

Get the LinearRegressionModel:

lrm = model.stages[-1]

Transform the data:

transformed = model.transform(df)

Extract and flatten ML attributes:

from itertools import chain

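# VectorAssembler stores per-slot metadata under "ml_attr"; flatten the
# attribute groups (e.g. "binary" and "numeric") and sort by vector index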
attrs = sorted(
    (attr["idx"], attr["name"]) for attr in (chain(*transformed
        .schema[lrm.summary.featuresCol]
        .metadata["ml_attr"]["attrs"].values())))
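For the pipeline above this yields sorted (index, name) pairs, consistent with the names in the outputs below:

attrs
# [(0, 'x1_idx_enc_a'), (1, 'x1_idx_enc_c'), (2, 'x2_idx_enc_foo'),
#  (3, 'x3'), (4, 'x4')]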

and map to the output:

[(name, lrm.summary.pValues[idx]) for idx, name in attrs]
[('x1_idx_enc_a', 0.26400012641279824),
 ('x1_idx_enc_c', 0.06320192217171572),
 ('x2_idx_enc_foo', 0.40447778902400433),
 ('x3', 0.1081883594783335),
 ('x4', 0.4545851609776568)]
[(name, lrm.coefficients[idx]) for idx, name in attrs]
[('x1_idx_enc_a', 0.13874401585637453),
 ('x1_idx_enc_c', 0.23498565469334595),
 ('x2_idx_enc_foo', -0.083558932128022873),
 ('x3', 0.0030186112903237442),
 ('x4', -0.12951394186593695)]
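Putting the pieces together gives the summary table the question asks for. A minimal sketch, assuming the lrm and attrs defined above; note that pValues, tValues and coefficientStandardErrors carry one extra trailing entry for the intercept when an intercept is fit, so indexing by the feature idx stays aligned:

import pandas as pd

summary = lrm.summary
stats = pd.DataFrame({
    "feature": [name for _, name in attrs],
    "coefficient": [lrm.coefficients[idx] for idx, _ in attrs],
    "p_value": [summary.pValues[idx] for idx, _ in attrs],
    "t_value": [summary.tValues[idx] for idx, _ in attrs],
    "std_error": [summary.coefficientStandardErrors[idx] for idx, _ in attrs],
})
print(stats)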

You can see the actual order of the columns here:

# note: inspect the transformed DataFrame; the raw input has no "features" column
transformed.schema["features"].metadata["ml_attr"]["attrs"]

There will usually be two attribute groups, "binary" and "numeric":

import pandas as pd

pd.DataFrame(transformed.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
             + transformed.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")

This should give the exact order of all the columns.

Here's the one-line answer:

[x["name"] for x in sorted(train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["binary"]+
   train_downsampled.schema["all_features"].metadata["ml_attr"]["attrs"]["numeric"], 
   key=lambda x: x["idx"])]
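Applied to the pipeline from the first answer (substituting transformed for train_downsampled and "features" for "all_features"), this would read:

[x["name"] for x in sorted(
    transformed.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
    + transformed.schema["features"].metadata["ml_attr"]["attrs"]["numeric"],
    key=lambda x: x["idx"])]
# expected: ['x1_idx_enc_a', 'x1_idx_enc_c', 'x2_idx_enc_foo', 'x3', 'x4']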

Thanks to @pratiklodha for the core of this.

What is sc at the start? I want to replicate the same example in my environment, but it throws an error that sc is not defined.
