How to replace text in a column with the values of the columns named in that text
In PySpark, I'm trying to replace multiple text tokens in a column with the values of the columns whose names appear in the `calc` column (a formula).
So to be clear, here is an example:
Input:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |param_1-param_2
|Cell 3 |Cell 4 |param_2/param_1
Output needed:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |Cell 1-Cell 2
|Cell 3 |Cell 4 |Cell 4/Cell 3
In the `calc` column, the default value is a formula. It can be as simple as the ones above, or something like `2*(param_8-param_4)/param_2-(param_3/param_7)`. What I'm looking for is a way to substitute every `param_x` token with the value of the column of the same name.
I've tried many things, but nothing works; most of the time, when I use `replace` or `regexp_replace` with a column as the replacement value, I get the error "Column is not iterable".
Moreover, the columns `param_1`, `param_2`, ..., `param_x` are generated dynamically, and the `calc` values may reference some of these columns, but not necessarily all of them.
Could you help me with a dynamic solution?
Thank you so much. Best regards.
Update: it turned out I misunderstood the requirement. This would work:
    from pyspark.sql import functions as F

    # For every column in the DataFrame, replace occurrences of its name
    # inside `calc` with the column's value
    for col in df.schema.names:
        df = df.withColumn("calc", F.expr(f"regexp_replace(calc, '{col}', {col})"))
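To see what the loop above does per row, here is a minimal pure-Python sketch of the same substitution logic using `re.sub` (the data mirrors the example tables; no Spark session is needed for the illustration). The `\b` word boundaries are an addition not present in the original snippet: they guard against partial matches, e.g. `param_1` matching inside `param_10`.

```python
import re

# Sample rows mirroring the example input table
rows = [
    {"param_1": "Cell 1", "param_2": "Cell 2", "calc": "param_1-param_2"},
    {"param_1": "Cell 3", "param_2": "Cell 4", "calc": "param_2/param_1"},
]

def substitute(row):
    """Replace each column-name token in `calc` with that column's value."""
    calc = row["calc"]
    for name, value in row.items():
        if name == "calc":
            continue
        # \b avoids partial matches, e.g. param_1 inside param_10
        calc = re.sub(rf"\b{re.escape(name)}\b", value, calc)
    return calc

print([substitute(r) for r in rows])
# prints ['Cell 1-Cell 2', 'Cell 4/Cell 3']
```

The same boundary trick carries over to the Spark version by using `regexp_replace(calc, '\\b{col}\\b', {col})`.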
------- Keeping the below section for a while just for reference -------
You can't do that directly: you can't use a column's value in the replacement expression unless you collect it into a Python object (which is generally not recommended).
The following achieves the same result:
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    df = spark.createDataFrame(
        [["1", "2", "param_1 - param_2"], ["3", "4", "2*param_1 + param_2"]]
    ).toDF("param_1", "param_2", "calc")
    df.show()

    # Assign a row number so each row's formula can be matched back to it
    df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.lit("dummy"))))

    # Collect row_num -> formula into a Python dict
    as_dict = {row["row_num"]: row["calc"] for row in df.select("row_num", "calc").collect()}

    # Build a single CASE expression that evaluates the right formula per row
    expression = "CASE " + " ".join(
        f"WHEN row_num = '{k}' THEN ({v})" for k, v in as_dict.items()
    ) + " ELSE NULL END"

    df.withColumn("Result", F.expr(expression)).show()
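To make the generated SQL concrete, here is a sketch of the CASE expression being built, in plain Python so it runs without a Spark session (the dictionary values are the formulas from the sample DataFrame above):

```python
# row_num -> formula, as collected from the sample DataFrame
as_dict = {1: "param_1 - param_2", 2: "2*param_1 + param_2"}

# Same string-building logic as in the snippet above
expression = "CASE " + " ".join(
    f"WHEN row_num = '{k}' THEN ({v})" for k, v in as_dict.items()
) + " ELSE NULL END"

print(expression)
# prints: CASE WHEN row_num = '1' THEN (param_1 - param_2)
#              WHEN row_num = '2' THEN (2*param_1 + param_2) ELSE NULL END
```

Spark then evaluates this one expression per row, so each row picks up its own formula.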