
PySpark: create column based on value and dictionary in columns

I have a PySpark dataframe with values and dictionaries that provide a textual mapping for the values. Not every row has the same dictionary, and the values can vary too.

| value    | dict                                           | 
| -------- | ---------------------------------------------- |
| 1        | {"1": "Text A", "2": "Text B"}                 |
| 2        | {"1": "Text A", "2": "Text B"}                 |
| 0        | {"0": "Another text A", "1": "Another text B"} |

I want to make a "status" column that contains the right mapping.


| value    | dict                                           | status         |
| -------- | ---------------------------------------------- | -------------- |
| 1        | {"1": "Text A", "2": "Text B"}                 | Text A         |
| 2        | {"1": "Text A", "2": "Text B"}                 | Text B         |
| 0        | {"0": "Another text A", "1": "Another text B"} | Another text A |

I have tried this code:

df.withColumn("status", F.col("dict").getItem(F.col("value")))

This code does not work. With a hard-coded value, like "2", the same code does produce an output, but of course not the right one:

df.withColumn("status", F.col("dict").getItem("2"))

Could someone help me get the right mapped value in the status column?

EDIT: my code did work, except that my "value" column was a double while the keys in "dict" are strings. After casting the column from double to int to string, the code works.
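For reference, a minimal sketch of that fix (assuming "dict" is a MapType column with string keys, as the getItem calls above suggest):

import pyspark.sql.functions as F

# cast double -> int -> string so the lookup key matches the map's string keys
df = df.withColumn(
    "status",
    F.col("dict").getItem(F.col("value").cast("int").cast("string"))
)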

Hope this helps.

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, col, udf
import json


if __name__ == '__main__':
    spark = SparkSession.builder.appName('Medium').master('local[1]').getOrCreate()
    # "value" and "dict" both arrive as plain strings from the CSV
    df = spark.read.format('csv').option("header", "true").option("delimiter", "|").load("/Users/dshanmugam/Desktop/ss.csv")

    def return_value(data):
        # "data" looks like '1-{"1": "Text A", "2": "Text B"}':
        # the key sits before the first "-", the JSON dictionary after it
        key = data.split('-')[0]
        value = json.loads(data.split('-')[1])[key]
        return value

    returnVal = udf(return_value)

    # glue value and dict together so the UDF receives both in a single column
    df_new = df.withColumn("newCol", concat_ws("-", col("value"), col("dict"))) \
        .withColumn("result", returnVal(col("newCol")))
    df_new.select(["value", "result"]).show(10, False)

Result:

+-----+--------------+
|value|result        |
+-----+--------------+
|1    |Text A        |
|2    |Text B        |
|0    |Another text A|
+-----+--------------+

I am using a UDF. You can try some other options if performance is a concern.
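For example, one non-UDF sketch (my own assumption, not part of the original answer): parse the JSON string into a map with from_json and index it natively. Since the CSV read leaves "value" as a string, no cast is needed here:

from pyspark.sql import functions as F

df_native = df.withColumn(
    "result",
    # parse the JSON text into a map<string,string>, then look it up by "value"
    F.from_json(F.col("dict"), "map<string,string>").getItem(F.col("value"))
)
df_native.select("value", "result").show(10, False)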

Here are my 2 cents

  1. Create the dataframe by reading from CSV or any other source (in my case it is just static data):

     from pyspark.sql.types import StructType, StructField, StringType, MapType

     data = [
         ("1", {"1": "Text A", "2": "Text B"}),
         ("2", {"1": "Text A", "2": "Text B"}),
         ("0", {"0": "Another text A", "1": "Another text B"}),
     ]
     schema = StructType([
         StructField("ID", StringType(), True),
         StructField("Dictionary", MapType(StringType(), StringType()), True),
     ])
     df = spark.createDataFrame(data, schema=schema)
     df.show(truncate=False)
  2. Then directly extract the dictionary value, using the ID as the key:

     df.withColumn('extract', df.Dictionary[df.ID]).show(truncate=False)
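With the sample data above, the show() output should look roughly like this (a reconstruction of the screenshot that accompanied the original answer):

+---+------------------------------------------+--------------+
|ID |Dictionary                                |extract       |
+---+------------------------------------------+--------------+
|1  |{1 -> Text A, 2 -> Text B}                |Text A        |
|2  |{1 -> Text A, 2 -> Text B}                |Text B        |
|0  |{0 -> Another text A, 1 -> Another text B}|Another text A|
+---+------------------------------------------+--------------+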

