
How to query/extract array elements from within a pyspark dataframe

I am trying to build a data frame that incorporates data stored in an array column. Each array element is an [index, value] pair, and not every index is present in every row of the data frame. Here is what the schema looks like:

root
|-- visitNumber: string (nullable = true)
|-- visitId: string (nullable = true)
|-- customDimensions: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- index: string (nullable = true)
|    |    |-- value: string (nullable = true)

There are many other columns, but those are not involved in the question.

Here is an example of the data held in the customDimensions array:

[[1, ],[2, ],[3,"apple"],[6,"1-111-32"],[42, ],[5, ]]

What I am trying to accomplish is to incorporate columns that hold the value for a particular index. For instance:

df = df.withColumn("index6", *stuff to get the value at index 6*)

This would need to be repeatable, since 'customDimensions' holds values at several indexes that we want to "flatten" out and express as separate columns.

This is achievable using a udf. The following should work.

# import dependencies, start spark-session
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, lit
spark = SparkSession.builder.appName('my-app').master('local[2]').getOrCreate()

# data preparation
df = spark.createDataFrame(
        [
            (1, 'id-1', [['1'], ['2'], ['3', 'apple'], ['6', '1-111-32'], ['42'], ['5']]),
            (2, 'id-2', [['1'], ['2'], ['3', 'apple'], ['6'],             ['42'], ['5']]),
            (3, 'id-3', [['1'], ['2'], ['3', 'apple'], ['6', '2-111-32'], ['42'], ['5']]),
        ]
        , schema=(
            'visitNumber', 'visitId', 'customDimensions'))

df.show(3, False)
+-----------+-------+------------------------------------------------+
|visitNumber|visitId|customDimensions                                |
+-----------+-------+------------------------------------------------+
|1          |id-1   |[[1], [2], [3, apple], [6, 1-111-32], [42], [5]]|
|2          |id-2   |[[1], [2], [3, apple], [6], [42], [5]]          |
|3          |id-3   |[[1], [2], [3, apple], [6, 2-111-32], [42], [5]]|
+-----------+-------+------------------------------------------------+

Now, let's prepare the udf:

def user_func(x, y):
    # Collect the value from any [index, value] pair whose index matches y;
    # pairs without a value (length 1) are skipped.
    elem_set = {elem[1] if elem[0] == y and len(elem) == 2 else None for elem in x} - {None}
    return None if not elem_set else elem_set.pop()

user_func_udf = udf(lambda x, y: user_func(x, y))
# Registration is only needed if you also want to call the udf from SQL.
spark.udf.register("user_func_udf", user_func_udf)
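Before wiring this into Spark, the lookup logic can be sanity-checked in plain Python, since `user_func` itself has no Spark dependency (the sample row below mirrors the toy data above):

```python
def user_func(x, y):
    # Collect the value from any [index, value] pair whose index matches y.
    elem_set = {elem[1] if elem[0] == y and len(elem) == 2 else None for elem in x} - {None}
    return None if not elem_set else elem_set.pop()

row = [['1'], ['2'], ['3', 'apple'], ['6', '1-111-32'], ['42'], ['5']]
print(user_func(row, '6'))  # prints: 1-111-32
print(user_func(row, '9'))  # no such index, prints: None
```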

Assuming we need a column for fetching the value at index 6, the following should work.

df_2 = df.withColumn('valAtIndex6', user_func_udf(df.customDimensions, lit('6')))
df_2.show(3, False)
+-----------+-------+------------------------------------------------+-----------+
|visitNumber|visitId|customDimensions                                |valAtIndex6|
+-----------+-------+------------------------------------------------+-----------+
|1          |id-1   |[[1], [2], [3, apple], [6, 1-111-32], [42], [5]]|1-111-32   |
|2          |id-2   |[[1], [2], [3, apple], [6], [42], [5]]          |null       |
|3          |id-3   |[[1], [2], [3, apple], [6, 2-111-32], [42], [5]]|2-111-32   |
+-----------+-------+------------------------------------------------+-----------+
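On Spark 2.4+, the same lookup can also be done without a Python UDF, using the built-in higher-order function `filter` together with `element_at` (which returns NULL for an out-of-range index when ANSI mode is off). A sketch, with `index_lookup_expr` being a hypothetical helper name:

```python
def index_lookup_expr(col_name, idx):
    # Builds a Spark SQL expression: keep the [index, value] pairs whose
    # first slot matches idx, take the first match, and return its second slot.
    return (f"element_at(filter({col_name}, "
            f"x -> x[0] = '{idx}' AND size(x) = 2), 1)[1]")

# Usage (requires an active SparkSession and the df from above):
# from pyspark.sql import functions as F
# df_2 = df.withColumn('valAtIndex6', F.expr(index_lookup_expr('customDimensions', 6)))
```

With the struct-typed schema from the question (elements with `index` and `value` fields) the lambda would instead be `x -> x.index = '6'`, returning `.value` of the first match.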
