
How to query/extract array elements from within a pyspark dataframe

I am trying to build a data frame that incorporates data stored in an array column. Each array element is an [index, value] pair, and not every index is present in every row of the data frame. Here is what the schema looks like:

root
|-- visitNumber: string (nullable = true)
|-- visitId: string (nullable = true)
|-- customDimensions: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- index: string (nullable = true)
|    |    |-- value: string (nullable = true)

There are many other columns, but those are not involved in the question.

Here is an example of the data held in the customDimensions array:

[[1, ],[2, ],[3,"apple"],[6,"1-111-32"],[42, ],[5, ]]

What I am trying to accomplish is to incorporate columns that hold the value for a particular index. For instance:

df = df.withColumn("index6", *stuff to get the value at index 6*)

This would need to be repeatable, since 'customDimensions' holds values at several indexes that we want to "flatten" out and express as separate columns.

This is achievable using a udf. The following should work.

# import dependencies, start spark-session
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, lit
spark = SparkSession.builder.appName('my-app').master('local[2]').getOrCreate()

# data preparation
df = spark.createDataFrame(
        [
            (1, 'id-1', [['1'], ['2'], ['3', 'apple'], ['6', '1-111-32'], ['42'], ['5']]),
            (2, 'id-2', [['1'], ['2'], ['3', 'apple'], ['6'],             ['42'], ['5']]),
            (3, 'id-3', [['1'], ['2'], ['3', 'apple'], ['6', '2-111-32'], ['42'], ['5']]),
        ]
        , schema=(
            'visitNumber', 'visitId', 'customDimensions'))

df.show(3, False)
+-----------+-------+------------------------------------------------+
|visitNumber|visitId|customDimensions                                |
+-----------+-------+------------------------------------------------+
|1          |id-1   |[[1], [2], [3, apple], [6, 1-111-32], [42], [5]]|
|2          |id-2   |[[1], [2], [3, apple], [6], [42], [5]]          |
|3          |id-3   |[[1], [2], [3, apple], [6, 2-111-32], [42], [5]]|
+-----------+-------+------------------------------------------------+

Now, let's prepare the udf:

def user_func(x, y):
    # Collect the value from any [index, value] pair whose index matches y;
    # pairs without a value (length 1) are skipped.
    elem_set = {elem[1] if elem[0] == y and len(elem) == 2 else None for elem in x} - {None}
    return None if not elem_set else elem_set.pop()

user_func_udf = udf(lambda x, y: user_func(x, y))
# Registration is only needed if you also want to call the udf from SQL.
spark.udf.register("user_func_udf", user_func_udf)
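Before wiring this into Spark, the lookup logic can be sanity-checked in plain Python, since `user_func` itself has no Spark dependency (the sample row below mirrors the toy data above):

```python
def user_func(x, y):
    # Collect the value from any [index, value] pair whose index matches y.
    elem_set = {elem[1] if elem[0] == y and len(elem) == 2 else None for elem in x} - {None}
    return None if not elem_set else elem_set.pop()

row = [['1'], ['2'], ['3', 'apple'], ['6', '1-111-32'], ['42'], ['5']]
print(user_func(row, '6'))  # prints: 1-111-32
print(user_func(row, '9'))  # no such index, prints: None
```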

Assuming we need a column for fetching the value at index 6, the following should work.

df_2 = df.withColumn('valAtIndex6', user_func_udf(df.customDimensions, lit('6')))
df_2.show(3, False)
+-----------+-------+------------------------------------------------+-----------+
|visitNumber|visitId|customDimensions                                |valAtIndex6|
+-----------+-------+------------------------------------------------+-----------+
|1          |id-1   |[[1], [2], [3, apple], [6, 1-111-32], [42], [5]]|1-111-32   |
|2          |id-2   |[[1], [2], [3, apple], [6], [42], [5]]          |null       |
|3          |id-3   |[[1], [2], [3, apple], [6, 2-111-32], [42], [5]]|2-111-32   |
+-----------+-------+------------------------------------------------+-----------+
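On Spark 2.4+, the same lookup can also be done without a Python UDF, using the built-in higher-order function `filter` together with `element_at` (which returns NULL for an out-of-range index when ANSI mode is off). A sketch, with `index_lookup_expr` being a hypothetical helper name:

```python
def index_lookup_expr(col_name, idx):
    # Builds a Spark SQL expression: keep the [index, value] pairs whose
    # first slot matches idx, take the first match, and return its second slot.
    return (f"element_at(filter({col_name}, "
            f"x -> x[0] = '{idx}' AND size(x) = 2), 1)[1]")

# Usage (requires an active SparkSession and the df from above):
# from pyspark.sql import functions as F
# df_2 = df.withColumn('valAtIndex6', F.expr(index_lookup_expr('customDimensions', 6)))
```

With the struct-typed schema from the question (elements with `index` and `value` fields) the lambda would instead be `x -> x.index = '6'`, returning `.value` of the first match.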
