
Changing nested column names using SparklyR in R

I have referred to all the links mentioned here:

1) Link-1 2) Link-2 3) Link-3 4) Link-4

The following R code uses the sparklyr package. It reads a large JSON file and creates a database schema.

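library(sparklyr)
library(dplyr)
library(sparklyr.nested) # provides sdf_schema_viewer()

conf <- spark_config() # assumption: default configuration, adjust as needed
sc <- spark_connect(master = "local", config = conf, version = "2.2.0") # connection
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              memory = FALSE, overwrite = TRUE) # reads JSON file
sdf_schema_viewer(sample_tbl) # displays the schema (result not assigned, so sample_tbl is kept)
df <- tbl(sc, "example") # lazy reference to the registered table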

It created the following database schema:

[Image: database schema]

Renaming a first-level column works. For example:

df %>% rename(ent = entities)

But when I try to rename a second-level nested column, the rename fails:

df %>% rename(e_hashtags = entities.hashtags)

It shows this error:

Error in .f(.x[[i]], ...) : object 'entities.hashtags' not found

Question

How can I rename nested columns at the third or fourth level as well?

Please refer to the database schema shown above.

Spark itself doesn't support renaming individual nested fields. You have to either cast or rebuild the whole structure. For simplicity, let's assume the data looks as follows:

cat('{"contributors": "foo", "coordinates": "bar", "entities": {"hashtags": ["foo", "bar"], "media": "missing"}}',  file = "/tmp/example.json")
df <- spark_read_json(sc, "df", "/tmp/example.json", overwrite=TRUE)

df %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)

with this simple string representation:

df %>% 
  spark_dataframe() %>% 
  invoke("schema") %>% 
  invoke("simpleString") %>% 
  cat(sep = "\n")
struct<contributors:string,coordinates:string,entities:struct<hashtags:array<string>,media:string>>

With a cast, you have to define an expression using a matching type description, where only the field names differ:

expr_cast <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "CAST(entities AS struct<e_hashtags:array<string>,media:string>)"
)

df_cast <- df %>% 
  spark_dataframe() %>% 
  invoke("withColumn", "entities", expr_cast) %>% 
  sdf_register()

df_cast %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- e_hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)
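
For deeper nesting, the type description inside CAST gets long, so it may be easier to derive it from the existing simple string than to type it by hand. The following is only a sketch: rename_in_type is a hypothetical helper (not part of sparklyr), it renames every field with the given name at any depth, and it assumes the field name contains no regex metacharacters.

```r
# Hypothetical helper: rename a field inside a Spark "simpleString" type description.
rename_in_type <- function(type_string, old, new) {
  # Match the name only where it starts a "name:type" pair,
  # i.e. directly after "<" or "," and directly before ":".
  pattern <- paste0("(?<=[<,])", old, "(?=:)")
  gsub(pattern, new, type_string, perl = TRUE)
}

# Type of the `entities` column, copied from the simple string shown earlier.
entities_type <- "struct<hashtags:array<string>,media:string>"

cast_sql <- paste0(
  "CAST(entities AS ",
  rename_in_type(entities_type, "hashtags", "e_hashtags"),
  ")"
)
cat(cast_sql, "\n")
```

The resulting string can be passed to expr in place of the hand-written CAST expression. Because nested structs appear inline in the type description, the same substitution reaches third- and fourth-level fields.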

To rebuild the structure, you have to list all of its components, including the ones you are not renaming:

expr_struct <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "struct(entities.hashtags AS e_hashtags, entities.media)"
)

df_struct <- df %>% 
  spark_dataframe() %>% 
  invoke("withColumn", "entities", expr_struct) %>% 
  sdf_register()

df_struct %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()
root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = false)
 |    |-- e_hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)
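
The struct(...) expression can also be assembled programmatically, which helps when the field to rename sits three or four levels deep and every enclosing struct has to be rebuilt. This is a minimal sketch: struct_expr is a hypothetical helper, and the a.b.c path in the second example is an invented schema used only for illustration.

```r
# Hypothetical helper: build a Spark SQL struct(...) expression.
# `fields` is a named character vector: names are the output field names,
# values are the SQL expressions that produce them.
struct_expr <- function(fields) {
  parts <- paste0(unname(fields), " AS ", names(fields))
  paste0("struct(", paste(parts, collapse = ", "), ")")
}

# Same rename as above: entities.hashtags -> e_hashtags, media kept as is.
flat <- struct_expr(c(e_hashtags = "entities.hashtags", media = "entities.media"))

# Deeper nesting (invented schema a: struct<b: struct<c: ...>>):
# rebuild from the innermost struct outwards.
inner <- struct_expr(c(c_new = "a.b.c"))
outer <- struct_expr(c(b = inner))

cat(flat, outer, sep = "\n")
```

Here outer would replace column a through withColumn in the same way expr_struct replaces entities. Any sibling fields at each level must be listed explicitly, otherwise they are dropped from the rebuilt struct.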
