sparklyr spark_read_parquet reads string fields as lists

Question

I have many Hive files stored in Parquet format that contain string and double columns. I can read most of them into a Spark DataFrame with sparklyr using the following syntax:

spark_read_parquet(sc, name = "name", path = "path", memory = FALSE)

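However, for one file, all of the string values are converted to unrecognizable lists. After collecting into an R data frame and printing, they look like this: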

s_df <- spark_read_parquet(sc, 
                           name = "s_df", 
                           path = "hdfs://nameservice1/user/hive/warehouse/s_df", 
                           memory = FALSE)
df <- collect(s_df)
head(df)

# A tibble: 11,081 x 13
   provid   hospital_name servcode  servcode_desc codegroup claimid  amountpaid
   <list>   <list>        <list>    <list>        <list>    <list>        <dbl>
 1 <raw [8… <raw [32]>    <raw [5]> <raw [25]>    <raw [29… <raw [1…       7.41
 2 <raw [8… <raw [32]>    <raw [5]> <raw [15]>    <raw [22… <raw [1…       4.93
 3 <raw [8… <raw [32]>    <raw [5]> <raw [28]>    <raw [22… <raw [1…       5.36
 4 <raw [8… <raw [32]>    <raw [5]> <raw [28]>    <raw [30… <raw [1…       5.46
 5 <raw [8… <raw [32]>    <raw [5]> <raw [16]>    <raw [30… <raw [1…       2.80

The hospital_name in the first 5 rows of df should read METHODIST HOSPITAL OF SOUTHERN CALIFORNIA, but instead it comes out like this:

head(df$hospital_name)

[[1]]
 [1] 48 45 4e 52 59 20 4d 41 59 4f 20 4e 45 57 48 41 4c 4c 20 4d 45 4d 4f 52 49
[26] 41 4c 20 48 4f 53 50

[[2]]
 [1] 48 45 4e 52 59 20 4d 41 59 4f 20 4e 45 57 48 41 4c 4c 20 4d 45 4d 4f 52 49
[26] 41 4c 20 48 4f 53 50

[[3]]
 [1] 48 45 4e 52 59 20 4d 41 59 4f 20 4e 45 57 48 41 4c 4c 20 4d 45 4d 4f 52 49
[26] 41 4c 20 48 4f 53 50

[[4]]
 [1] 48 45 4e 52 59 20 4d 41 59 4f 20 4e 45 57 48 41 4c 4c 20 4d 45 4d 4f 52 49
[26] 41 4c 20 48 4f 53 50

[[5]]
 [1] 48 45 4e 52 59 20 4d 41 59 4f 20 4e 45 57 48 41 4c 4c 20 4d 45 4d 4f 52 49
[26] 41 4c 20 48 4f 53 50

I tried the following solution, without success:

head(df %>% mutate(hospital_name = as.character(hospital_name)))

[1] "as.raw(c(0x48, 0x45, 0x4e, 0x52, 0x59, 0x20, 0x4d, 0x41, 0x59, 0x4f, 0x20, 0x4e, 0x45, 0x57, 0x48, 0x41, 0x4c, 0x4c, 0x20, 0x4d, 0x45, 0x4d, 0x4f, 0x52, 0x49, 0x41, 0x4c, 0x20, 0x48, 0x4f, 0x53, 0x50))"
[2] "as.raw(c(0x48, 0x45, 0x4e, 0x52, 0x59, 0x20, 0x4d, 0x41, 0x59, 0x4f, 0x20, 0x4e, 0x45, 0x57, 0x48, 0x41, 0x4c, 0x4c, 0x20, 0x4d, 0x45, 0x4d, 0x4f, 0x52, 0x49, 0x41, 0x4c, 0x20, 0x48, 0x4f, 0x53, 0x50))"
[3] "as.raw(c(0x48, 0x45, 0x4e, 0x52, 0x59, 0x20, 0x4d, 0x41, 0x59, 0x4f, 0x20, 0x4e, 0x45, 0x57, 0x48, 0x41, 0x4c, 0x4c, 0x20, 0x4d, 0x45, 0x4d, 0x4f, 0x52, 0x49, 0x41, 0x4c, 0x20, 0x48, 0x4f, 0x53, 0x50))"
[4] "as.raw(c(0x48, 0x45, 0x4e, 0x52, 0x59, 0x20, 0x4d, 0x41, 0x59, 0x4f, 0x20, 0x4e, 0x45, 0x57, 0x48, 0x41, 0x4c, 0x4c, 0x20, 0x4d, 0x45, 0x4d, 0x4f, 0x52, 0x49, 0x41, 0x4c, 0x20, 0x48, 0x4f, 0x53, 0x50))"
[5] "as.raw(c(0x48, 0x45, 0x4e, 0x52, 0x59, 0x20, 0x4d, 0x41, 0x59, 0x4f, 0x20, 0x4e, 0x45, 0x57, 0x48, 0x41, 0x4c, 0x4c, 0x20, 0x4d, 0x45, 0x4d, 0x4f, 0x52, 0x49, 0x41, 0x4c, 0x20, 0x48, 0x4f, 0x53, 0x50))"

I would appreciate any help resolving this issue, or any suggestions for making my question clearer. Thank you.

Answer 1

A reprex would be nice (just for df), for example by running dput(head(df)) and pasting the result here. Try the following:

df %>% mutate(hospital_name = unlist(lapply(hospital_name, function(e) rawToChar(e))))
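The same idea can be applied to every affected column at once. The following is a minimal base R sketch, not the answerer's code: it builds a small stand-in data frame (the bytes are copied from the as.character() output in the question, and the name is_raw_col is hypothetical), finds list columns made up entirely of raw vectors, and decodes them with rawToChar. It assumes no element is NULL or missing; rawToChar would fail on such entries.

```r
# Stand-in for the collected data: one list column of raw vectors.
# The bytes are copied from the question's as.character() output.
bytes <- as.raw(c(0x48, 0x45, 0x4e, 0x52, 0x59, 0x20, 0x4d, 0x41, 0x59, 0x4f,
                  0x20, 0x4e, 0x45, 0x57, 0x48, 0x41, 0x4c, 0x4c, 0x20, 0x4d,
                  0x45, 0x4d, 0x4f, 0x52, 0x49, 0x41, 0x4c, 0x20, 0x48, 0x4f,
                  0x53, 0x50))
df <- data.frame(amountpaid = c(7.41, 4.93))
df$hospital_name <- list(bytes, bytes)

# Flag list columns whose elements are all raw vectors.
is_raw_col <- vapply(
  df,
  function(col) is.list(col) && all(vapply(col, is.raw, logical(1))),
  logical(1)
)

# Decode each raw vector to a string; vapply() enforces one string per row.
df[is_raw_col] <- lapply(
  df[is_raw_col],
  function(col) vapply(col, rawToChar, character(1))
)

str(df)
```

Using vapply with character(1) instead of unlist(lapply(...)) raises an error if any element does not decode to exactly one string, rather than silently returning a vector of the wrong length.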

Answer 2

To resolve the issue, set the spark.sql.parquet.binaryAsString property in the Spark session configuration before reading the Parquet file:

sc$config$spark.sql.parquet.binaryAsString = TRUE
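If modifying sc$config on an existing connection has no effect in your setup, one alternative is to set the property on a spark_config() object before connecting. This is a sketch under assumptions, not part of the original answer: the master value "yarn-client" is a placeholder, and the path is the one from the question.

```r
library(sparklyr)

# Assumption: set the property on the config object before connecting.
config <- spark_config()
config$spark.sql.parquet.binaryAsString <- TRUE

# Assumption: "yarn-client" is a placeholder; use your cluster's master.
sc <- spark_connect(master = "yarn-client", config = config)

s_df <- spark_read_parquet(sc,
                           name = "s_df",
                           path = "hdfs://nameservice1/user/hive/warehouse/s_df",
                           memory = FALSE)

# Inspect the Spark-side schema to check how the former binary columns
# are typed after the setting is applied.
sdf_schema(s_df)
```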

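Note: in my case, it turned out that the Parquet files created by inserts in Impala contained character fields described as "binary" rather than "binary UTF8". In that situation, another solution is to set PARQUET_ANNOTATE_STRINGS_UTF8 in impala-shell before inserting the data: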

> set PARQUET_ANNOTATE_STRINGS_UTF8=1;
PARQUET_ANNOTATE_STRINGS_UTF8 set to 1

2 answers

Answer 1: 1, accepted, 2018-03-15 09:46:04

Answer 2: 0, 2018-07-03 13:57:36