Extract multiple columns from list of lists, and save in data.frame

I have the following list:

library(rjson)
j <- fromJSON(file='https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')

I am interested in extracting data from the component j$response$docs, which is a list of lists. The inner lists are all supposed to have the same names.

I want to save the output to a data.frame() or tibble().

The code below works and gives the desired output for the few selected variables:

nmod <- length(j$response$docs)
for (i in 1:nmod) {
    #select one list at a time
    j1 <- j$response$docs[[i]]
    tmp <- data.frame(variable=j1$variable,
                        variable_long_name=j1$variable_long_name,
                        rcm_name=j1$rcm_name,
                        driving_model=j1$driving_model,
                        cf_standard_name=j1$cf_standard_name
                        )
    #join them
    if (i==1) {
        d <- tmp
    } else {
        d <- rbind(d, tmp)
    }
}

However, I'd like to know whether there is a more elegant and efficient way, maybe using tidyr, dplyr or purrr, that would also let me select all "columns" instead of just the few selected above.

You can do it with help from the purrr package. I thought at_depth might work here, but I ended up using nested map_df instead.

library(purrr)

Your variables have different lengths, so the first thing to do is make sure each variable has length 1. This can be done by collapsing each element of the inner list with paste. I used a comma as the separator. Doing this via map_df returns a 1-row tibble.

Here's an example with the first inner list.

map_df(j$response$docs[[1]], paste, collapse = ",")
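
To see what the collapsing step does without downloading anything, here is a minimal sketch on a made-up inner list. The object doc and its values are hypothetical stand-ins for j$response$docs[[1]]. Note that map_df relies on dplyr being installed for the binding step.

```r
library(purrr)

# Hypothetical stand-in for one inner list (values are made up)
doc <- list(
  id = "dataset-1",
  variable = "clh",
  experiment_family = c("RCP", "All")  # length 2, so it needs collapsing
)

# paste(x, collapse = ",") turns each element into a single string,
# and map_df assembles the results into a 1-row tibble
one_row <- map_df(doc, paste, collapse = ",")
one_row
```

Elements that already have length 1 pass through unchanged; only the multi-valued ones end up as comma-separated strings.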

Now we can loop through the outer list, making a 1-row tibble for each inner list. We use map_df to bind these together. The output is an 832-row tibble, one row per list. I used the .id argument to add a grouping variable to the result, which may not be needed.

d1 = map_df(j$response$docs, ~map_df(.x, paste, collapse = ","), .id = "group")
d1

# A tibble: 832 × 45
   group                                                                                                   id  version
   <chr>                                                                                                <chr>    <chr>
1      1   cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clh.v20131119|cordexesg.dmi.dk 20131119
2      2 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clivi.v20131119|cordexesg.dmi.dk 20131119
3      3  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsds.v20131119|cordexesg.dmi.dk 20131119
4      4  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlds.v20131119|cordexesg.dmi.dk 20131119
5      5  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsus.v20131119|cordexesg.dmi.dk 20131119
6      6  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlus.v20131119|cordexesg.dmi.dk 20131119
7      7  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsdt.v20131119|cordexesg.dmi.dk 20131119
8      8  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsut.v20131119|cordexesg.dmi.dk 20131119
9      9  cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlut.v20131119|cordexesg.dmi.dk 20131119
10    10   cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.psl.v20131119|cordexesg.dmi.dk 20131119
# ... with 822 more rows, and 42 more variables:
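
The nested call can be tried on a small example as well. This is only a sketch: docs is a hypothetical two-element stand-in for j$response$docs, and "group" is simply the column name chosen for the .id argument.

```r
library(purrr)

# Hypothetical stand-in for j$response$docs: inner lists share the same names
docs <- list(
  list(id = "dataset-1", variable = "clh",   experiment_family = c("RCP", "All")),
  list(id = "dataset-2", variable = "clivi", experiment_family = c("RCP", "All"))
)

# Inner map_df: collapse every element to length 1, giving a 1-row tibble
# Outer map_df: bind the 1-row tibbles; .id records each list's position
toy <- map_df(docs, ~ map_df(.x, paste, collapse = ","), .id = "group")
toy
```

In purrr 1.0.0 and later, map_df is marked as superseded in favour of map() followed by list_rbind(), but it still works as shown.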

If you want multiple rows for the variables that had length greater than 1, such as access and experiment_family, you can use tidyr::separate_rows to split the data onto multiple rows.

tidyr::separate_rows(d1, experiment_family)
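
As a small illustration of what separate_rows does to a collapsed column, here is a sketch on made-up data. The data frame toy_collapsed and its values are hypothetical. The explicit sep = "," is an addition here: the default separator matches runs of non-alphanumeric characters, so values containing spaces would be split as well.

```r
library(tidyr)

# Hypothetical collapsed result: one row per dataset, comma-separated values
toy_collapsed <- data.frame(
  id = c("dataset-1", "dataset-2"),
  experiment_family = c("RCP,All", "RCP"),
  stringsAsFactors = FALSE
)

# Split on commas only; each value gets its own row and id is repeated
long <- separate_rows(toy_collapsed, experiment_family, sep = ",")
long
```

The first dataset, which had two comma-separated values, now occupies two rows, while the second keeps its single row.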

Instead of rjson, go with jsonlite:

library(jsonlite)
j <- jsonlite::fromJSON('https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')

# The names you want to find in the nested returned data
look_for <- c('variable','variable_long_name' ,
              'rcm_name','driving_model',
              'cf_standard_name')


new_df <- as.data.frame(sapply(look_for, function(i){
  unlist(j$response$docs[[i]])
}))

str(new_df)
'data.frame':   832 obs. of  5 variables:
$ variable          : chr  "clh" "clivi" "rsds" "rlds" ...
$ variable_long_name: chr  "High Level Cloud Fraction" "Ice Water Path" "Surface Downwelling Shortwave Radiation" "Surface Downwelling Longwave Radiation" ...
$ rcm_name          : chr  "HIRHAM5" "HIRHAM5" "HIRHAM5" "HIRHAM5" ...
$ driving_model     : chr  "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" ...
$ cf_standard_name  : chr  "cloud_area_fraction_in_atmosphere_layer" "atmosphere_cloud_ice_content" "surface_downwelling_shortwave_flux_in_air" "surface_downwelling_longwave_flux_in_air" ...
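
The unlist step relies on each selected field holding exactly one value per dataset. If a multi-valued field such as experiment_family were included, the columns would no longer line up. A possible variant, sketched on a made-up JSON string of the same shape, collapses each cell first. The JSON text and the helper collapse_col are hypothetical and not part of the answer above.

```r
library(jsonlite)

# Hypothetical JSON with the same shape as the search response (values made up)
txt <- '{"response": {"docs": [
  {"variable": ["clh"],   "rcm_name": ["HIRHAM5"], "experiment_family": ["RCP", "All"]},
  {"variable": ["clivi"], "rcm_name": ["HIRHAM5"], "experiment_family": ["RCP", "All"]}
]}}'

# jsonlite simplifies the array of records into a data frame
docs <- fromJSON(txt)$response$docs

look_for <- c("variable", "rcm_name", "experiment_family")

# Collapse each cell to a single string so every column has one value per row
collapse_col <- function(col) vapply(col, paste, character(1), collapse = ",")

new_df <- as.data.frame(lapply(docs[look_for], collapse_col),
                        stringsAsFactors = FALSE)
str(new_df)
```

Setting look_for <- names(docs) would apply the same collapsing to every field rather than a chosen few.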
