R: Spread a data frame based on row names
I have a data frame with two columns. The row names (the values in Column1) are duplicated because the data comes from a list of reports that share some common fields. Each report contains a different number of fields. I want to spread this data frame into multiple columns based on one of these duplicated row names, so that the end result has one report per row.
These reports come from an API on a system at work. It returns deeply nested JSON. I wanted to see whether getting the data into this format would give me a way to clean it up.
Minimal example of data
Column1 Column2
contentID 123
value1 California
value2 truck
value3 home
contentID 897
value1 Georgia
value2 car
value3 work
value4 boeing
contentID 537
value2 truck
value4 private
value5 first class
value6 wheels
Desired outcome
ContentID value1 value2 value3 value4 value5 value6
123 California truck home NA NA NA
897 Georgia car work boeing NA NA
537 NA truck NA private firstclass wheels
One tidyverse possibility could be:
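library(dplyr)
library(tidyr)
df %>%
  # running count of "content" rows: one id per report
  mutate(id = cumsum(grepl("content", Column1))) %>%
  group_by(id) %>%
  # the first Column2 value in each group is the report's content ID
  mutate(ContentID = first(Column2)) %>%
  # drop the rows that only carried the content ID
  filter(!grepl("content", Column1)) %>%
  ungroup() %>%
  select(-id) %>%
  spread(Column1, Column2)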
ContentID value1 value2 value3 value4 value5 value6
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 123 California truck home <NA> <NA> <NA>
2 537 <NA> truck <NA> private first_class wheels
3 897 Georgia car work boeing <NA> <NA>
Here, it first creates an ID variable based on the occurrence of "content" in "Column1" and groups by it. Second, it creates a "ContentID" variable with the value from the first row of "Column2" in each group. Third, it filters out the rows that contain "content" in "Column1". Finally, it spreads the data.
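For a version that can be run end to end, here is a minimal sketch of the same four steps. It makes two assumptions that are not part of the answer above: the data frame `df` is rebuilt by hand from the example in the question, and `pivot_wider()` (the newer tidyr function that supersedes `spread()`) is swapped in for the final step. The result name `wide` is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Assumption: example data typed in from the question, all values as character
df <- data.frame(
  Column1 = c("contentID", "value1", "value2", "value3",
              "contentID", "value1", "value2", "value3", "value4",
              "contentID", "value2", "value4", "value5", "value6"),
  Column2 = c("123", "California", "truck", "home",
              "897", "Georgia", "car", "work", "boeing",
              "537", "truck", "private", "first class", "wheels"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # Step 1: every "contentID" row starts a new report, so count them cumulatively
  mutate(id = cumsum(Column1 == "contentID")) %>%
  group_by(id) %>%
  # Step 2: the first Column2 value of each report is its content ID
  mutate(ContentID = first(Column2)) %>%
  ungroup() %>%
  # Step 3: drop the rows that only carried the content ID
  filter(Column1 != "contentID") %>%
  select(-id) %>%
  # Step 4: one column per field name; missing fields become NA
  pivot_wider(names_from = Column1, values_from = Column2)

print(wide)
```

Matching on `Column1 == "contentID"` instead of `grepl("content", Column1)` is a slightly stricter choice; it avoids picking up other field names that happen to contain "content".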
You can simply do this:
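library(data.table)
library(zoo)  # for na.locf()
setDT(dt)  # dt holds the data from the question; converted to a data.table by reference
# put the report ID on each "contentID" row and NA everywhere else
dt[, id := ifelse(Column1 %like% "contentID", paste(Column2), NA)]
# carry the last seen ID forward to fill the NAs
dt[, id := na.locf(id)]
# reshape to wide format, leaving out the "contentID" rows
dcast(dt, id ~ Column1, value.var = "Column2", subset = .(Column1 != "contentID"))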
id value1 value2 value3 value4 value5 value6
1: 123 California truck home <NA> <NA> <NA>
2: 537 <NA> truck <NA> private firstclass wheels
3: 897 Georgia car work boeing <NA> <NA>
Note: this approach should be efficient if you have a large dataset.
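If you would rather not depend on zoo, the ID can also be filled in with data.table alone. The following is only a sketch under assumptions: the table `dt` is rebuilt by hand from the example in the question, and the helper column `grp` and the result name `wide_dt` are illustrative names that do not appear in the answer above.

```r
library(data.table)

# Assumption: example data typed in from the question, all values as character
dt <- data.table(
  Column1 = c("contentID", "value1", "value2", "value3",
              "contentID", "value1", "value2", "value3", "value4",
              "contentID", "value2", "value4", "value5", "value6"),
  Column2 = c("123", "California", "truck", "home",
              "897", "Georgia", "car", "work", "boeing",
              "537", "truck", "private", "first class", "wheels")
)

# A running count of "contentID" rows gives each report its own group number
dt[, grp := cumsum(Column1 == "contentID")]

# Within each group, the first Column2 value is the report's ID
dt[, id := Column2[1L], by = grp]

# Reshape to wide format without the "contentID" rows; missing fields become NA
wide_dt <- dcast(dt[Column1 != "contentID"], id ~ Column1, value.var = "Column2")

print(wide_dt)
```

This replaces the `ifelse()` plus `na.locf()` pair with a grouped assignment, which relies on every report starting with its "contentID" row, as in the example data.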