Arrange contents in a dataframe extracted from docx in R
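I have a document (.docx), available at the link below, and I extracted its contents with the officer package. https://1drv.ms/w/s?AmwfO49TqaeQhMVx-_pXn-9-3onRRw?e=oe782f
Here is a picture of the document; headings 1, 2, and 3 are shown in different colors.
I extracted the contents of the document with the code below.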
library(officer)

doc <- read_docx("test.docx")
content <- docx_summary(doc)
head(content)

# To get all paragraphs:
par_data <- subset(content, content_type %in% "paragraph")
par_data <- par_data[, c("doc_index", "style_name", "text")]
# Keep only the first 30 characters of each paragraph
# (substr() already stops at the end of shorter strings)
par_data$text <- substr(par_data$text, start = 1, stop = 30)
par_data
The dataframe can be reproduced with the following code.
par_data <- data.frame(doc_index = 1:21,
style_name = c("heading 1", "heading 2", "heading 3",NA ,NA,NA, "heading 2", "heading 3", NA,NA,NA, NA,"heading 2", "heading 3", NA, NA, "heading 1", "heading 2","heading 3", NA,NA ),
text = c(' Cardiovascular drugs ', ' ACE inhibitors. ', ' Valsartan ', ' Valsartan is used to treat hig ', ' Side effects ', ' high potassium; headache, dizz ', ' Beta blockers. ', ' propranolol ', ' Propranolol is prescribed for ', ' Side effects ', ' slow or uneven heartbeats', ' wheezing or trouble breathing ', ' Calcium channel blockers. ', ' Nifedipine ', ' Side effects ', ' Bloating or swelling of the fa ', ' Neurological drugs ', ' Anticonvulsants ', ' Phenytoin ', ' Side effects ', ' Decreased coordination, mental '))
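What I need is to reshape this dataframe into something like this:
In fact, I need headings 1 and 2 as columns, where each drug (always a heading 3) gets the text of the most recent heading 1 and heading 2 in those columns. I also need two more columns. Some drugs have a description followed by side effects, while others have only side effects, in the rows before the next heading 1, 2, or 3 appears. Is there an easy way to do this? Any help is appreciated.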
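This is more than just reshaping: it needs some inference based on the previous text and style_name values, plus "last observation carried forward" (LOCF). The data also has whitespace at the start/end of the strings, so I would clean that up with trimws.
I think this does what you are asking for: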
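To make the two building blocks concrete, here is a minimal base R sketch of trimming and LOCF on a short vector. The `locf` helper is hypothetical (written here for illustration, not taken from any package); `tidyr::fill` or `zoo::na.locf` do the same job.

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(v) {
  seen <- cumsum(!is.na(v))        # number of non-NA values seen so far
  c(NA, v[!is.na(v)])[seen + 1]    # position 1 (NA) covers any leading NAs
}

h1 <- c(" Cardiovascular drugs ", NA, NA, " Neurological drugs ", NA)
h1 <- trimws(h1)                   # strips surrounding whitespace; NA stays NA
filled <- locf(h1)
print(filled)
```

Each `NA` is replaced by the closest non-`NA` value above it, which is how a heading 1 gets propagated down to every drug listed under it.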
library(dplyr)
library(tidyr) # pivot_wider, fill
par_data %>%
  mutate(across(where(is.character), trimws)) %>%
  mutate(
    # a new group starts at each heading that follows a non-heading row
    grp = cumsum(is.na(lag(style_name)) & !is.na(style_name)),
    style_name = case_when(
      is.na(style_name) & lag(text) == "Side effects" ~ "sideeffects",
      is.na(style_name) & lag(style_name) == "heading 3" &
        !text %in% "Side effects" ~ "description",
      TRUE ~ style_name)
  ) %>%
  filter(!is.na(style_name)) %>%
  # id_cols must be named in recent tidyr versions
  pivot_wider(id_cols = grp, names_from = "style_name", values_from = "text") %>%
  tidyr::fill(`heading 1`)
# # A tibble: 4 x 6
# grp `heading 1` `heading 2` `heading 3` description sideeffects
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 Cardiovascular drugs ACE inhibitors. Valsartan Valsartan is used to treat hig high potas~
# 2 2 Cardiovascular drugs Beta blockers. propranolol Propranolol is prescribed for slow or un~
# 3 3 Cardiovascular drugs Calcium channel blockers. Nifedipine NA Bloating o~
# 4 4 Neurological drugs Anticonvulsants Phenytoin NA Decreased ~
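One caveat: this labels only the single row right after "Side effects", so a second side-effect line (such as "wheezing or trouble breathing" under propranolol) is dropped by the filter. If you need every line, one possible approach is to split each drug's body rows at the marker. The sketch below is base R, and `split_body` and `collapse_or_na` are hypothetical helpers, assuming the marker text is exactly "Side effects" after trimming.

```r
# Hypothetical helper: paste lines together, or NA when there are none.
collapse_or_na <- function(lines, sep) {
  if (length(lines) == 0) NA_character_ else paste(lines, collapse = sep)
}

# Hypothetical helper: split the body rows of ONE drug at "Side effects".
split_body <- function(lines) {
  marker <- match("Side effects", lines)   # first marker position, NA if absent
  if (is.na(marker)) {
    return(list(description = collapse_or_na(lines, " "),
                sideeffects = NA_character_))
  }
  list(
    description = collapse_or_na(lines[seq_len(marker - 1)], " "),
    sideeffects = collapse_or_na(lines[-seq_len(marker)], "; ")
  )
}

res <- split_body(c("Propranolol is prescribed for", "Side effects",
                    "slow or uneven heartbeats", "wheezing or trouble breathing"))
print(res$sideeffects)
```

You could apply this per `grp` to the rows whose `style_name` is `NA` and bind the results to the heading columns; I have not wired that into the pipeline above.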
This can be done outside of the tidyverse, though it would still benefit from an external package function (reshape2::dcast)... stats::reshape can be a bit cumbersome.
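For reference, here is a rough base-R-only sketch of the same idea using stats::reshape. It runs on a shortened toy copy of the data (two drugs, already trimmed) rather than the full `par_data`. The names `toy`, `prev_style`, `prev_text`, `long`, and `wide` are made up for this illustration.

```r
toy <- data.frame(
  style_name = c("heading 1", "heading 2", "heading 3", NA, NA, NA,
                 "heading 2", "heading 3", NA, NA),
  text = c("Cardiovascular drugs", "ACE inhibitors.", "Valsartan",
           "Valsartan is used to treat hig", "Side effects",
           "high potassium; headache, dizz",
           "Calcium channel blockers.", "Nifedipine", "Side effects",
           "Bloating or swelling of the fa"),
  stringsAsFactors = FALSE
)

# Base R stand-in for lag(): shift each column down by one row
prev_style <- c(NA, head(toy$style_name, -1))
prev_text  <- c(NA, head(toy$text, -1))

# A new group starts at each heading that follows a non-heading row
toy$grp <- cumsum(is.na(prev_style) & !is.na(toy$style_name))

# Relabel body rows; all other NA rows (e.g. the marker itself) stay NA
is_body <- is.na(toy$style_name)
side <- is_body & !is.na(prev_text) & prev_text == "Side effects"
desc <- is_body & !is.na(prev_style) & prev_style == "heading 3" &
  toy$text != "Side effects"
toy$style_name[side] <- "sideeffects"
toy$style_name[desc] <- "description"

long <- toy[!is.na(toy$style_name), c("grp", "style_name", "text")]
wide <- reshape(long, idvar = "grp", timevar = "style_name", direction = "wide")
names(wide) <- sub("^text\\.", "", names(wide))   # drop the "text." prefix

# LOCF on heading 1
h1 <- wide[["heading 1"]]
wide[["heading 1"]] <- c(NA, h1[!is.na(h1)])[cumsum(!is.na(h1)) + 1]
print(wide)
```

It works, but the prefix stripping and the manual LOCF are the kind of bookkeeping that makes me prefer `dcast` or `pivot_wider`.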
If you are already using (or considering) data.table, here is a rough equivalent of the dplyr pipeline:
library(data.table)
chrs <- which(sapply(par_data, is.character))
as.data.table(par_data)[, c(chrs) := lapply(.SD, trimws), .SDcols = chrs
  ][, grp := cumsum(is.na(shift(style_name)) & !is.na(style_name))
  ][, style_name := fcase(
      is.na(style_name) & shift(text) == "Side effects", "sideeffects",
      # use shift() here too: lag() is dplyr's and only works if dplyr is loaded
      is.na(style_name) & shift(style_name) == "heading 3" &
        !text %in% "Side effects", "description",
      rep(TRUE, .N), style_name)
  ][!is.na(style_name),
  ][, dcast(grp ~ style_name, value.var = "text", data = .SD)
  ][, `heading 1` := zoo::na.locf(`heading 1`)
  ][, .(`heading 1`, `heading 2`, `heading 3`, description, sideeffects) ]
# heading 1 heading 2 heading 3 description sideeffects
# 1: Cardiovascular drugs ACE inhibitors. Valsartan Valsartan is used to treat hig high potassium; headache, dizz
# 2: Cardiovascular drugs Beta blockers. propranolol Propranolol is prescribed for slow or uneven heartbeats
# 3: Cardiovascular drugs Calcium channel blockers. Nifedipine <NA> Bloating or swelling of the fa
# 4: Neurological drugs Anticonvulsants Phenytoin <NA> Decreased coordination, mental