
Arrange contents in a dataframe extracted from docx in R

I have a document (.docx), available at the link below, and I extracted its contents using the officer package. https://1drv.ms/w/s?AmwfO49TqaeQhMVx-_pXn-9-3onRRw?e=oe782f

Here is a picture of the document; headings 1, 2, and 3 appear in different colors.

[image: screenshot of the document]

Using the code below, I extracted the contents of the document.

library(officer)
doc <- read_docx("test.docx")
content <- docx_summary(doc)
head(content)

#To get all paragraphs:
par_data <- subset(content, content_type %in% "paragraph") 
par_data <- par_data[, c("doc_index", "style_name", 
                         "text") ]
# Truncate each paragraph to its first 30 characters
par_data$text <- substr(par_data$text, 1, 30)
par_data

The dataframe can be reproduced with the following code.

par_data <- data.frame(doc_index = 1:21, 
                   style_name = c("heading 1", "heading 2", "heading 3",NA ,NA,NA, "heading 2", "heading 3", NA,NA,NA, NA,"heading 2", "heading 3", NA, NA, "heading 1", "heading 2","heading 3", NA,NA ), 
                   text = c(' Cardiovascular drugs ', ' ACE inhibitors. ', ' Valsartan ', ' Valsartan is used to treat hig ', ' Side effects ', ' high potassium; headache, dizz ', ' Beta blockers. ', ' propranolol ', ' Propranolol is prescribed for  ', ' Side effects ', ' slow or uneven heartbeats', ' wheezing or trouble breathing ', ' Calcium channel blockers. ', ' Nifedipine ', ' Side effects ', ' Bloating or swelling of the fa ', ' Neurological drugs ', ' Anticonvulsants ', ' Phenytoin  ', ' Side effects ', ' Decreased coordination, mental '))

What I need is to reshape this dataframe into something like this:

[image: desired output table]

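In effect, I need headings 1 and 2 as columns, where each drug (always a heading 3) takes the text of the most recent heading 1 and heading 2 above it. I also need two more columns. Some drugs have a description followed by side effects, while others have only side effects, in the rows before the next heading 1, 2, or 3 appears. Is there a simple way to do this? Any help is appreciated.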

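This is more than just reshaping: it needs some inference based on the previous row's text and style_name values, plus "last observation carried forward" (LOCF). The data also has whitespace at the start and end of the strings, so I'll clean that up with trimws.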
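The two ideas can be sketched in base R on a toy vector. The vector `style` and the helper `locf` below are made up for illustration and are not part of the original data: a new group starts wherever a non-NA style follows an NA (or the start of the vector), and LOCF fills each NA with the most recent non-NA value.

```r
# Toy input (hypothetical): NA marks body paragraphs without a heading style
style <- c("heading 1", "heading 2", NA, NA, "heading 2", NA)

# Previous element of each entry (base-R stand-in for dplyr::lag)
prev <- c(NA, head(style, -1))

# A group starts where a heading follows a non-heading row (or the start)
grp <- cumsum(is.na(prev) & !is.na(style))

# Hypothetical helper: last observation carried forward
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the latest non-NA seen so far
  c(NA, x[!is.na(x)])[idx + 1]   # leading NAs stay NA
}

filled <- locf(style)
```

Here `grp` separates the first four entries from the last two, and `filled` repeats the latest heading over the NA entries.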

dplyr

I think this does what you want:

library(dplyr)
library(tidyr) # pivot_wider, fill
par_data %>%
  mutate(across(where(is.character), trimws)) %>%
  mutate(
    grp = cumsum(is.na(lag(style_name)) & !is.na(style_name)),
    style_name = case_when(
      is.na(style_name) & lag(text) == "Side effects" ~ "sideeffects",
      is.na(style_name) & lag(style_name) == "heading 3" &
        !text %in% "Side effects" ~ "description",
      TRUE ~ style_name)
  ) %>%
  filter(!is.na(style_name)) %>%
  pivot_wider(id_cols = grp, names_from = "style_name", values_from = "text") %>%
  tidyr::fill(`heading 1`)
# # A tibble: 4 x 6
#     grp `heading 1`          `heading 2`               `heading 3` description                    sideeffects
#   <int> <chr>                <chr>                     <chr>       <chr>                          <chr>      
# 1     1 Cardiovascular drugs ACE inhibitors.           Valsartan   Valsartan is used to treat hig high potas~
# 2     2 Cardiovascular drugs Beta blockers.            propranolol Propranolol is prescribed for  slow or un~
# 3     3 Cardiovascular drugs Calcium channel blockers. Nifedipine  NA                             Bloating o~
# 4     4 Neurological drugs   Anticonvulsants           Phenytoin   NA                             Decreased ~

This can be done outside the tidyverse, though it would still benefit from an external package function ( reshape2::dcast )... stats::reshape can be a bit cumbersome.
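For reference, a dependency-free variant is sketched below. It is not the approach used in this answer: instead of reshape2::dcast or stats::reshape it fills a character matrix by (group, column) index. It runs on a reduced, already-trimmed copy of the question's data (two drugs only), and the names `d`, `cols`, and `wide` are chosen here for illustration.

```r
# Reduced copy of the question's data (two drugs only), already trimmed
d <- data.frame(
  style_name = c("heading 1", "heading 2", "heading 3", NA, NA, NA,
                 "heading 2", "heading 3", NA, NA),
  text = c("Cardiovascular drugs", "ACE inhibitors.", "Valsartan",
           "Valsartan is used to treat hig", "Side effects",
           "high potassium; headache, dizz",
           "Calcium channel blockers.", "Nifedipine", "Side effects",
           "Bloating or swelling of the fa"),
  stringsAsFactors = FALSE
)

# Values of the previous row (base-R stand-in for lag)
prev_style <- c(NA, head(d$style_name, -1))
prev_text  <- c(NA, head(d$text, -1))

# New group wherever a heading follows a body row (or the start)
d$grp <- cumsum(is.na(prev_style) & !is.na(d$style_name))

# Label body rows with the same rules as the case_when() call
is_body <- is.na(d$style_name)
d$style_name[is_body & prev_text %in% "Side effects"] <- "sideeffects"
d$style_name[is_body & prev_style %in% "heading 3" &
               !d$text %in% "Side effects"] <- "description"
d <- d[!is.na(d$style_name), ]

# Long to wide: one row per group, one column per label
cols <- c("heading 1", "heading 2", "heading 3", "description", "sideeffects")
wide <- matrix(NA_character_, nrow = max(d$grp), ncol = length(cols))
wide[cbind(d$grp, match(d$style_name, cols))] <- d$text
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
names(wide) <- cols

# LOCF on heading 1 (assumes the first group has a heading 1)
h1 <- wide[["heading 1"]]
wide[["heading 1"]] <- h1[!is.na(h1)][cumsum(!is.na(h1))]
```

The matrix-index assignment relies on each (group, label) pair occurring at most once, which holds for the labelling rules above.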

data.table

If you are already using (or considering) data.table, this is roughly equivalent to the above:

library(data.table)
chrs <- which(sapply(par_data, is.character))
as.data.table(par_data)[, c(chrs) := lapply(.SD, trimws), .SDcols = chrs
  ][, grp := cumsum(is.na(shift(style_name)) & !is.na(style_name))
    ][, style_name := fcase(
        is.na(style_name) & shift(text) == "Side effects", "sideeffects",
        is.na(style_name) & shift(style_name) == "heading 3" &
          !text %in% "Side effects", "description",
        rep(TRUE, .N),  style_name)
      ][!is.na(style_name),
        ][, dcast(grp ~ style_name, value.var = "text", data = .SD)
          ][, `heading 1` := zoo::na.locf(`heading 1`)
            ][, .(`heading 1`, `heading 2`, `heading 3`, description, sideeffects) ]
#               heading 1                 heading 2   heading 3                    description                    sideeffects
# 1: Cardiovascular drugs           ACE inhibitors.   Valsartan Valsartan is used to treat hig high potassium; headache, dizz
# 2: Cardiovascular drugs            Beta blockers. propranolol  Propranolol is prescribed for      slow or uneven heartbeats
# 3: Cardiovascular drugs Calcium channel blockers.  Nifedipine                           <NA> Bloating or swelling of the fa
# 4:   Neurological drugs           Anticonvulsants   Phenytoin                           <NA> Decreased coordination, mental
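Note that the rules above label only the row immediately after "Side effects", so a second side-effect line (for propranolol, "wheezing or trouble breathing") does not appear in the output. If every line up to the next heading should be kept, one possible adjustment is sketched below in base R. The data frame `toy` is a hypothetical, already-trimmed extract with the group id precomputed, and joining lines with "; " is an assumption about the desired format.

```r
# Hypothetical extract: two drugs, the first with two side-effect lines
toy <- data.frame(
  grp = c(1, 1, 1, 1, 2, 2, 2),
  style_name = c("heading 3", NA, NA, NA, "heading 3", NA, NA),
  text = c("propranolol", "Side effects", "slow or uneven heartbeats",
           "wheezing or trouble breathing",
           "Nifedipine", "Side effects", "Bloating or swelling of the fa"),
  stringsAsFactors = FALSE
)

is_body   <- is.na(toy$style_name)
is_marker <- is_body & toy$text == "Side effects"

# Within each group, TRUE once the "Side effects" marker has been seen
seen <- ave(as.integer(is_marker), toy$grp, FUN = cumsum) > 0

# Body rows after the marker are side-effect lines
se_rows <- is_body & !is_marker & seen

# Collapse all side-effect lines of a group into one string
sideeffects <- tapply(toy$text[se_rows], toy$grp[se_rows],
                      paste, collapse = "; ")
```

The result has one element per group and could be joined back onto the wide table by `grp`.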
