R: Merge a variable number of rows across multiple columns based on the non-empty rows of another column

Question

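I extracted a table from a PDF file with extract_tables(), but the text ended up scattered across multiple rows. The number of rows per record varies. I want to combine the text of each record into a single value per column.

What I want to do is similar to this post. The difference is that I have text in multiple columns. The number of rows an entry occupies is variable and depends on a different column each time.

Example: one entry may occupy four rows because the "Name & location" column is spread over four rows (while the other columns only occupy two rows for that entry; the rest is filled with NA). For another entry, the text may be spread over six rows because of the length of the text in the "Expertise" column.

A new record starts whenever the "Level" column contains a value rather than NA. Edit: the "Level" values are not unique.

My data looks like this: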

Name & location                 Expertise           Type            Sector               Payment            Level
 1:   Ms. Jane                  Student             Higher          Government and       payment               1
 2:   Doe,                      <NA>                Education       education            has been           <NA>
 3:   NUS                       <NA>                institute       <NA>                 received           <NA>
 4:   Andrew Saunders Phd.,     Chief               Municipal       Government and       payment               5
 5:   Municipality of           Education           government      education            has not            <NA>
 6:   Amsterdam                 Officer             <NA>            <NA>                 been               <NA>
 7:   <NA>                      <NA>                <NA>            <NA>                 received           <NA>
 8:   Mr. Stephen               Spokesperson for    Municipal       Government and       payment               3
 9:   Johnson,                  Sustainability,     government      education            has not            <NA>
10:   Orange County             Health &            <NA>            <NA>                 been               <NA>
11:   <NA>                      Wellbeing and       <NA>            <NA>                 received           <NA>
12:   <NA>                      Wellfare            <NA>            <NA>                 <NA>               <NA>
13:   Mrs. Susan                Junior              national        Government and       payment               4
14:   Andrews,                  Research            government      education            has not            <NA>
15:   Police                    Manager             <NA>            <NA>                 been               <NA>
16:   <NA>                      Money               <NA>            <NA>                 received           <NA>
17:   <NA>                      Laundering          <NA>            <NA>                 <NA>               <NA>

Reproducible example:

structure(list(`Name & location` = c("1:   Ms. Jane", "2:   Doe,", 
"3:   NUS", "4:   Andrew Saunders Phd.,", "5:   Municipality of", 
"6:   Amsterdam", "7:   <NA>", "8:   Mr. Stephen", "9:   Johnson,", 
"10:   Orange County", "11:   <NA>", "12:   <NA>", "13:   Mrs. Susan", 
"14:   Andrews,", "15:   Police", "16:   <NA>", "17:   <NA>"), 
    Expertise = c("Student", NA, NA, "Chief", "Education", "Officer", 
    NA, "Spokesperson for", "Sustainability,", "Health &", "Wellbeing and", 
    "Wellfare", "Junior", "Research", "Manager", "Money", "Laundering"
    ), Type = c("Higher", "Education", "Insititute", "Municipal", 
    "Government", NA, NA, "Municipal", "Government", NA, NA, 
    NA, "National", "Government", NA, NA, NA), Sector = c("Government and", 
    "education", NA, "Government and", "education", NA, NA, "Government and", 
    "education", NA, NA, NA, "Government and", "education", NA, 
    NA, NA), Payment = c("payment", "has been", "received", "Payment", 
    "has not", "been", "received", "Payment", "has not", "been", 
    "received", NA, "Payment", "has not", "been", "received", 
    NA), Level = c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA, 
    4, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df", 
"tbl", "data.frame"))

What I have tried so far is different versions of the code below:

DF_clean <- DF %>% mutate(Level = ifelse(grepl(NA, Level))) %>%
  group_by(id = cumsum(!is.na(Level))) %>% 
  mutate(Level = first(Level)) %>% 
  group_by(Level) %>% 
  summarise(Name = paste(Name, collapse = " "),
            Expertise = paste(Expertise, collapse = " "),
            Type = paste(Type, collapse = " "),
            Sector = paste(Sector, collapse = " "),
            Level = paste(Level, collapse = " "))

But this seems to collapse all the text into a single record.

Any ideas on how to solve this?

Answer 1

There are surely prettier solutions, but this seems to work. It also works if Level contains duplicate values.

library(dplyr)

# Remove row numbers and <NA> from Name & location
df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("<NA>", "", `Name & location`))

# Compute the row ranges to merge: each non-NA Level starts a new record.
# lapply() always returns a list; sapply() would simplify equal-length
# ranges into a matrix and break the merge step below.
starts <- c(which(!is.na(df$Level)), nrow(df) + 1)
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i)
    starts[i]:(starts[i + 1] - 1)
)

# Merge rows within each range; drop trailing spaces and the pasted "NA" text
combined_df <- lapply(
  ranges,
  function(rows)
    lapply(df[rows, ], function(col) gsub(" +$| NA", "", paste0(col, collapse = " ")))
) %>%
  bind_rows()


# A tibble: 4 x 6
  `Name & location`                               Expertise                                                        Type                        Sector                   Payment                       Level
  <chr>                                           <chr>                                                            <chr>                       <chr>                    <chr>                         <chr>
1 Ms. Jane Doe, NUS                               Student                                                          Higher Education Insititute Government and education payment has been received     1    
2 Andrew Saunders Phd., Municipality of Amsterdam Chief Education Officer                                          Municipal Government        Government and education Payment has not been received 5    
3 Mr. Stephen Johnson, Orange County              Spokesperson for Sustainability, Health & Wellbeing and Wellfare Municipal Government        Government and education Payment has not been received 3    
4 Mrs. Susan Andrews, Police                      Junior Research Manager Money Laundering                         National Government         Government and education Payment has not been received 4
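
To make the range computation easier to follow, here is a minimal base R sketch on a toy Level vector. The vector and the name `ends` are made up for illustration and are not part of the original data. It shows that the ranges depend only on where the non-NA values sit, not on the values themselves, which is why repeated levels cause no trouble.

```r
# Toy Level column (assumed for illustration): the value 2 appears twice
level <- c(2, NA, 2, NA, NA, 7)

# Each non-NA Level marks the first row of a record
starts <- which(!is.na(level))               # 1 3 6
ends   <- c(starts[-1] - 1, length(level))   # 2 5 6

# lapply() always returns a list of row ranges, one per record
ranges <- lapply(seq_along(starts), function(i) starts[i]:ends[i])

# sapply() would simplify equal-length results into a matrix instead
simplified <- sapply(1:2, function(i) c(i, i + 1))
is.matrix(simplified)                        # TRUE
```

Because the second record starts at row 3 regardless of its Level value, the two records with Level 2 stay separate.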

Edit: I used @Andrew's solution to compute a new unique_level column and got it to work. It is better than my first solution, IMHO:

library(tidyverse)

df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("<NA>", "", `Name & location`)) %>%
  mutate(unique_level = ifelse(!is.na(Level), 1, NA) * 1:nrow(df)) %>%
  fill(unique_level, .direction = "down") %>%
  group_by(unique_level) %>%
  summarise_all(~ gsub(" +$| NA", "", paste(., collapse = " "))) %>%
  select(-unique_level)

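The first two mutate calls remove the row numbers and <NA> from the Name & location column. The gsub call in summarise_all removes trailing whitespace and the "NA" text that is added when the rows are pasted together.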
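
The cleanup steps described above can be tried in isolation. The following is a small base R sketch on made-up values. The helper `collapse_text` is a hypothetical name, not something from the answer. It also shows a caveat worth knowing: the pattern `" NA"` matches inside ordinary words too, so dropping NA values before pasting may be a safer variant.

```r
# Values as they might come out of the extracted table (toy data, assumed)
name <- c("8:   Mr. Stephen", "9:   Johnson,", "11:   <NA>")

# Strip the leading row number, then the literal "<NA>" text
name <- gsub("[0-9]+:\\s+", "", name)
name <- gsub("<NA>", "", name)
# name is now "Mr. Stephen" "Johnson," ""

# Pasting a real NA produces the text "NA", which the gsub() then removes
pasted  <- paste(c("Chief", "Officer", NA), collapse = " ")  # "Chief Officer NA"
cleaned <- gsub(" +$| NA", "", pasted)                       # "Chief Officer"

# Caveat: " NA" also matches at the start of ordinary words
mangled <- gsub(" +$| NA", "", "Works at NASA")              # "Works atSA"

# Alternative sketch: drop NA and empty strings before pasting
collapse_text <- function(x) paste(x[!is.na(x) & x != ""], collapse = " ")
safe <- collapse_text(c("Works at", NA, "NASA", ""))         # "Works at NASA"
```

With the sample data in the question the original pattern gives the expected result; the variant only matters if a value can begin with the letters "NA".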

Answer 2

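Edit:

Here, this cleans things up a bit and also works for non-unique levels. You will also need to have data.table installed, because I use rleid to create a new level variable (assuming it is fine to overwrite it and lose the actual level values). If you need to keep the original levels, just create a new rleid level column and group by that. Let me know if you have any questions!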

library(tidyverse)

df1 %>%
  fill(Level, .direction = "down") %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+(<NA>)*", "", `Name & location`)) %>%
  replace(is.na(.), "") %>%
  group_by(Level = data.table::rleid(Level)) %>%
  summarise_all(~ trimws(paste(., collapse = " ")))

Level `Name & location`                  Expertise                                 Type              Sector           Payment            
  <chr> <chr>                              <chr>                                     <chr>             <chr>            <chr>              
1 1     Ms. Jane Doe, NUS                  Student                                   Higher Education~ Government and ~ payment has been r~
2 2     Andrew Saunders Phd., Municipalit~ Chief Education Officer                   Municipal Govern~ Government and ~ Payment has not be~
3 3     Mr. Stephen Johnson, Orange County Spokesperson for Sustainability, Health ~ Municipal Govern~ Government and ~ Payment has not be~
4 4     Mrs. Susan Andrews, Police         Junior Research Manager Money Laundering  National Governm~ Government and ~ Payment has not be~
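
The suggestion above to keep the original levels in a separate grouping column can be sketched as follows. This is only an illustration in base R on made-up data, with `rle()` standing in for `data.table::rleid()`; the names `record_id`, `run_id` and `collapse_text` are hypothetical. It also shows one thing to watch for: if two consecutive records happen to share the same Level, run ids computed on the filled-down column put them into one group, whereas counting the non-NA starts keeps them apart.

```r
# Toy data (assumed for illustration): two consecutive records share Level 5
df <- data.frame(
  Name  = c("Ann", "Lee,", "NUS", "Bob", "Ray"),
  Level = c(5, NA, NA, 5, NA),
  stringsAsFactors = FALSE
)

# record_id counts record starts, so it stays unique even when Level repeats
record_id <- cumsum(!is.na(df$Level))                         # 1 1 1 2 2

# For comparison: run ids of the filled-down Level, a base R stand-in for
# fill() followed by data.table::rleid()
filled <- df$Level[cummax(seq_along(df$Level) * !is.na(df$Level))]
run_id <- with(rle(filled), rep(seq_along(lengths), lengths)) # 1 1 1 1 1

# Collapse the text per record and keep the original Level next to it
collapse_text <- function(x) paste(x[!is.na(x)], collapse = " ")
out <- data.frame(
  Name  = unname(vapply(split(df$Name, record_id), collapse_text, character(1))),
  Level = df$Level[!is.na(df$Level)],
  stringsAsFactors = FALSE
)
out
```

In the question's sample data no two neighbouring records share a Level, so both kinds of id give the same four groups there.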

Answer 1
3, accepted, 2019-10-17 14:08:37

Answer 2
2, 2019-10-17 14:09:17
