R: Merge a variable number of rows in multiple columns based on non-empty rows in another column
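I extracted a table from a PDF file with extract_tables(), but the text has been spread across multiple rows. The number of rows per record varies. I would like to combine the text into a single value.
What I want to do is similar to this post. The difference is that I have text in multiple columns. The number of rows used by each entry is variable and depends on a different column each time.
Example: one entry may take up four rows because the "Name & location" column is spread over four rows (while the other columns only take up two rows for that entry; the rest is filled with NA). For another entry, the text may be spread over six rows because of the length of the text in the "Expertise" column.
A new record starts whenever the "Level" column contains a value instead of NA. Edit: the "Level" values are non-unique.
My data looks like this: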
    Name & location        Expertise         Type        Sector          Payment   Level
 1: Ms. Jane               Student           Higher      Government and  payment   1
 2: Doe,                   <NA>              Education   education       has been  <NA>
 3: NUS                    <NA>              institute   <NA>            received  <NA>
 4: Andrew Saunders Phd.,  Chief             Municipal   Government and  payment   5
 5: Municipality of        Education         government  education       has not   <NA>
 6: Amsterdam              Officer           <NA>        <NA>            been      <NA>
 7: <NA>                   <NA>              <NA>        <NA>            received  <NA>
 8: Mr. Stephen            Spokesperson for  Municipal   Government and  payment   3
 9: Johnson,               Sustainability,   government  education       has not   <NA>
10: Orange County          Health &          <NA>        <NA>            been      <NA>
11: <NA>                   Wellbeing and     <NA>        <NA>            received  <NA>
12: <NA>                   Wellfare          <NA>        <NA>            <NA>      <NA>
13: Mrs. Susan             Junior            national    Government and  payment   4
14: Andrews,               Research          government  education       has not   <NA>
15: Police                 Manager           <NA>        <NA>            been      <NA>
16: <NA>                   Money             <NA>        <NA>            received  <NA>
17: <NA>                   Laundering        <NA>        <NA>            <NA>      <NA>
Reproducible example:
structure(list(`Name & location` = c("1: Ms. Jane", "2: Doe,",
"3: NUS", "4: Andrew Saunders Phd.,", "5: Municipality of",
"6: Amsterdam", "7: <NA>", "8: Mr. Stephen", "9: Johnson,",
"10: Orange County", "11: <NA>", "12: <NA>", "13: Mrs. Susan",
"14: Andrews,", "15: Police", "16: <NA>", "17: <NA>"),
Expertise = c("Student", NA, NA, "Chief", "Education", "Officer",
NA, "Spokesperson for", "Sustainability,", "Health &", "Wellbeing and",
"Wellfare", "Junior", "Research", "Manager", "Money", "Laundering"
), Type = c("Higher", "Education", "Insititute", "Municipal",
"Government", NA, NA, "Municipal", "Government", NA, NA,
NA, "National", "Government", NA, NA, NA), Sector = c("Government and",
"education", NA, "Government and", "education", NA, NA, "Government and",
"education", NA, NA, NA, "Government and", "education", NA,
NA, NA), Payment = c("payment", "has been", "received", "Payment",
"has not", "been", "received", "Payment", "has not", "been",
"received", NA, "Payment", "has not", "been", "received",
NA), Level = c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA,
4, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df",
"tbl", "data.frame"))
What I have tried so far are different versions of the code below:
DF_clean <- DF %>% mutate(Level = ifelse(grepl(NA, Level))) %>%
group_by(id = cumsum(!is.na(Level))) %>%
mutate(Level = first(Level)) %>%
group_by(Level) %>%
summarise(Name = paste(Name, collapse = " "),
Expertise = paste(Expertise, collapse = " "),
Type = paste(Type, collapse = " "),
Sector = paste(Sector, collapse = " "),
Level = paste(Level, collapse = " "))
But this seems to collapse all the text into a single record.
Any ideas on how to solve this?
There are surely prettier solutions, but this seems to work. It also works if Level contains duplicate values.
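library(dplyr)

# Remove row numbers and <NA> from Name & location
df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("<NA>", "", `Name & location`))

# Compute the row ranges to merge: each record runs from one non-NA Level
# up to the row before the next one (lapply always returns a list)
starts <- c(which(!is.na(df$Level)), nrow(df) + 1)
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)

# Merge lines based on ranges: paste each column within a range, then strip
# trailing spaces and the " NA" strings that paste0() produces from NA values
combined_df <- lapply(
  ranges,
  function(rows)
    lapply(df[rows, ], function(col) gsub(" +$| NA", "", paste0(col, collapse = " ")))
) %>%
  bind_rows()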
# A tibble: 4 x 6
`Name & location` Expertise Type Sector Payment Level
<chr> <chr> <chr> <chr> <chr> <chr>
1 Ms. Jane Doe, NUS Student Higher Education Insititute Government and education payment has been received 1
2 Andrew Saunders Phd., Municipality of Amsterdam Chief Education Officer Municipal Government Government and education Payment has not been received 5
3 Mr. Stephen Johnson, Orange County Spokesperson for Sustainability, Health & Wellbeing and Wellfare Municipal Government Government and education Payment has not been received 3
4 Mrs. Susan Andrews, Police Junior Research Manager Money Laundering National Government Government and education Payment has not been received 4
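To see what the range computation produces, here is a minimal base R sketch that runs it on the Level column alone. The vector `level` is copied from the question's example data; the variable name `same_len` and the contrast case at the end are only illustrative. A list-returning `lapply()` is used because `sapply()` simplifies equal-length results to a matrix, which would break the later loop over ranges if all records happened to have the same number of rows.

```r
# Level column from the question's example data:
# a non-NA value marks the first row of a record.
level <- c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA, 4, NA, NA, NA, NA)

# Start row of every record, plus a sentinel one past the last row.
starts <- c(which(!is.na(level)), length(level) + 1)
starts
# 1 4 8 13 18

# One integer vector of row indices per record.
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)
lengths(ranges)
# 3 4 5 5

# For contrast: equal-length results are simplified to a matrix by sapply().
same_len <- sapply(1:2, function(i) c(2 * i - 1, 2 * i))
is.matrix(same_len)
# TRUE
```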
Edit: I used @Andrew's solution to compute a new unique_level column and got it working. In my opinion it is nicer than my first solution:
library(tidyverse)
df <- df %>%
mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
mutate(`Name & location` = gsub("<NA>", "", `Name & location`)) %>%
mutate(unique_level = ifelse(!is.na(Level), 1, NA) * 1:nrow(df)) %>%
fill(unique_level, .direction = "down") %>%
group_by(unique_level) %>%
summarise_all(~ gsub(" +$| NA", "", paste(., collapse = " "))) %>%
select(-unique_level)
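The first two mutate calls remove the row numbers and <NA> from the Name & location column. The gsub call in summarise_all removes trailing spaces and the NA strings that are added when the rows are pasted together.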
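The effect of that pattern can be checked on a single column of a single record. The sketch below uses base R only; `paste_non_na` is a hypothetical helper that is not part of the original answer, shown as one possible alternative that drops NA values before pasting instead of removing the text "NA" afterwards. The last two calls are made-up inputs that illustrate where the pattern may behave unexpectedly: a leading NA is not preceded by a space, and any word that starts with "NA" after a space is cut.

```r
# One column of one record before merging (values from the question).
expertise <- c("Student", NA, NA)

# paste() turns NA into the literal string "NA" ...
pasted <- paste(expertise, collapse = " ")
pasted
# "Student NA NA"

# ... and the pattern strips every " NA" as well as trailing spaces.
cleaned <- gsub(" +$| NA", "", pasted)
cleaned
# "Student"

# Hypothetical helper (an assumption, not from the answer):
# drop NA values first, then paste what is left.
paste_non_na <- function(x) paste(x[!is.na(x)], collapse = " ")
paste_non_na(expertise)
# "Student"

# Made-up edge cases for the pattern-based cleanup:
gsub(" +$| NA", "", paste(c(NA, "received"), collapse = " "))
# "NA received"
gsub(" +$| NA", "", "Visit NASA")
# "VisitSA"
```

With the example data neither edge case occurs, because the first row of every record has a value in each column and no word starts with "NA".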
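Edit:
Here, this cleans things up a bit and also works for non-unique levels. You will also need data.table installed, since I use rleid to create a new level variable (assuming it is fine to overwrite it and lose the actual level values). If you need to keep the original levels, just create a new rleid level column and group by that. Let me know if you have any questions!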
library(dplyr)
library(tidyr)

df1 %>%
  fill(Level, .direction = "down") %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+(<NA>)*", "", `Name & location`)) %>%
  replace(is.na(.), "") %>%
  group_by(Level = data.table::rleid(Level)) %>%
  summarise_all(~ trimws(paste(., collapse = " ")))
Level `Name & location` Expertise Type Sector Payment
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 Ms. Jane Doe, NUS Student Higher Education~ Government and ~ payment has been r~
2 2 Andrew Saunders Phd., Municipalit~ Chief Education Officer Municipal Govern~ Government and ~ Payment has not be~
3 3 Mr. Stephen Johnson, Orange County Spokesperson for Sustainability, Health ~ Municipal Govern~ Government and ~ Payment has not be~
4 4 Mrs. Susan Andrews, Police Junior Research Manager Money Laundering National Governm~ Government and ~ Payment has not be~
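One case that may be worth checking is two consecutive records that share the same Level: after fill(), their rows form a single run of equal values, so rleid() would put them in one group. A minimal sketch of an alternative grouping key, cumsum(!is.na(Level)), which also keeps the original Level value, is shown below. The toy data and the column name `record` are made up for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Toy data (made up): the first two records both have Level 2.
toy <- tibble(
  Text  = c("a", "b", "c", "d", "e"),
  Level = c(2, NA, 2, NA, 7)
)

merged <- toy %>%
  # The running count of non-NA Levels gives every record its own id,
  # even when neighbouring records have the same Level value.
  mutate(record = cumsum(!is.na(Level))) %>%
  group_by(record) %>%
  summarise(
    Text  = paste(Text, collapse = " "),
    Level = first(Level)  # the original Level of the record is kept
  )

merged
```

Here `merged` has three rows ("a b", "c d", "e") with Levels 2, 2 and 7, whereas filling Level down first and grouping on runs of equal values would combine the first two records.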