Iteratively concatenate the first/last n words from strings
Suppose I have the following data.frame:
df <- data.frame(string=c("word1 word2 word3 word4", "word1 word2", "word1"), stringsAsFactors = FALSE)
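I would like to derive, in a list (or per row), the concatenation of the first/last n words, with n running from 1 to the number of words. Expected result: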
list(
string1=c('left1'="word1", 'left2'= "word1 word2", 'left3'="word1 word2 word3",
'left4'="word1 word2 word3 word4",
'right1'="word4", 'right2'="word3 word4", 'right3'="word2 word3 word4"),
string2= c('left1'="word1", 'left2'="word1 word2", 'right1'="word2"),
string3="word1")
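(The element names are not needed at all, but they help with understanding.)
Not wanted: pasting middle elements, e.g. "word2 word3".
I currently use strsplit(df$string, " ") as a first step to prepare the desired list, and could then get what I want with a double loop, but that is far from efficient.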
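As a small illustration of that first step, here is a sketch of what strsplit returns for the example data. The variable name words is only an illustrative choice, and splitting on a single space is an assumption about how the words are delimited.

```r
df <- data.frame(
  string = c("word1 word2 word3 word4", "word1 word2", "word1"),
  stringsAsFactors = FALSE
)

# Split each string on single spaces (assumed delimiter); the result is
# a list holding one character vector of words per row of df.
words <- strsplit(df$string, " ")

lengths(words)  # number of words in each string: 4 2 1
words[[2]]      # words of the second string: "word1" "word2"
```

Every answer below starts from a list of this shape and differs only in how the words are pasted back together.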
A base R / data.table approach is preferred, but an efficient tidyverse solution would be nice too.
Base R version:
We can write a function that pastes the words together incrementally, adding one more word at each step.
paste_words <- function(x) {
  sapply(seq_along(x), function(y) paste0(x[1:y], collapse = " "))
}
lapply(strsplit(df$string, " "), function(x) c(paste_words(x), paste_words(rev(x))))
#[[1]]
#[1] "word1" "word1 word2" "word1 word2 word3" "word1 word2 word3 word4"
#[5] "word4" "word4 word3" "word4 word3 word2" "word4 word3 word2 word1"
#[[2]]
#[1] "word1" "word1 word2" "word2" "word2 word1"
#[[3]]
#[1] "word1" "word1"
You may want to wrap the result in unique to avoid duplicated entries, as in the last element.
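A minimal sketch of that suggestion, reusing the paste_words function from the answer above. Where exactly unique is placed is an assumption, since the answer does not spell it out; here it wraps the combined vector for each string.

```r
df <- data.frame(
  string = c("word1 word2 word3 word4", "word1 word2", "word1"),
  stringsAsFactors = FALSE
)

# Same helper as in the answer: cumulative paste of the first y words.
paste_words <- function(x) {
  sapply(seq_along(x), function(y) paste0(x[1:y], collapse = " "))
}

# Assumed placement of unique(): around the combined left/right vector,
# so that a one-word string yields a single element.
res <- lapply(strsplit(df$string, " "), function(x) {
  unique(c(paste_words(x), paste_words(rev(x))))
})

res[[3]]  # "word1"
```

Note that unique only drops exact duplicates. The right-hand pieces still come out in reversed word order ("word4 word3" rather than "word3 word4"), and the full reversed string is still included.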
One option using dplyr, tidyr and purrr could be:
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)  # provides rowid_to_column()

df %>%
  rowid_to_column() %>%
  separate_rows(string, sep = " ") %>%
  group_by(rowid) %>%
  transmute(concatenated = accumulate(string, ~ paste(.x, .y)),
            concatenated_rev = accumulate(rev(string), ~ paste(.x, .y)))
  rowid concatenated            concatenated_rev
  <int> <chr>                   <chr>
1     1 word1                   word4
2     1 word1 word2             word4 word3
3     1 word1 word2 word3       word4 word3 word2
4     1 word1 word2 word3 word4 word4 word3 word2 word1
5     2 word1                   word2
6     2 word1 word2             word2 word1
7     3 word1                   word1
Or, with additional left/right information:
df %>%
  rowid_to_column() %>%
  separate_rows(string, sep = " ") %>%
  group_by(rowid) %>%
  transmute(left = paste0("left", 1:n()),
            concatenated = accumulate(string, ~ paste(.x, .y)),
            right = paste0("right", 1:n()),
            concatenated_rev = accumulate(rev(string), ~ paste(.x, .y)))
  rowid left  concatenated            right  concatenated_rev
  <int> <chr> <chr>                   <chr>  <chr>
1     1 left1 word1                   right1 word4
2     1 left2 word1 word2             right2 word4 word3
3     1 left3 word1 word2 word3       right3 word4 word3 word2
4     1 left4 word1 word2 word3 word4 right4 word4 word3 word2 word1
5     2 left1 word1                   right1 word2
6     2 left2 word1 word2             right2 word2 word1
7     3 left1 word1                   right1 word1
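For readers who prefer to stay in base R, the running paste that accumulate performs can be sketched with Reduce(..., accumulate = TRUE). This is an illustrative analogue, not part of the answer above; the names left, right_rev and right are hypothetical.

```r
x <- c("word1", "word2", "word3", "word4")

# Left-to-right running paste, analogous to accumulate(string, paste).
left <- Reduce(paste, x, accumulate = TRUE)

# Running paste over the reversed words, analogous to the
# concatenated_rev column (word order inside each piece is reversed).
right_rev <- Reduce(paste, rev(x), accumulate = TRUE)

# Folding from the right instead keeps the original word order:
# element i is the paste of words i..n.
right <- Reduce(paste, x, accumulate = TRUE, right = TRUE)

left       # "word1" ... "word1 word2 word3 word4"
right_rev  # "word4" ... "word4 word3 word2 word1"
right      # "word1 word2 word3 word4" ... "word4"
```

The right = TRUE variant may be closer to the expected result in the question, since it yields "word3 word4" rather than "word4 word3".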
Thanks to Ronak's approach (thank you), I ended up with the following code. It is more elegant and more efficient than my loops.
paste_words_left <- function(x) {
  sapply(seq_along(x), function(y) paste0(x[1:y], collapse = " "))
}
paste_words_right <- function(x) {
  sapply(seq_along(x)[-1], function(y) paste0(x[y:length(x)], collapse = " "))
}
## lapply(strsplit(df$string, " "), function(x) c(paste_words_left(x), paste_words_right(x)))
lapply(strsplit(df$string, " "), function(x){
  if (length(x)==1) x else c(paste_words_left(x), paste_words_right(x))})
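The length-1 special case above is needed because sapply over an empty index returns an empty list rather than a character vector. A possible variant, sketched here with vapply, head and tail, avoids that branch. The function name left_right is hypothetical, and this is one way to write it rather than the method used in the thread.

```r
df <- data.frame(
  string = c("word1 word2 word3 word4", "word1 word2", "word1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first k words for k = 1..n, then last k words
# for k = 1..(n - 1). vapply always returns a character vector, even
# for an empty index, so one-word strings need no special case.
left_right <- function(x) {
  n <- length(x)
  left <- vapply(seq_len(n),
                 function(k) paste(head(x, k), collapse = " "),
                 character(1))
  right <- vapply(seq_len(max(n - 1L, 0L)),
                  function(k) paste(tail(x, k), collapse = " "),
                  character(1))
  c(left, right)
}

res <- lapply(strsplit(df$string, " "), left_right)
res[[2]]  # "word1" "word1 word2" "word2"
```

With this sketch the right-hand pieces come out shortest first ("word4", "word3 word4", ...), which matches the right1, right2, right3 order of the expected result in the question.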