Regex in R: how to fill dataframe with multiple matches to left and right of target string
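I want to extract the word combinations (collocates) to the left and right of a target word (the node) and store the three elements in a dataframe.
Data: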
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
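Goal:
The target word is the verb GO in any of its possible realizations, be it "go", "going", "goes", "gone" or "went". I'm interested in extracting the 3 words to the left of GO and the 3 words to the right of GO. The three words can cross sentence boundaries, but the extracted strings should not include punctuation.
What I've tried so far:
To extract the left-hand collocates I used str_extract_all from stringr: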
unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and" " It was" "s still"
[5] " And will" " and"
This captures most but not all matches, and it includes whitespace. The extraction of the node, by contrast, looks fine:
unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went" "went" "going" "Going" "going" "go" "go"
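One caveat about the node pattern: as written it has no word boundaries, so it can also match inside longer words. The sketch below uses base R only and a made-up test sentence (not from the question) to show the difference; the same \\b anchors should carry over to the str_extract_all() call, but treat that as an assumption to check.

```r
# Made-up test sentence (not from the question) with "go" hidden inside other words
x <- "Long ago the good goat went going"

node_loose  <- "(g|G)o(es|ing|ne)?|went"           # pattern from the question
node_strict <- "\\b((g|G)o(es|ing|ne)?|went)\\b"   # same pattern wrapped in word boundaries

loose  <- regmatches(x, gregexpr(node_loose,  x, perl = TRUE))[[1]]
strict <- regmatches(x, gregexpr(node_strict, x, perl = TRUE))[[1]]

loose   # "go" "go" "go" "went" "going"  (hits inside "ago", "good", "goat")
strict  # "went" "going"
```

On the sample text in the question this makes no difference, since it contains no such words.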
To extract the right-hand collocates:
unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went" " on" " on for quite" " on for ages" " on" " on and on"
[7] " on forever"
Again the matches are incomplete, and they include unwanted whitespace. Finally, assembling all the matches in a dataframe throws an error:
collocates <- data.frame(
Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")), :
arguments imply differing number of rows: 6, 7
Expected output:
Left                  Node   Right
This little sentence  went   on and went
went on and           went   on It was
on It was             going  on for quite
quite a while         Going  on for ages
ages It’s still       going  on And will
on And will           go     on and on
and on and            go     on forever
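Does anyone know how to fix this? Suggestions are much appreciated.
If you use quanteda, you can get the following result. When you work with text, you usually want it in lowercase, so I converted the uppercase letters with tolower(). I also removed . and , using gsub(). Then I applied kwic() to the text. If you do not mind losing capital letters, periods, and commas, you can get almost what you want.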
library(quanteda)
library(dplyr)
library(splitstackshape)
myvec <- c("go", "going", "goes", "gone", "went")
mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
# tokenize first: recent quanteda releases expect a tokens object in kwic()
mydf <- kwic(x = tokens(mytext), pattern = myvec, window = 3) %>%
as_tibble %>%
select(pre, keyword, post) %>%
cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>%
select(contains("pre"), keyword, contains("post"))
pre_1 pre_2 pre_3 keyword post_1 post_2 post_3
1: this little sentence went on and went
2: went on and went on it was
3: on it was going on for quite
4: quite a while going on for ages
5: ages it's still going on and will
6: on and will go on and on
7: and on and go on forever <NA>
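A little late, but not too late for posterity or for contemporaries doing collocation research on unannotated text, here is my own answer to my question. Full credit goes to @jazzurro for the pointer to quanteda in his answer.
My question was: how to compute the collocates of a given node in a text and store the results in a dataframe (this is the part not addressed by @jazzurro).
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
Step 1: Prepare the data for analysis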
go <- gsub("[.!?;,:]", "", tolower(GO)) # get rid of punctuation
go <- gsub("'", " ", tolower(go)) # separate clitics from host
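To see what the two gsub() calls do, here is a quick check on just the tail of the sample text (the variable name x is only for illustration):

```r
# Tail of the sample text, used here only as a small illustration
x <- "It's still going on. And will go on and on, and go on forever."

x <- gsub("[.!?;,:]", "", tolower(x))  # lower-case, then drop punctuation
x <- gsub("'", " ", x)                 # split the clitic: "it's" becomes "it s"

x
# "it s still going on and will go on and on and go on forever"
```

Note that the clitic "s" now counts as a word of its own, which is why it shows up as a separate collocate later on.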
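Step 2: Extract the KWIC using a regex pattern and the argument valuetype = "regex"
# tokenize first: recent quanteda releases expect a tokens object in kwic()
concord <- kwic(tokens(go), "go(es|ing|ne)?|went", window = 3, valuetype = "regex")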
concord
[text1, 4] this little sentence | went | on and went
[text1, 7] went on and | went | on it was
[text1, 11] on it was | going | on for quite
[text1, 17] quite a while | going | on for ages
[text1, 24] it s still | going | on and will
[text1, 28] on and will | go | on and on
[text1, 33] and on and | go | on forever
Step 3: Identify the strings that have fewer collocates than defined by window:
# Number of collocates on the left:
concord$nc_l <- unlist(lengths(strsplit(concord$pre, " "))); concord$nc_l
[1] 3 3 3 3 3 3 3 # nothing missing here
# Number of collocates on the right:
concord$nc_r <- unlist(lengths(strsplit(concord$post, " "))); concord$nc_r
[1] 3 3 3 3 3 3 2 # last string has only two collocates
Step 4: Add NA to the strings with missing collocates:
# define window:
window <- 3
# change string:
concord$post[!concord$nc_r == window] <- paste(concord$post[!concord$nc_r == window], NA, sep = " ")
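One thing worth knowing about this step: paste() turns NA into the text "NA", so the padded slot holds a string rather than a real missing value. The sketch below uses two made-up context strings (not the concord object) to show this, plus one possible way to convert the slot afterwards. The conversion assumes the text was lower-cased, so a genuine word "NA" cannot occur.

```r
# Two right-hand contexts, made up for illustration; the second is one word short
post <- c("on and on", "on forever")
n_words <- lengths(strsplit(post, " "))

# Pad the short string the same way as in Step 4
post[n_words < 3] <- paste(post[n_words < 3], NA, sep = " ")
post   # "on and on" "on forever NA"

# The last word is the *string* "NA", not a missing value
r3 <- sub(".* ", "", post)
is.na(r3)   # FALSE FALSE

# Optional: turn the placeholder into a real NA
r3[r3 == "NA"] <- NA_character_
is.na(r3)   # FALSE TRUE
```

This also padds only a single slot, so a context that is two or more words short would still need extra handling.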
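Step 5: Fill the dataframe with slots for the collocates and the node, using str_extract from the stringr library together with regexes that use lookarounds to determine the split points between collocates: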
library(stringr)
L3toR3 <- data.frame(
L3 = str_extract(concord$pre, "^\\w+\\b"),
L2 = str_extract(concord$pre, "(?<=\\s)\\w+\\b(?=\\s)"),
L1 = str_extract(concord$pre, "\\w+\\b$"),
Node = concord$keyword,
R1 = str_extract(concord$post, "^\\w+\\b"),
R2 = str_extract(concord$post, "(?<=\\s)\\w+\\b(?=\\s)"),
R3 = str_extract(concord$post, "\\w+\\b$")
)
Result:
L3toR3
L3 L2 L1 Node R1 R2 R3
1 this little sentence went on and went
2 went on and went on it was
3 on it was going on for quite
4 quite a while going on for ages
5 it s still going on and will
6 on and will go on and on
7 and on and go on forever NA
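For comparison, the same L3-to-R3 table can also be built in base R by working on token positions instead of regexes over the context strings. This is only a sketch under my own assumptions: collocate_window() is a hypothetical helper, not part of quanteda or stringr, and it assumes at least one node is found. Slots that fall outside the text come out as real NA values.

```r
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")

# Hypothetical helper (not from the answers above): collocates by token position
collocate_window <- function(text, node_regex, window = 3) {
  # lower-case, split clitics at the apostrophe, then drop remaining punctuation
  clean  <- gsub("[[:punct:]]", "", gsub("'", " ", tolower(text)))
  tokens <- strsplit(trimws(clean), "\\s+")[[1]]
  hits   <- grep(node_regex, tokens)           # token positions of the nodes
  offsets <- c(-(window:1), 1:window)          # e.g. -3 -2 -1 1 2 3
  ctx <- t(vapply(hits, function(i) {
    idx <- i + offsets
    idx[idx < 1 | idx > length(tokens)] <- NA  # outside the text: real NA
    tokens[idx]
  }, character(2 * window)))
  out <- as.data.frame(
    cbind(ctx[, 1:window, drop = FALSE],
          tokens[hits],
          ctx[, (window + 1):(2 * window), drop = FALSE]),
    stringsAsFactors = FALSE
  )
  names(out) <- c(paste0("L", window:1), "Node", paste0("R", 1:window))
  out
}

# The node pattern is anchored so that it has to match a whole token
res <- collocate_window(GO, "^(go(es|ing|ne)?|went)$", window = 3)
res
```

Because every node is looked up by position, overlapping windows are not a problem, and the window size can be changed without rewriting any regex.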