
Regex in R: how to fill dataframe with multiple matches to left and right of target string

(This is a follow-up to "Regex in R: match collocates of node word".)

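I would like to extract the word combinations (collocates) to the left and right of a target word (the node) and store the three elements in a dataframe.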

Data

GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")

Aim

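The target word is the verb GO in any of its possible realizations, be it "go", "going", "goes", "gone" or "went". I am interested in extracting the 3 words to the left of GO and the 3 words to the right of GO. The three words may cross sentence boundaries, but the extracted strings should not contain punctuation.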

What I've tried so far

To extract the left-hand collocates I used str_extract_all from stringr:

unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and"         " It was"              "s still"             
[5] " And will"            " and"

This captures most but not all of the matches, and it includes leading whitespace. By contrast, extraction of the node looks fine:

unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went"  "went"  "going" "Going" "going" "go"    "go"

To extract the right-hand collocates:

unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went"  " on"           " on for quite" " on for ages"  " on"           " on and on"   
[7] " on forever"

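Again the matches are incomplete and include unwanted whitespace. Finally, assembling all the matches in a dataframe throws an error, because the left-hand extraction returns 6 matches while the other two return 7: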

collocates <- data.frame(
  Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
  Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
  Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),  : 
      arguments imply differing number of rows: 6, 7

Expected output

Left                    Node    Right
This little sentence    went    on and went
went on and             went    on It was
on It was              going    on for quite
quite a while          Going    on for ages
ages It’s still        going    on And will
on And will               go    on and on
and on and                go    on forever

Does anyone know how to fix this? Suggestions are much appreciated.

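If you use quanteda, you can get the following result. When working with text you usually want lowercase, so I converted the capital letters with tolower(). I also removed . and , with gsub(). Then I applied kwic() to the text. If you don't mind losing capital letters, dots and commas, you get almost exactly what you want.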

library(quanteda)
library(dplyr)
library(splitstackshape)

myvec <- c("go", "going", "goes", "gone", "went")

mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")

mydf <- kwic(x = mytext, pattern = myvec, window = 3) %>% 
        as_tibble %>%
        select(pre, keyword, post) %>% 
        cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>% 
        select(contains("pre"), keyword, contains("post"))

   pre_1  pre_2    pre_3 keyword post_1  post_2 post_3
1:  this little sentence    went     on     and   went
2:  went     on      and    went     on      it    was
3:    on     it      was   going     on     for  quite
4: quite      a    while   going     on     for   ages
5:  ages   it's    still   going     on     and   will
6:    on    and     will      go     on     and     on
7:   and     on      and      go     on forever   <NA>

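A little late, but not too late for posterity or for contemporaries doing collocation research on unannotated text: here is my own answer to my question. Full credit goes to @jazzurro, whose answer pointed me to quanteda.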

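My question was: how can I compute the collocates of a given node in a text and store the results in a dataframe? (Storing the results in a dataframe is the part not addressed by @jazzurro.)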

Data

GO <- c("This little sentence went on and went on. It was going on for quite a while. 
    Going on for ages. It's still going on. And will go on and on, and go on forever.")

Step 1: Prepare the data for analysis

go <- gsub("[.!?;,:]", "", tolower(GO))   # get rid of punctuation
go <- gsub("'", " ", tolower(go))         # separate clitics from host

Step 2: Extract the KWIC (keyword in context) lines using a regex pattern and the argument valuetype = "regex"

concord <- kwic(go, "go(es|ing|ne)?|went", window = 3, valuetype = "regex")
concord                                                       
  [text1, 4] this little sentence | went  | on and went 
  [text1, 7]          went on and | went  | on it was   
 [text1, 11]            on it was | going | on for quite
 [text1, 17]        quite a while | going | on for ages 
 [text1, 24]           it s still | going | on and will 
 [text1, 28]          on and will |  go   | on and on   
 [text1, 33]           and on and |  go   | on forever  
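One caveat about the pattern in Step 2, shown here as a small base-R sketch rather than with quanteda itself. It assumes that a regex pattern is tested against each token without implicit anchoring, which is how I understand valuetype = "regex" to behave. Under that assumption, an unanchored pattern also accepts tokens that merely contain "go". The token vector below is made up for illustration.

```r
# Hypothetical tokens: two real forms of GO and three look-alikes.
tokens <- c("going", "went", "ago", "good", "gospel")

# Unanchored: TRUE whenever the pattern occurs anywhere inside the token.
loose <- grepl("go(es|ing|ne)?|went", tokens)

# Anchored: the whole token must be a form of GO.
strict <- grepl("^(go(es|ing|ne)?|went)$", tokens)

data.frame(tokens, loose, strict)
```

The sample text contains no such look-alikes, so the output above is unaffected. On a larger corpus, anchoring the pattern with ^ and $ is the safer choice.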

Step 3: Identify the strings that have fewer collocates than defined by window:

# Number of collocates on the left:
concord$nc_l <- unlist(lengths(strsplit(concord$pre, " "))); concord$nc_l
[1] 3 3 3 3 3 3 3   # nothing missing here
# Number of collocates on the right:
concord$nc_r <- unlist(lengths(strsplit(concord$post, " "))); concord$nc_r
[1] 3 3 3 3 3 3 2   # last string has only two collocates

Step 4: Add NA to the strings that lack collocates:

# define window:
window <- 3
# change string:
concord$post[!concord$nc_r == window] <- paste(concord$post[!concord$nc_r == window], NA, sep = " ")
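Note that paste() turns NA into the literal string "NA", and that the line above pads exactly one slot, so it fits this example but not a context that is two or more words short. A more general sketch in base R follows. pad_context is a hypothetical helper name (not part of quanteda or stringr) and the sample vectors are made up.

```r
# Hypothetical helper: split each context string on spaces and force it to
# exactly `window` slots, filling the gaps with real missing values.
pad_context <- function(x, window = 3, side = c("right", "left")) {
  side <- match.arg(side)
  rows <- lapply(strsplit(x, " ", fixed = TRUE), function(tok) {
    # Keep the words closest to the node if there are more than `window`.
    tok <- if (side == "right") head(tok, window) else tail(tok, window)
    filler <- rep(NA_character_, window - length(tok))
    # Right context is padded at the end, left context at the start.
    if (side == "right") c(tok, filler) else c(filler, tok)
  })
  do.call(rbind, rows)   # one row per context, one column per slot
}

post <- c("on and went", "on forever", "on")
pad_context(post, window = 3)                  # right-hand contexts
pad_context("it was", window = 3, side = "left")
```

The result is a character matrix with one column per slot, so each column can be assigned to a dataframe column directly.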

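Step 5: Populate the dataframe with slots for the collocates and the node, using str_extract from the stringr library together with regexes that use lookarounds to determine where the collocates split: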

library(stringr)
L3toR3 <- data.frame(
  L3 = str_extract(concord$pre, "^\\w+\\b"),
  L2 = str_extract(concord$pre, "(?<=\\s)\\w+\\b(?=\\s)"),
  L1 = str_extract(concord$pre, "\\w+\\b$"),
  Node = concord$keyword,
  R1 = str_extract(concord$post, "^\\w+\\b"),
  R2 = str_extract(concord$post, "(?<=\\s)\\w+\\b(?=\\s)"),
  R3 = str_extract(concord$post, "\\w+\\b$")
)

Result

L3toR3
     L3     L2       L1  Node R1      R2    R3
1  this little sentence  went on     and  went
2  went     on      and  went on      it   was
3    on     it      was going on     for quite
4 quite      a    while going on     for  ages
5    it      s    still going on     and  will
6    on    and     will    go on     and    on
7   and     on      and    go on forever    NA
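For comparison, the three-column layout asked for in the question can also be produced without quanteda by tokenizing first and then indexing around each node. This is only a minimal base-R sketch, not a replacement for kwic(). The helper name context and the choice to strip punctuation only from token edges (so that "It's" stays one word and capitalization is preserved) are assumptions of this sketch.

```r
GO <- "This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever."

# Split on whitespace, then strip punctuation from the edges of each token.
words <- strsplit(GO, "\\s+")[[1]]
words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
words <- words[nzchar(words)]

# Positions of the node: the whole token must be a form of GO.
node_at <- grep("^(go(es|ing|ne)?|went)$", words, ignore.case = TRUE)

# Hypothetical helper: join the tokens at the given offsets from position i,
# dropping offsets that fall outside the text.
context <- function(i, offsets) {
  idx <- i + offsets
  idx <- idx[idx >= 1 & idx <= length(words)]
  paste(words[idx], collapse = " ")
}

window <- 3
collocates <- data.frame(
  Left  = vapply(node_at, context, character(1), offsets = -window:-1),
  Node  = words[node_at],
  Right = vapply(node_at, context, character(1), offsets = 1:window),
  stringsAsFactors = FALSE
)
collocates
```

Because every node position yields exactly one Left and one Right value, the three columns always have the same length, which avoids the "differing number of rows" error from the original attempt.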

