Text capture using pattern R - regular expression
I am trying to extract the required words through pattern matching.
Below is sample data from the table object:
+-----------+-------------------------------------------------------------------------------------------------+
| Unique_Id | Text                                                                                            |
+-----------+-------------------------------------------------------------------------------------------------+
| Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 |
+-----------+-------------------------------------------------------------------------------------------------+
Using the code below
regmatches(table[1,2],gregexpr("2000-\\d{4}",table[1,2]))
I am able to extract
[[1]]
[1] "2000-0511" "2000-0511"
But the output I am looking for is as follows:
+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Unique_Id | Text                                                                      | Column1   | Column2   |
+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed | 2015-8134 | 2015-8134 |
|           | the code as 2015-8134                                                     |           |           |
+-----------+---------------------------------------------------------------------------+-----------+-----------+
The data under the Text column contains this number multiple times (up to 7 times), so I am looking for a dynamic solution.
Thanks a lot.
Here is one way to do it. I used the following sample data, called foo.
# id text
# <int> <chr>
#1 1 Here is my code, 2015-8134. Here is your code, 2015-1111.
#2 2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666
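I first extracted the codes from text using stri_extract_all_regex(). With simplify = TRUE this returns a matrix, so I converted it to a data frame. Then I combined it with the original data set using bind_cols(). The last job was to fix the column names: I used gsub() to replace the X in the column names with Column.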
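library(dplyr)
library(stringi)

# Extract every code per row as a character matrix, convert it to a
# data frame, and bind it to the original data.
out <- stri_extract_all_regex(str = foo$text, pattern = "\\d+-\\d+", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  bind_cols(foo, .)

# Rename the default columns X1, X2, ... to Column1, Column2, ...
names(out) <- names(out) %>%
  gsub(pattern = "X", replacement = "Column")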
# id text Column1 Column2 Column3
# <int> <chr> <chr> <chr> <chr>
#1 1 Here is my code, 2015-8134. Here is your code, 2015-1111. 2015-8134 2015-1111
#2 2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666 2016-8888 2016-7777 2016-6666
Data
foo <- structure(list(id = 1:2, text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
"His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666"
)), .Names = c("id", "text"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L))
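Using stringr and data.table:
1) Use str_match_all to extract all matching patterns;
2) Use transpose to convert the extracted patterns into columns;
3) Construct a new data frame by combining the extracted columns with the original columns.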
library(stringr)
library(data.table)
lst = transpose(str_match_all(df$Text, "2015-\\d{4}"))
data.frame(df, setNames(lst, paste0("Column", seq_along(lst))))
# Unique_Id Text Column1 Column2
#1 Ax23z12 Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 2015-8134 2015-8134
#2 By56m22 Tool generated code 2015-8134 upon further validation 2015-8134 <NA>
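For comparison, the same three steps can be sketched in base R alone, building on the regmatches() call from the question. This is only an illustration under assumptions: the data frame df below is a hypothetical reconstruction of the two rows printed above, and the pattern "\\d{4}-\\d{4}" is an assumed generalisation that matches any four-digit year prefix.

```r
# Hypothetical sample data modelled on the two rows printed above.
df <- data.frame(
  Unique_Id = c("Ax23z12", "By56m22"),
  Text = c(
    "Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134",
    "Tool generated code 2015-8134 upon further validation"
  ),
  stringsAsFactors = FALSE
)

# 1) Extract every match per row; the result is a list of character vectors.
#    The pattern is an assumption: four digits, a hyphen, four digits.
matches <- regmatches(df$Text, gregexpr("\\d{4}-\\d{4}", df$Text))

# 2) Pad each vector with NA up to the longest one and stack them as rows.
#    Indexing past the end of a character vector yields NA.
n_max <- max(lengths(matches), 1L)
padded <- do.call(rbind, lapply(matches, function(m) m[seq_len(n_max)]))

# 3) Name the new columns and bind them to the original data.
colnames(padded) <- paste0("Column", seq_len(n_max))
out <- cbind(df, as.data.frame(padded, stringsAsFactors = FALSE))
print(out)
```

Because the number of columns is taken from the longest match list, the sketch should adapt to rows with anywhere from zero to seven codes without changes.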
Something like this may work for you
df[apply(df, 1, function(x) any(grepl("2000-\\d{4}", x))), ]
See this reproducible example
iris[apply(iris, 1, function(x) any(grepl("set", x))), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# etc
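To make the behaviour of this approach concrete, here is a small sketch on a hypothetical two-row data frame (the names df, hits and kept, and the pattern "\\d{4}-\\d{4}", are assumptions for illustration). Note that it selects the rows in which any cell matches the pattern; it does not by itself pull the codes out into separate columns.

```r
# Hypothetical data: only the first row contains a code.
df <- data.frame(
  Unique_Id = c("Ax23z12", "Cz99k01"),
  Text = c("Tool generated code 2015-8134", "No code was generated"),
  stringsAsFactors = FALSE
)

# For each row, check whether any cell matches the (assumed) pattern.
hits <- apply(df, 1, function(x) any(grepl("\\d{4}-\\d{4}", x)))

# Keep only the matching rows; all original columns are retained.
kept <- df[hits, ]
print(kept)
```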