Capturing text with a pattern in R - regular expressions

Question

I am trying to extract the required words through pattern matching.

Below is sample data from the table object:

+-----------+-------------------------------------------------------------------------------------------------+
| Unique_Id |                                               Text                                              |
+-----------+-------------------------------------------------------------------------------------------------+
| Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 |
+-----------+-------------------------------------------------------------------------------------------------+

Using the code below

regmatches(table[1,2],gregexpr("2000-\\d{4}",table[1,2]))

I am able to extract

[[1]]
[1] "2000-0511" "2000-0511"

But the output I am looking for is as follows:

+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Unique_Id |                                    Text                                   |  Column1  |  Column2  |
+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Ax23z12   | Tool generated code 2015-8134 upon further validation, the tool confirmed | 2015-8134 | 2015-8134 |
|           |   the code as 2015-8134                                                   |           |           |
+-----------+---------------------------------------------------------------------------+-----------+-----------+

The data in the Text column contains this number multiple times (up to 7 times), so I am looking for a dynamic solution.

Many thanks.

Answer 1

Here is one approach for you. I used the following sample data, called foo.

#     id                                                                     text
#  <int>                                                                    <chr>
#1     1                Here is my code, 2015-8134. Here is your code, 2015-1111.
#2     2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666

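I first extracted the codes from text with stri_extract_all_regex(). With simplify = TRUE this returns a character matrix, so I converted it to a data frame. I then used bind_cols() to join it to the original data set. The final step was to fix the column names: gsub() replaces the X in each name with Column.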

library(dplyr)
library(stringi)

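# Extract every code; simplify = TRUE returns a character matrix padded with ""
out <- stri_extract_all_regex(str = foo$text, pattern = "\\d+-\\d+", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%  # columns are named X1, X2, ...
  bind_cols(foo, .)                         # original columns first, then the matches

# Rename X1, X2, ... to Column1, Column2, ...
names(out) <- names(out) %>%
  gsub(pattern = "X", replacement = "Column")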

#     id                                                                     text   Column1   Column2   Column3
#  <int>                                                                    <chr>     <chr>     <chr>     <chr>
#1     1                Here is my code, 2015-8134. Here is your code, 2015-1111. 2015-8134 2015-1111          
#2     2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666 2016-8888 2016-7777 2016-6666

Data

foo <- structure(list(id = 1:2, text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.", 
"His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666"
)), .Names = c("id", "text"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L))
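
For comparison, the same steps can probably be done in base R alone, building on the regmatches() call from the question. This is only a sketch: foo is rebuilt here as a plain data frame, and the names hits, width and mat are my own. It assumes at least one row contains a match.

```r
# Sample data mirroring foo above, as a plain data frame (no tibble needed).
foo <- data.frame(
  id = 1:2,
  text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
           "His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666"),
  stringsAsFactors = FALSE
)

# One character vector of matches per row.
hits <- regmatches(foo$text, gregexpr("\\d+-\\d+", foo$text))

# Indexing past the end of a vector yields NA, so this pads every
# vector to the length of the longest one before stacking them by row.
width <- max(lengths(hits))
mat <- do.call(rbind, lapply(hits, function(x) x[seq_len(width)]))
colnames(mat) <- paste0("Column", seq_len(width))

# Bind the match columns onto the original data.
out <- cbind(foo, mat, stringsAsFactors = FALSE)
```

Unlike the stringi version, rows with fewer matches are filled with NA rather than an empty string.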

Answer 2

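Using stringr and data.table:

1) Use str_match_all to extract all matches of the pattern;

2) Use transpose to turn the extracted matches into columns;

3) Build a new data frame by combining the extracted columns with the original columns;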

library(stringr)
library(data.table)

lst = transpose(str_match_all(df$Text, "2015-\\d{4}"))
data.frame(df, setNames(lst, paste0("Column", seq_along(lst))))
#  Unique_Id                                                                                            Text   Column1   Column2
#1   Ax23z12 Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 2015-8134 2015-8134
#2   By56m22                                           Tool generated code 2015-8134 upon further validation 2015-8134      <NA>
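
The answer does not show how df was built, so here is a self-contained variant for experimenting. The df below is a hypothetical reconstruction inferred from the printed output. Instead of transpose, this sketch uses the simplify = TRUE argument of str_extract_all, which is a different route to the same columns; the names m and res are my own.

```r
library(stringr)

# Hypothetical reconstruction of df, inferred from the output printed above.
df <- data.frame(
  Unique_Id = c("Ax23z12", "By56m22"),
  Text = c("Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134",
           "Tool generated code 2015-8134 upon further validation"),
  stringsAsFactors = FALSE
)

# simplify = TRUE returns a character matrix, one row per input string;
# rows with fewer matches are padded with "".
m <- str_extract_all(df$Text, "2015-\\d{4}", simplify = TRUE)
m[m == ""] <- NA                                   # mark missing matches as NA
colnames(m) <- paste0("Column", seq_len(ncol(m)))

# Combine the original columns with the extracted ones.
res <- data.frame(df, m, stringsAsFactors = FALSE)
```

The number of columns follows the row with the most matches, so the same code should cope with up to 7 codes per row without changes.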

Answer 3

Something like this might work for you

df[apply(df, 1, function(x) any(grepl("2000-\\d{4}", x))), ]

See this reproducible example

iris[apply(iris, 1, function(x) any(grepl("set", x))), ]

#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1           5.1         3.5          1.4         0.2  setosa
# 2           4.9         3.0          1.4         0.2  setosa
# 3           4.7         3.2          1.3         0.2  setosa
# 4           4.6         3.1          1.5         0.2  setosa
# 5           5.0         3.6          1.4         0.2  setosa
# 6           5.4         3.9          1.7         0.4  setosa
# etc
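
Note that this keeps whole rows that contain a match; it does not pull the codes out into new columns. A small sketch on hypothetical data (the two rows below are made up for illustration, and keep is my own name) shows the effect:

```r
# Hypothetical data: only the first row contains a code starting with 2000-.
df <- data.frame(
  Unique_Id = c("Ax23z12", "By56m22"),
  Text = c("Tool generated code 2000-0511", "No code here"),
  stringsAsFactors = FALSE
)

# apply() passes each row as a character vector;
# a row is kept when any of its cells matches the pattern.
keep <- apply(df, 1, function(x) any(grepl("2000-\\d{4}", x)))
df[keep, ]
```

Only the Ax23z12 row is returned here, with its Text column unchanged.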

Answer 1
3 2017-09-17 15:03:03

Answer 2
2 accepted 2017-09-17 14:26:49

Answer 3
0 2017-09-17 14:13:29
