Is there a way, or an alternative function, to get a vectorized str_detect() in R?

Question

I have a data frame with a column of strings, and I am trying to filter it down to the rows whose string contains a date.

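ID	headline	domain
21	Cool text with a date 20.01.2009	howtomakelessthanminimumwagebybuyingthisbook.com
22	not so cool text without date	lars.com
23	also a cool text but without a date :(	somecryptostuff.com
24	long text with a date like this 3. march 2021	blockchainmasterclassforpeoplewithouttechnicalbackground.com
25	other long text with this kind of date in the text 03/21/99 and other stuff afterwards	someother.url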

I have already written code that does this.

First, I use str_detect() to filter the df for all rows that contain a date.

The code looks like this:

data <- origin %>%
  filter(str_detect(headline, yyyy_mm_dd) |
         str_detect(headline, mm_dd_yyyy) |
         str_detect(headline, mm_dd_yy) |
         str_detect(headline, dd_mm_yyyy) |
         str_detect(headline, dd_mm_yy) |
         str_detect(headline, annoying_dates) |
         str_detect(headline, monthnum_year) |
         str_detect(headline, monthname_year) |
         str_detect(headline, " 20(1|2)\\d\\s"))

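mm_dd_yyyy and the others are variables to which I have assigned regular expressions. They look like the pattern on the last line.

My code works fine, but I use these filter conditions a lot, and repeating the function call over and over is annoying and certainly not good practice.

I tried to come up with a better solution but did not get anywhere. Do you have any ideas? I thought about putting the patterns in a vector that I can loop through, but I don't know whether that is possible with str_detect.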

Answer 1

If you are working in the {tidyverse} family, note that {lubridate} has a very powerful function: parse_date_time(). It conveniently "extracts" dates from arbitrary strings.

Data

library(tibble)
ds <- tibble::tribble(
  ~ID,  ~headline, ~SOURCE, ~domain,
  21L, "Cool text with a date 20.01.2009", 0L, "howtomakelessthanminimumwagebybuyingthisbook.com",
  22L, "not so cool text without date", 0L, "lars.com",
  23L, "also a cool text but without a date :(", 0L, "somecryptostuff.com",
  24L, "long text with a date like this 3. march 2021", 0L, "blockchainmasterclassforpeoplewithouttechnicalbackground.com",
  25L, "other long text with this kind of date in the text 03/21/99 and other sutff afterwards", 0L, "someother.url"
  )

Parsing dates (date-times)

library(dplyr)
library(lubridate)

ds %>% 
  mutate(
    DATE  = lubridate::parse_date_time(headline, orders = c("dmy","mdy"))
  , DATE2 = lubridate::parse_date_time(headline, orders = c("dmy","mdy")) %>%    
                                         as.Date() #if you want a "date" only
  ) %>% 
  select(headline, DATE, DATE2)

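{lubridate} will emit a warning for the headlines that contain no date, saying that it failed to parse them. You can wrap this in a call that handles the NA cases.

This is what you get: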

# A tibble: 5 x 3
  headline                                                                               DATE                DATE2     
  <chr>                                                                                  <dttm>              <date>    
1 Cool text with a date 20.01.2009                                                       2009-01-20 00:00:00 2009-01-20
2 not so cool text without date                                                          NA                  NA        
3 also a cool text but without a date :(                                                 NA                  NA        
4 long text with a date like this 3. march 2021                                          2021-03-03 00:00:00 2021-03-03
5 other long text with this kind of date in the text 03/21/99 and other sutff afterwards 1999-03-21 00:00:00 1999-03-21
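The NA handling mentioned above could look roughly like the following. This is only a sketch: `ds_small` is a hypothetical cut-down copy of the data, and it assumes that `parse_date_time()` picks the dates out of the surrounding text as in the output shown above. The `quiet = TRUE` argument silences the parse-failure warnings, and `filter(!is.na(DATE))` then keeps only the rows for which a date was found.

```r
library(dplyr)
library(lubridate)

# Hypothetical, reduced copy of the example data (headline column only)
ds_small <- tibble::tibble(
  headline = c(
    "Cool text with a date 20.01.2009",
    "not so cool text without date",
    "long text with a date like this 3. march 2021"
  )
)

with_dates <- ds_small %>%
  mutate(
    # quiet = TRUE suppresses the warnings for headlines without a date
    DATE = parse_date_time(headline, orders = c("dmy", "mdy"), quiet = TRUE)
  ) %>%
  # drop every row where no date could be parsed
  filter(!is.na(DATE))

with_dates
```

This replaces the chain of regex checks with a single parse step, at the cost of relying on the parser's own idea of what counts as a date.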

Answer 2

You can either paste all the regular expressions together with | as the separator, or use a looping function:

purrr::reduce(purrr::map(c(regex1, regex2, ..., " 20(1|2)\\d\\s"), ~ str_detect(headline, .x)), `|`)

str_detect(headline, paste(c(regex1, regex2, ..., " 20(1|2)\\d\\s"), collapse = "|"))
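To make the two variants concrete, here is a minimal runnable sketch. The vector `date_patterns` and the helper `detect_any()` are hypothetical names, and the first three regexes are simplified stand-ins rather than the asker's actual patterns, which are not shown in the question. Only the last pattern is taken from the question as written.

```r
library(stringr)
library(purrr)

headline <- c(
  "Cool text with a date 20.01.2009",
  "not so cool text without date",
  "also a cool text but without a date :(",
  "long text with a date like this 3. march 2021",
  "other long text with this kind of date in the text 03/21/99 and other sutff afterwards"
)

# Hypothetical, simplified patterns kept together in one named vector
date_patterns <- c(
  dd_mm_yyyy         = "\\b\\d{1,2}\\.\\d{1,2}\\.\\d{4}\\b",         # 20.01.2009
  mm_dd_yy           = "\\b\\d{1,2}/\\d{1,2}/\\d{2}\\b",             # 03/21/99
  day_monthname_year = "\\b\\d{1,2}\\.\\s*[A-Za-z]+\\s+\\d{4}\\b",   # 3. march 2021
  year_only          = " 20(1|2)\\d\\s"                              # from the question
)

# Variant 1: one str_detect() per pattern, combined element-wise with `|`
hits_reduce <- reduce(map(date_patterns, ~ str_detect(headline, .x)), `|`)

# Variant 2: collapse all patterns into a single alternation
detect_any <- function(string, patterns) {
  str_detect(string, paste(patterns, collapse = "|"))
}
hits_paste <- detect_any(headline, date_patterns)

hits_reduce
# TRUE FALSE FALSE TRUE TRUE
```

Inside a pipeline this would be used as `filter(detect_any(headline, date_patterns))`. The collapsed form makes a single pass per string, while the reduce form keeps the patterns separate, which can help when checking which pattern matched.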
