Removing dates (in any format) from a text column

Hope everyone is well. My dataset has a column of free text, and my goal is to remove all dates, in any format, from that text. Here is a snapshot of the data:

library(dplyr)  # provides %>% and select()

df <- data.frame(
  text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
         'why not going there on 2023/01/14','2023 08 01 will be great'))
df %>% select(text)

                               text
1            tommorow is 2022 11 03
2           I married on 2020-01-01
3 why not going there on 2023/01/14
4          2023 08 01 will be great

The outcome should look like this:

               text
1            tommorow is 
2            I married on 
3            why not going there on 
4            will be great

Thank you!

The best approach is probably a sensitive regex pattern:

df <- data.frame(
  text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
         'why not going there on 2023/01/14','2023 08 01 will be great'))

library(tidyverse)

df |>
  mutate(left_text = str_trim(str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}")))

#>                                text              left_text
#> 1            tommorow is 2022 11 03            tommorow is
#> 2           I married on 2020-01-01           I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4          2023 08 01 will be great          will be great

This will match dates by:

  • \\d{1,4} = the first part: a month or day (1-2 digits) or a year (2-4 digits); followed by
  • \\D = any single non-digit character, i.e. the separator; followed by
  • \\d{1,2} = the day or month (1-2 digits); followed by
  • \\D again; ending with
  • \\d{1,4} = the day (1-2 digits) or year (2-4 digits)
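
To see how those five pieces line up against a real value, here is a small sketch that wraps each piece in a capture group and inspects the result with str_match(). The grouping and the column names (first, sep1, and so on) are added here for illustration only and are not part of the pattern used in the answer.

```r
library(stringr)

# Same pattern as above, with each piece wrapped in a capture group
# (the grouping is an illustrative assumption, not the original pattern).
grouped <- "(\\d{1,4})(\\D)(\\d{1,2})(\\D)(\\d{1,4})"

# str_match() returns a character matrix: the full match, then one column per group.
parts <- str_match("I married on 2020-01-01", grouped)
colnames(parts) <- c("match", "first", "sep1", "middle", "sep2", "last")
parts
```

For this string the full match is 2020-01-01, split into 2020, -, 01, - and 01.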

The challenge is balancing sensitivity with specificity. The pattern should leave most numbers that are clearly not dates alone, but it might miss:

  • dates with no year
  • dates with no separators
  • dates with double spaces between parts

But hopefully it catches every sensible date in your text column!
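
A quick way to check these limits on your own data is to run str_detect() over a few probe strings. The strings below are made up for illustration. The last one is a possible false positive worth knowing about: any three short digit groups separated by single non-digit characters fit the pattern, whether or not they form a date.

```r
library(stringr)

date_pattern <- "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}"

# Made-up probe strings, one per case discussed above
checks <- c(
  "due 2022-11-03",     # ordinary date: matched
  "meet on 03/11",      # no year: missed
  "logged 20221103",    # no separators: missed
  "due 2022  11  03",   # double spaces between parts: missed
  "version 1.2.3"       # not a date, but matched anyway
)

detected <- str_detect(checks, date_pattern)
data.frame(text = checks, matched = detected)
```

If version-like strings occur in your column, consider requiring a four-digit year or restricting the separators to the ones you actually expect.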

Further date detection examples:

library(tidyverse)

df <- data.frame(
  text = c(
    'tommorow is 2022 11 03',
    "I married on 2020-01-01",
    'why not going there on 2023/01/14',
    '2023 08 01 will be great',
    'A trickier example: January 05,2020',
    'or try Oct 2010',
    'dec 21/22 is another date'
  )
)


df |>
  mutate(left_text = str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}") |> 
           str_remove(regex(paste0("(", paste(month.name, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE)) |> 
           str_remove(regex(paste0("(", paste(month.abb, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE)) |> 
           str_trim())

#>                                  text              left_text
#> 1              tommorow is 2022 11 03            tommorow is
#> 2             I married on 2020-01-01           I married on
#> 3   why not going there on 2023/01/14 why not going there on
#> 4            2023 08 01 will be great          will be great
#> 5 A trickier example: January 05,2020    A trickier example:
#> 6                     or try Oct 2010                 or try
#> 7           dec 21/22 is another date        is another date

Edit 2: replacing dates (with temporary placeholders)

df |>
  mutate(left_text = str_replace(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}", "REP_DATE") |> 
           str_replace(regex(paste0("(", paste(month.name, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE), "REP_DATE") |> 
           str_replace(regex(paste0("(", paste(month.abb, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE), "REP_DATE") |> 
           str_replace("REP_DATE", "25th October 2022"))
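
As a possible variation, the three patterns can be combined into one regex and applied with str_replace_all(), since str_replace() and str_remove() only touch the first match in each string. This is a sketch rather than part of the original answer: the helper replace_dates() and the added \\b word boundaries are assumptions introduced here. The boundaries keep a month abbreviation from matching inside a longer word, such as "dec" in "decided".

```r
library(stringr)

# Month names and abbreviations joined into one alternation
months <- paste(c(month.name, month.abb), collapse = "|")

# Numeric dates OR month-name dates; \\b is an addition to the original
# patterns so that "dec" does not match inside "decided".
date_regex <- regex(
  paste0(
    "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}",
    "|\\b(?:", months, ")\\b(?:\\D+\\d{1,2})?\\D+\\d{1,4}"
  ),
  ignore_case = TRUE
)

# Hypothetical helper: replace every date-like match in each string
replace_dates <- function(x, replacement = "REP_DATE") {
  str_replace_all(x, date_regex, replacement)
}

replace_dates(c(
  "from 2020-01-01 to 2020/02/15",   # two dates in one string
  "dec 21/22 is another date",
  "I decided on 5 things"            # left unchanged thanks to \\b
))
```

Words that are also month names can still be caught, for example "may" in "it may take 5 days", so spot-check the output before relying on it.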
