Hope everyone is well. In my dataset there is a column containing free text. My goal is to remove all dates, in any format, from the text. This is a snapshot of the data:
library(dplyr)
df <- data.frame(
  text = c('tommorow is 2022 11 03', "I married on 2020-01-01",
           'why not going there on 2023/01/14', '2023 08 01 will be great'))
df %>% select(text)
text
1 tommorow is 2022 11 03
2 I married on 2020-01-01
3 why not going there on 2023/01/14
4 2023 08 01 will be great
The outcome should look like
text
1 tommorow is
2 I married on
3 why not going there on
4 will be great
Thank you!
The best approach is perhaps to use a sensitive regex pattern:
df <- data.frame(
  text = c('tommorow is 2022 11 03', "I married on 2020-01-01",
           'why not going there on 2023/01/14', '2023 08 01 will be great'))
library(tidyverse)
df |>
  mutate(left_text = str_trim(str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}")))
#>                                text              left_text
#> 1            tommorow is 2022 11 03            tommorow is
#> 2           I married on 2020-01-01           I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4          2023 08 01 will be great          will be great
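Note that str_remove() drops only the first match in each string. If a row can hold several dates, removing every match might look like the rough base R sketch below. The example string x is made up for illustration; stringr's str_remove_all() and str_squish() should do the same job inside the mutate() call.

```r
# Same pattern as in the answer above
pattern <- "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}"

# Hypothetical row containing two dates
x <- "from 2020-01-01 to 2023/01/14 inclusive"

# sub() removes only the first match, like str_remove()
first_only <- sub(pattern, "", x, perl = TRUE)

# gsub() removes every match, like str_remove_all()
all_dates <- gsub(pattern, "", x, perl = TRUE)

# Collapse the leftover runs of whitespace and trim the ends
squish <- function(s) trimws(gsub("\\s+", " ", s))

squish(first_only)
#> [1] "from to 2023/01/14 inclusive"
squish(all_dates)
#> [1] "from to inclusive"
```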
This will match dates by:
\\d{1,4} = starting with either month (1-2 numeric characters), day (1-2 characters) or year (2-4 characters); followed by
\\D = anything that's not a number, i.e. the separator; followed by
\\d{1,2} = day or month (1-2 characters); followed by
\\D again; ending with
\\d{1,4} = day or year (1-2 or 2-4 characters)

The challenge is balancing sensitivity with specificity. The pattern leaves isolated numbers alone, although any three short digit groups separated by single non-digit characters (a version number like 1.2.3, for example) will also match. It might also miss some formats, such as dates written with month names. But hopefully it should catch every sensible numeric date in your text column!
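The sensitivity/specificity trade-off can be checked on a few strings. In this base R sketch, the last two test strings and the stricter pattern are assumptions added for illustration, not part of the original answer. The stricter pattern only accepts a four-digit year first and requires the same separator twice, so it would miss day-first or two-digit-year dates.

```r
# Pattern from the answer above: three digit groups, any single separator
loose <- "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}"

# Hypothetical stricter alternative: year first, and the second separator
# must equal the first one (backreference \\1)
strict <- "\\b\\d{4}([-/ ])\\d{1,2}\\1\\d{1,2}\\b"

x <- c("tommorow is 2022 11 03",
       "I married on 2020-01-01",
       "call 555 12 3456 now",     # not a date, but three digit groups
       "version 1.2.3 released")   # not a date either

loose_hits  <- grepl(loose,  x, perl = TRUE)
strict_hits <- grepl(strict, x, perl = TRUE)

loose_hits
#> [1] TRUE TRUE TRUE TRUE
strict_hits
#> [1]  TRUE  TRUE FALSE FALSE
```

Which pattern is preferable depends on the data: the loose one removes more false positives' worth of text, the strict one leaves more real dates behind.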
To also catch dates written with month names or abbreviations, chain further patterns built from month.name and month.abb:
library(tidyverse)
df <- data.frame(
  text = c(
    'tommorow is 2022 11 03',
    "I married on 2020-01-01",
    'why not going there on 2023/01/14',
    '2023 08 01 will be great',
    'A trickier example: January 05,2020',
    'or try Oct 2010',
    'dec 21/22 is another date'
  )
)
df |>
  mutate(left_text =
           # purely numeric dates
           str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}") |>
           # full month names, optional day, then day or year
           str_remove(regex(paste0("(", paste(month.name, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE)) |>
           # abbreviated month names, optional day, then day or year
           str_remove(regex(paste0("(", paste(month.abb, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE)) |>
           str_trim())
#>                                  text              left_text
#> 1              tommorow is 2022 11 03            tommorow is
#> 2             I married on 2020-01-01           I married on
#> 3   why not going there on 2023/01/14 why not going there on
#> 4            2023 08 01 will be great          will be great
#> 5 A trickier example: January 05,2020    A trickier example:
#> 6                     or try Oct 2010                 or try
#> 7           dec 21/22 is another date        is another date
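One caveat with the month patterns: because they are case-insensitive and unanchored, an abbreviation such as "dec" can match inside an ordinary word, and \\D+ then runs on to the next number. The base R sketch below shows this on a made-up sentence and tries word boundaries (\\b) as one possible mitigation. The bounded pattern and the second test string are assumptions for illustration, and a sentence like "in May we met 3 times" would still match.

```r
# Full names first, then abbreviations (month.name and month.abb are built in)
months <- paste(c(month.name, month.abb), collapse = "|")

# Pattern shape from the answer above
loose <- paste0("(", months, ")(\\D+\\d{1,2})?\\D+\\d{1,4}")

# Hypothetical variant: the month must be a whole word
bounded <- paste0("\\b(", months, ")\\b(\\D+\\d{1,2})?\\D+\\d{1,4}")

x <- c("or try Oct 2010", "I will decide by 5pm")

loose_out   <- trimws(sub(loose,   "", x, ignore.case = TRUE, perl = TRUE))
bounded_out <- trimws(sub(bounded, "", x, ignore.case = TRUE, perl = TRUE))

loose_out
#> [1] "or try"    "I will pm"
bounded_out
#> [1] "or try"               "I will decide by 5pm"
```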
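To replace the dates rather than remove them, swap str_remove for str_replace with a placeholder, then replace the placeholder with the value you want:
df |>
  mutate(left_text =
           str_replace(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}", "REP_DATE") |>
           str_replace(regex(paste0("(", paste(month.name, collapse = "|"),
                                    ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                             ignore_case = TRUE), "REP_DATE") |>
           str_replace(regex(paste0("(", paste(month.abb, collapse = "|"),
                                    ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                             ignore_case = TRUE), "REP_DATE") |>
           # swap the placeholder for the final replacement text
           str_replace("REP_DATE", "25th October 2022"))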