Efficiency way to clean data in R

Question

Input is

the row 3 and row 5 had incorrtct format, if I want

sale_date	produst_model	store_code
20210208	ASUS_DE552	AAE_08072
20210305	ASUS_AC693	AAE_08072
20210107	ASUS_DE551	AAR_7461
20210325	ASUS_DB341	CMHT_654
20210227	ASUS_HG0982	BR_981

If this table have 20,000 rows, Do I have more efficiency way to check every row is match rule?

Answer 1

From looking at the data posted my hunch is that the strings in the three columns were at some point extracted from a composite string such as 20210227_ASUS_HG0982_BR_981 but the extraction seems to have gone wrong in some places. If this assumption is correct then I would recommend going back to the original strings and fixing the extraction, for example like this using the extract function:

library(tidyverse)
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
  sale_date produst_model store_code
1  20210227   ASUS_HG0982     BR_981

Data:

original = "20210227_ASUS_HG0982_BR_981"

Obviously, the regex here is based only on a single string and will likely have to be adapted as soon as you have more strings.

Efficiency way to clean data in R

Question

1 answers

solution1
0 ACCPTED 2022-09-03 05:52:25

Efficiency way to clean data in R

Question

1 answers

solution1 0 ACCPTED 2022-09-03 05:52:25

solution1
0 ACCPTED 2022-09-03 05:52:25