
Efficient way to clean data in R

The input is:

[image of the input table]

Rows 3 and 5 have an incorrect format. The output I want is:

sale_date produst_model store_code
20210208 ASUS_DE552 AAE_08072
20210305 ASUS_AC693 AAE_08072
20210107 ASUS_DE551 AAR_7461
20210325 ASUS_DB341 CMHT_654
20210227 ASUS_HG0982 BR_981

If this table has 20,000 rows, is there a more efficient way to check that every row matches the rule?

Judging from the data posted, my hunch is that the strings in the three columns were at some point extracted from a composite string such as 20210227_ASUS_HG0982_BR_981, but that the extraction went wrong in some places. If this assumption is correct, I would recommend going back to the original strings and fixing the extraction, for example with the extract function from tidyr:

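library(tidyverse)  # loads tidyr, which provides extract()

# Split the composite string into three columns, one per capture group
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
#>   sale_date produst_model store_code
#> 1  20210227   ASUS_HG0982     BR_981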

Data:

original = "20210227_ASUS_HG0982_BR_981"

Obviously, the regex here is based only on a single string and will likely have to be adapted as soon as you have more strings.
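
As a rough sketch of how the extraction could scale to many strings at once, the snippet below uses base R's strcapture (from utils) instead of tidyr's extract, together with a stricter, anchored pattern. The first three composite strings are rebuilt from the desired output in the question by joining the columns with underscores, the last one is an invented malformed example, and the pattern itself is only an assumption about the format. Strings that do not match come back as NA rows, which makes them easy to locate.

```r
# Hypothetical composite strings: the first three are rebuilt from the desired
# output in the question; the last one is an invented malformed example.
original <- c(
  "20210208_ASUS_DE552_AAE_08072",
  "20210107_ASUS_DE551_AAR_7461",
  "20210227_ASUS_HG0982_BR_981",
  "2021-03-25_ASUS_DB341_CMHT_654"
)

# Assumed format: 8 digits, LETTERS_LETTERSdigits, LETTERS_digits.
pattern <- "^([0-9]{8})_([A-Z]+_[A-Z]+[0-9]+)_([A-Z]+_[0-9]+)$"

# Prototype that names and types the output columns.
proto <- data.frame(sale_date = character(),
                    produst_model = character(),
                    store_code = character(),
                    stringsAsFactors = FALSE)

# strcapture() is vectorised; a string that does not match yields a row of NA.
parsed <- utils::strcapture(pattern, original, proto)

# Positions of the strings that could not be parsed.
bad_rows <- which(is.na(parsed$sale_date))

print(parsed)
print(bad_rows)
```

With these four strings only the last one fails to parse, so bad_rows points at position 4. On real data the pattern would have to be loosened or tightened to whatever the true rule is.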

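If the original strings are no longer available, the three columns can also be checked directly. This is only a sketch: the data frame df and the malformed values in it are invented for illustration, and the three patterns are assumptions inferred from the desired output shown in the question. grepl is vectorised, so a table of 20,000 rows is checked with one call per column rather than a loop over rows.

```r
# Hypothetical data: rows 3 and 4 contain invented formatting errors.
df <- data.frame(
  sale_date     = c("20210208", "20210305", "2021/01/07", "20210325"),
  produst_model = c("ASUS_DE552", "ASUS_AC693", "ASUS_DE551", "ASUS-DB341"),
  store_code    = c("AAE_08072", "AAE_08072", "AAR_7461", "CMHT_654"),
  stringsAsFactors = FALSE
)

# Assumed rule for each column, written as an anchored regular expression.
rules <- c(
  sale_date     = "^[0-9]{8}$",
  produst_model = "^[A-Z]+_[A-Z]+[0-9]+$",
  store_code    = "^[A-Z]+_[0-9]+$"
)

# Logical matrix with one column per rule: TRUE where the value matches.
ok <- sapply(names(rules), function(col) grepl(rules[[col]], df[[col]]))

# Rows in which at least one column breaks its rule.
bad_rows <- which(rowSums(!ok) > 0)

print(df[bad_rows, ])
```

The matrix ok also shows which column is at fault in each flagged row, which helps when deciding how to repair it.
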