
Efficient way to clean data in R

The input is:


Rows 3 and 5 have an incorrect format, if I want

sale_date  produst_model  store_code
20210208   ASUS_DE552     AAE_08072
20210305   ASUS_AC693     AAE_08072
20210107   ASUS_DE551     AAR_7461
20210325   ASUS_DB341     CMHT_654
20210227   ASUS_HG0982    BR_981

If this table has 20,000 rows, is there a more efficient way to check that every row matches the rule?

From looking at the posted data, my hunch is that the strings in the three columns were at some point extracted from a composite string such as 20210227_ASUS_HG0982_BR_981, but the extraction seems to have gone wrong in some places. If this assumption is correct, I would recommend going back to the original strings and fixing the extraction, for example with tidyr's extract function:

library(tidyverse)

# Split the composite string into three columns, one per regex capture group
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
#   sale_date produst_model store_code
# 1  20210227   ASUS_HG0982     BR_981

Data:

original = "20210227_ASUS_HG0982_BR_981"

Obviously, the regex here is based only on a single string and will likely have to be adapted as soon as you have more strings.
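If going back to the original strings is not possible, the existing columns can also be checked in one pass, since grepl is vectorized and should handle 20,000 rows without an explicit loop. The following is only a sketch in base R. The question does not state the actual rule, so the three patterns in `rules` are hypothetical placeholders (an 8-digit date, a model of the form ASUS_ plus two capital letters and three digits, a store code of three or four capital letters, an underscore, and digits) and will not necessarily flag the same rows the asker had in mind.

```r
# Sample rows copied from the question
df <- data.frame(
  sale_date     = c("20210208", "20210305", "20210107", "20210325", "20210227"),
  produst_model = c("ASUS_DE552", "ASUS_AC693", "ASUS_DE551", "ASUS_DB341", "ASUS_HG0982"),
  store_code    = c("AAE_08072", "AAE_08072", "AAR_7461", "CMHT_654", "BR_981"),
  stringsAsFactors = FALSE
)

# Hypothetical rules, one regex per column (assumptions, replace with the real ones)
rules <- c(
  sale_date     = "^[0-9]{8}$",
  produst_model = "^ASUS_[A-Z]{2}[0-9]{3}$",
  store_code    = "^[A-Z]{3,4}_[0-9]+$"
)

# grepl() checks a whole column per call; the result is a logical matrix
# with one row per data row and one column per rule
ok <- sapply(names(rules), function(col) grepl(rules[[col]], df[[col]]))

# Row numbers that violate at least one rule
bad_rows <- which(rowSums(!ok) > 0)
bad_rows
```

Inspecting `ok[bad_rows, , drop = FALSE]` shows which column caused each failure, which may help when adapting the patterns to the real data.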
