Efficient way to clean data in R
The input is the table below. Rows 3 and 5 have an incorrect format, if I want [...]
| sale_date | produst_model | store_code |
|---|---|---|
| 20210208 | ASUS_DE552 | AAE_08072 |
| 20210305 | ASUS_AC693 | AAE_08072 |
| 20210107 | ASUS_DE551 | AAR_7461 |
| 20210325 | ASUS_DB341 | CMHT_654 |
| 20210227 | ASUS_HG0982 | BR_981 |
If this table has 20,000 rows, is there a more efficient way to check that every row matches the rule?
Looking at the data posted, my hunch is that the strings in the three columns were at some point extracted from a composite string such as `20210227_ASUS_HG0982_BR_981`, but the extraction seems to have gone wrong in some places. If this assumption is correct, I would recommend going back to the original strings and fixing the extraction, for example with the `extract` function:
```r
library(tidyverse)

# `original` is the composite string listed under "Data"
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
```

```
  sale_date produst_model store_code
1  20210227   ASUS_HG0982     BR_981
```
Data:

```r
original = "20210227_ASUS_HG0982_BR_981"
```
Obviously, the regex here is based on a single string only and will likely have to be adapted as soon as you have more strings.
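As for checking many rows at once, here is a minimal sketch in base R, assuming the composite strings are available as a character vector. The three sample strings are made up for illustration (the second is deliberately malformed), and the pattern is simply the one from above with `^` and `$` added so that the whole string has to match. It is not a statement of the actual rule, which the question does not spell out.

```r
# Hypothetical sample of composite strings; the second one is deliberately malformed
original <- c("20210227_ASUS_HG0982_BR_981",
              "2021-02-08 ASUS_DE552",
              "20210305_ASUS_AC693_AAE_08072")

# Same pattern as in the extract() call, anchored to cover the whole string
# (an assumption about the rule, adapt it to the real format)
pattern <- "^(\\d+)_(\\w+\\d+)_(\\w+)$"

# grepl() is vectorised: a single call tests every element, no explicit loop needed
ok <- grepl(pattern, original)

# Positions of the strings that do not match and need a closer look
bad_rows <- which(!ok)
bad_rows
```

Because `grepl()` works on the whole vector in one call, the same code should scale to 20,000 strings without changes; only the pattern would need to be tightened once the real rule is known.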