An efficient way to clean data in R

Question

The input is shown below. Rows 3 and 5 have an incorrect format:

sale_date	produst_model	store_code
20210208	ASUS_DE552	AAE_08072
20210305	ASUS_AC693	AAE_08072
20210107	ASUS_DE551	AAR_7461
20210325	ASUS_DB341	CMHT_654
20210227	ASUS_HG0982	BR_981

If this table has 20,000 rows, is there a more efficient way to check that every row matches the rule?

Answer 1

Looking at the posted data, my hunch is that the strings in the three columns were at some point extracted from a composite string such as 20210227_ASUS_HG0982_BR_981, and that the extraction went wrong in some places. If this assumption is correct, I would recommend going back to the original strings and fixing the extraction, for example with the extract function from tidyr:

library(tidyverse)

# `original` is the composite string defined under "Data" below
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
#>   sale_date produst_model store_code
#> 1  20210227   ASUS_HG0982     BR_981

Data:

original = "20210227_ASUS_HG0982_BR_981"

Obviously, the regex here is based on a single string only and will likely have to be adapted as soon as you have more strings.
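To give an idea of how this could scale to many rows, here is a minimal sketch that applies the same pattern to a whole vector at once and flags the strings that fail to match. It uses base R's strcapture() from utils instead of tidyr's extract, purely to avoid package dependencies. The second and third input strings are made up for illustration (the third is deliberately malformed), and anchoring the pattern with ^ and $ is my own addition, so treat this as a starting point and not as the rule from the question.

```r
# Hypothetical composite strings: the first is the example from above,
# the other two are invented for illustration (the last one is malformed).
original <- c("20210227_ASUS_HG0982_BR_981",
              "20210208_ASUS_DE552_AAE_08072",
              "20210305-ASUS_AC693")

# strcapture() is vectorised over `x` and returns one row per input string.
# Strings that do not match the pattern get NA in every column.
parsed <- strcapture(
  pattern = "^(\\d+)_(\\w+\\d+)_(\\w+)$",
  x       = original,
  proto   = data.frame(sale_date     = character(),
                       produst_model = character(),
                       store_code    = character()),
  perl    = TRUE
)

# Row numbers of the strings that could not be parsed
bad_rows <- which(!complete.cases(parsed))

print(parsed)
print(bad_rows)
```

Because the matching is vectorised, 20,000 strings are handled in a single call without an explicit loop, and bad_rows points directly at the entries that need a closer look.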

Solution 1: 0, accepted 2022-09-03 05:52:25
