I have 2 problems in extracting and transforming data using R. Here's the dataset:
messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test
The first problem is extracting the 8-digit numbers from msg. I tried:
as.matrix(unlist(apply(df[2],1,function(x){regmatches(x,gregexpr('([0-9]){8}', x))})))
However, with this line of code, message 444... is included because it contains a number with more than 8 digits.
The desired output is:
message_id | customer_ID
1111111111 | 18271801
2222222222 | 12901991
2222222222 | 91222911
I don't know how to efficiently transform the data. The output of dput(df):
structure(list(
  id = c(1111111111, 2222222222, 3333333333, 4444444444),
  msg = c("hey id 18271801, fix it asap",
          "please fix it soon id12901991 and 91222911. dissapointed",
          "wow $300 expensive man, come on",
          "number 2837169119 test")),
  .Names = c("id", "msg"), row.names = c(NA, 4L), class = "data.frame")
Thanks
Use rebus to create your regular expression, and stringr to extract the matches.
You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.
library(rebus)
library(stringr)
# Create regex
rx <- negative_lookbehind(DGT) %R%
  dgt(8) %R%
  negative_lookahead(DGT)
rx
## <regex> (?<!\d)[\d]{8}(?!\d)
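# Extract the IDs. The pattern is passed directly: perl() is deprecated in
# recent versions of stringr, whose regex engine (ICU) supports lookarounds.
extracted_ids <- str_extract_all(df$msg, rx)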
# Stuff the IDs into a data frame.
data.frame(
  messageID = rep(
    df$id,
    vapply(extracted_ids, length, integer(1))
  ),
  extractedID = unlist(extracted_ids, use.names = FALSE)
)
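If you would rather not add packages, the same lookaround idea can probably be done in base R, staying close to the regmatches/gregexpr approach from the question. This is only a sketch: the pattern is written out by hand to mirror what rebus prints above, the data frame is rebuilt from the dput() output, and the names ids and result are just illustrative.

```r
# Same data as in the question's dput() output
df <- data.frame(
  id  = c(1111111111, 2222222222, 3333333333, 4444444444),
  msg = c("hey id 18271801, fix it asap",
          "please fix it soon id12901991 and 91222911. dissapointed",
          "wow $300 expensive man, come on",
          "number 2837169119 test"),
  stringsAsFactors = FALSE
)

# Exactly 8 digits, neither preceded nor followed by another digit
rx <- "(?<!\\d)\\d{8}(?!\\d)"

# perl = TRUE is required: the default regex engine does not support lookarounds
ids <- regmatches(df$msg, gregexpr(rx, df$msg, perl = TRUE))

# One row per extracted ID; messages with no match contribute zero rows
result <- data.frame(
  messageID   = rep(df$id, lengths(ids)),
  extractedID = unlist(ids, use.names = FALSE),
  stringsAsFactors = FALSE
)
result
```

Because the original pattern ([0-9]){8} has no lookarounds, it matches any 8 consecutive digits inside a longer number, which is why message 444... showed up. As with the rebus version, you may need to adapt the pattern to your real data.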