Hello, I have a data set that consists of text, whole numbers, and decimal numbers. Each entry is a paragraph of text containing a mix of all of these. I am trying to extract only the whole numbers and decimal numbers from the text. There are about 30k row entries.
input format of data:
Output:
13 135.67
66.8 6879
3475.5 879 369426
I tried replacing all the letters one by one, but 26 + 26 replace-all calls make the code lengthy, and replacing "." also removes the "." from the numbers. Thanks, Praveen
You can try:
library(stringr)
lapply(str_extract_all(a, "[0-9.]+"), function(x) as.numeric(x)[!is.na(as.numeric(x))])
[[1]]
[1] 13.00 135.67
[[2]]
[1] 66.8 6879.0
[[3]]
[1] 3475.5 879.0 369426.0
The basic idea is from here, but we also include the "." in the character class. The lapply converts the matches to numeric and excludes the NAs.
The data:
a <- c("This. Is a good 13 part. of 135.67 code",
"how to strip 66.8 in the content 6879",
"get the numbers 3475.5 from. The data. 879 in this 369426")
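To make the NA filtering step above more concrete, here is a minimal base R sketch, not the answerer's own code. It assumes the same kind of input as the sample data; the names tokens and nums are hypothetical. It shows that the pattern "[0-9.]+" also picks up the lone "." after words such as "This." and that the numeric conversion is what discards them.

```r
# Sample input copied from the data in this answer (first two entries only)
a <- c("This. Is a good 13 part. of 135.67 code",
       "how to strip 66.8 in the content 6879")

# Base R counterpart of str_extract_all(): every run of digits and dots
tokens <- regmatches(a, gregexpr("[0-9.]+", a))

# Inspect the raw matches: the sentence-ending dots are matched too
print(tokens[[1]])

# Convert once, silence the coercion warning, and drop the NA entries
nums <- lapply(tokens, function(x) {
  v <- suppressWarnings(as.numeric(x))
  v[!is.na(v)]
})
print(nums)
```

Converting once inside the function avoids calling as.numeric twice per element, which may matter slightly on 30k rows.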
Don't forget that R already has built-in regex functions:
input <- c('This. Is a good 13 part. of 135.67 code', 'how to strip 66.8 in the content 6879',
'get the numbers 3475.5 from. The data. 879 in this 369426')
m <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', input)
(output <- lapply(regmatches(input, m), as.numeric))
This yields
[[1]]
[1] 13.00 135.67
[[2]]
[1] 66.8 6879.0
[[3]]
[1] 3475.5 879.0 369426.0
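If the goal is the space-separated text shown in the question rather than a list of numeric vectors, the matches can be pasted back together per row. The following is a sketch under that assumption; the name out is hypothetical, and perl = TRUE is added here only to make the regex dialect explicit.

```r
input <- c("This. Is a good 13 part. of 135.67 code",
           "how to strip 66.8 in the content 6879",
           "get the numbers 3475.5 from. The data. 879 in this 369426")

# Whole numbers, optionally followed by a decimal part
m <- gregexpr("\\b\\d+(?:\\.\\d+)?\\b", input, perl = TRUE)

# Keep the matches as text and join them with a space, one string per row;
# a row without any number becomes an empty string
out <- vapply(regmatches(input, m), paste, character(1), collapse = " ")
print(out)
```

Keeping the matches as character strings preserves the numbers exactly as they were written in the text.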
An option using strsplit to split the text into separate lines, then gsub to remove [[:alpha:]] characters followed by "." or just [[:alpha:]] characters.
text <- "1. This. Is a good 13 part. of 135.67 code
2. how to strip 66.8 in the content 6879
3. get the numbers 3475.5 from. The data. 879 in this 369426"
lines <- strsplit(text, split = "\n")[[1]]
gsub("[[:alpha:]]+\\.|[[:alpha:]]+\\s*","",lines)
#[1] "1. 13 135.67 "
#[2] "2. 66.8 6879"
#[3] "3. 3475.5 879 369426"
Another method with gsub:
string = c('This. Is a good 13 part. of 135.67 code',
'how to strip 66.8 in the content 6879',
'get the numbers 3475.5 from. The data. 879 in this 369426')
trimws(gsub('[\\p{L}\\.\\s](?!\\d)+', '', string, perl = TRUE))
# [1] "13 135.67" "66.8 6879" "3475.5 879 369426"
A solution free of regex and external packages:
sapply(
strsplit(input, " "),
function(x) {
x <- suppressWarnings(as.numeric(x))
paste(x[!is.na(x)], collapse = " ")
}
)
[1] "13 135.67" "66.8 6879" "3475.5 879 369426"
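One caveat with splitting on spaces, illustrated with a made-up sentence that is not from the question: a number with punctuation attached to it does not convert and is silently dropped. The names sentence and res are hypothetical.

```r
# Hypothetical input where one number is directly followed by a comma
sentence <- "costs 13, then 7.5 total"

res <- sapply(
  strsplit(sentence, " "),
  function(x) {
    # "13," is not a valid number, so it becomes NA and is filtered out
    x <- suppressWarnings(as.numeric(x))
    paste(x[!is.na(x)], collapse = " ")
  }
)
print(res)
```

If the real data contains numbers followed by commas or sentence-ending periods, a regex-based approach is likely the safer choice.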