
Strip out numbers from text: R

Hello, I have a data set that consists of text, whole numbers, and decimal numbers. Each entry is a paragraph containing a mix of all three, and I am trying to extract only the whole numbers and decimal numbers from the text. There are about 30k row entries.

Input format of the data:

  1. This. Is a good 13 part. of 135.67 code
  2. how to strip 66.8 in the content 6879
  3. get the numbers 3475.5 from. The data. 879 in this 369426

Output:

  1. 13 135.67
  2. 66.8 6879
  3. 3475.5 879 369426

I tried replacing all the letters one by one, but 26 + 26 replace-all calls make the code lengthy, and replacing "." also removes the "." from the numbers. Thanks, Praveen

You can try:

library(stringr)
lapply(str_extract_all(a, "[0-9.]+"), function(x) as.numeric(x)[!is.na(as.numeric(x))])
[[1]]
[1]  13.00 135.67

[[2]]
[1]   66.8 6879.0

[[3]]
[1]   3475.5    879.0 369426.0

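The basic idea is from here, but the character class also includes the "." so that decimal numbers are kept whole. The lapply call converts the matches to numeric and drops the NAs, which come from tokens that are not valid numbers (such as a lone ".").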
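To see why the NA filter matters, here is a minimal base R sketch of what the "[0-9.]+" pattern actually matches. It uses regmatches and gregexpr as a rough stand-in for str_extract_all, and the names raw and nums are only illustrative. Sentence-ending dots are picked up as tokens of their own and turn into NA on conversion.

```r
a <- c("This. Is a good 13 part. of 135.67 code",
       "how to strip 66.8 in the content 6879",
       "get the numbers 3475.5 from. The data. 879 in this 369426")

# Base R stand-in for str_extract_all(a, "[0-9.]+")
raw <- regmatches(a, gregexpr("[0-9.]+", a))
raw[[1]]   # contains the lone "." left over from "This." and "part."

# Convert to numeric once and drop the tokens that are not valid numbers
nums <- lapply(raw, function(x) {
  x <- suppressWarnings(as.numeric(x))
  x[!is.na(x)]
})
nums
```

Converting once inside the function and silencing the coercion warning keeps the console clean when this runs over many rows.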

The data:

a <- c("This. Is a good 13 part. of 135.67 code",
       "how to strip 66.8 in the content 6879",
       "get the numbers 3475.5 from. The data. 879 in this 369426")

Don't forget that R already has built-in regex functions:

input <- c('This. Is a good 13 part. of 135.67 code', 'how to strip 66.8 in the content 6879',
           'get the numbers 3475.5 from. The data. 879 in this 369426')

m <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', input)
(output <- lapply(regmatches(input, m), as.numeric))

This yields

[[1]]
[1]  13.00 135.67

[[2]]
[1]   66.8 6879.0

[[3]]
[1]   3475.5    879.0 369426.0
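If the goal is the space-separated string shown in the question rather than a list of numeric vectors, the same pattern can be applied to a data frame column and the matches pasted back together. This is only a sketch: the data frame df and its columns text and numbers are hypothetical stand-ins for the real 30k-row data, and perl = TRUE is passed here as a precaution so that the pattern is interpreted by PCRE.

```r
# Hypothetical data frame standing in for the real data set
df <- data.frame(
  text = c("This. Is a good 13 part. of 135.67 code",
           "how to strip 66.8 in the content 6879",
           "get the numbers 3475.5 from. The data. 879 in this 369426"),
  stringsAsFactors = FALSE
)

# Match whole numbers and decimals in every row
m <- gregexpr("\\b\\d+(?:\\.\\d+)?\\b", df$text, perl = TRUE)

# Keep the matches as text and join them with single spaces, one string per row
df$numbers <- vapply(regmatches(df$text, m), paste, character(1), collapse = " ")
df$numbers
```

Keeping the matches as character strings preserves the numbers exactly as they were written in the text. A row without any number yields an empty string.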

An option using strsplit to split the text into separate lines, then gsub to remove either a run of [[:alpha:]] characters followed by a "." or a run of [[:alpha:]] characters followed by optional whitespace:

text <- "1. This. Is a good 13 part. of 135.67 code
2. how to strip 66.8 in the content 6879
3. get the numbers 3475.5 from. The data. 879 in this 369426"

lines <- strsplit(text, split = "\n")[[1]]
gsub("[[:alpha:]]+\\.|[[:alpha:]]+\\s*","",lines)
#[1] "1.  13  135.67 "       
#[2] "2. 66.8 6879"          
#[3] "3. 3475.5   879 369426"
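The result above still carries the "1.", "2.", "3." list markers and some doubled spaces. Assuming those markers are not wanted in the output, one possible cleanup is sketched below. The names body, stripped and cleaned are illustrative only.

```r
lines <- c("1. This. Is a good 13 part. of 135.67 code",
           "2. how to strip 66.8 in the content 6879",
           "3. get the numbers 3475.5 from. The data. 879 in this 369426")

# Drop the leading "1. ", "2. ", ... list marker first
body <- sub("^[0-9]+\\.\\s*", "", lines)

# Remove words (with an optional trailing dot), as in the gsub call above
stripped <- gsub("[[:alpha:]]+\\.|[[:alpha:]]+\\s*", "", body)

# Collapse runs of whitespace into one space and trim both ends
cleaned <- trimws(gsub("\\s+", " ", stripped))
cleaned
```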

Another method with gsub:

string = c('This. Is a good 13 part. of 135.67 code', 
           'how to strip 66.8 in the content 6879',
           'get the numbers 3475.5 from. The data. 879 in this 369426')

trimws(gsub('[\\p{L}\\.\\s](?!\\d)+', '', string, perl = TRUE))
# [1] "13 135.67"         "66.8 6879"         "3475.5 879 369426"

A solution free of regex and external packages:

sapply(
  strsplit(input, " "),
  function(x) {
    x <- suppressWarnings(as.numeric(x))
    paste(x[!is.na(x)], collapse = " ")
  }
)
[1] "13 135.67"         "66.8 6879"         "3475.5 879 369426"
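One trade-off of splitting on spaces is that a number only survives if the whole token converts cleanly, so numbers with punctuation attached are dropped. The sketch below uses a made-up sentence (not from the question's data) to compare this with a pattern-based extraction. The names x, by_split and by_regex are illustrative.

```r
# Made-up example with punctuation attached to some numbers
x <- "costs 42, or 3.5 (approx. 7) units"

# Token approach: "42," and "7)" are not valid numbers and become NA
tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
by_split <- suppressWarnings(as.numeric(tokens))
by_split <- by_split[!is.na(by_split)]

# Pattern approach: picks the digits out regardless of surrounding punctuation
by_regex <- as.numeric(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))[[1]])

by_split
by_regex
```

For data like the question's sample, where numbers are always surrounded by spaces, both approaches give the same result.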
