Extracting numbers from a string in R based on specific criteria

I'm trying to extract some numbers from a string column (comments) based on specific criteria. The numbers I want directly follow a time in 24-hour format, always contain a decimal place, and are less than 20 (there are other numbers in the string, but I'm not interested in those). I've managed to extract the numbers I want with the R code below, but I have no way of relating them back to the IDs they came from. Some IDs have multiple numbers of interest, while others have only one. For example, I need some way to associate the ID in the dummy data below with every number of interest. As you can see, ID 1 contains three results of interest (4.1, 6.9 and 4.3), while ID 2 has only one (6.5).

Any help would be fantastic!

(An example of the format of comment.txt)

    ID  comments
    1   abc1200 4.1  abc1100 6.9 etd1130 4.3 69.0
    2   abc0900 6.5 abcde 15
    3   3.2 0850 9.5 abc 8.2 0930 12.2 agft 75.0
    4   ashdfalsk 0950 10.5 dvvxcvszv asdasd assdas d 75.0
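To make the criteria concrete, here is a minimal sketch in base R that applies them to a single comment string (the one for ID 1). The names `time_pat` and `pattern` are hypothetical, and the pattern assumes a time is exactly four digits between 0000 and 2359 followed by at least one blank.

```r
# One comment string from the dummy data (ID 1)
comment <- "abc1200 4.1  abc1100 6.9 etd1130 4.3 69.0"

# Assumed pattern: a 24-hour time (0000-2359), blanks, then a decimal number
time_pat <- "([01][0-9]|2[0-3])[0-5][0-9]"
pattern  <- paste0(time_pat, "[[:blank:]]+[0-9]+\\.[0-9]+")

# All "time + number" pairs found in the string
pairs <- regmatches(comment, gregexpr(pattern, comment))[[1]]

# Drop the leading time, convert to numeric, keep values below 20
values <- as.numeric(sub(paste0("^", time_pat, "[[:blank:]]+"), "", pairs))
values <- values[values < 20]
print(values)
```

The trailing 69.0 is skipped because it follows 4.3 rather than a time, so it never reaches the "less than 20" filter.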


#rm(list=ls(all=TRUE))

# Import the text and pull out a list of all numbers contained within the free text
raw_text <- read.delim("comment.txt", stringsAsFactors = FALSE)
# The decimal point is escaped so that it matches a literal period only;
# the optional group keeps whole numbers (the times) as well as decimals
numbers_from_text <- gregexpr("[0-9]+(\\.[0-9]+)?", raw_text$comments)

numbers_list <- unlist(regmatches(raw_text$comments, numbers_from_text))
numbers_list <- as.data.frame(numbers_list)

# Flag the numbers that contain a decimal place and create a running row count
format <- cbind(numbers_list,
                dem = grepl("\\.", as.character(numbers_list$numbers_list)) * 1,
                row.number = 1:nrow(numbers_list))

# If the number does not contain a decimal (a time), new_row points at the following row;
# otherwise it is NA
test <- cbind(format, new_row = ifelse(format$dem == 0, format$row.number + 1, NA))

# Where new_row equals a row.number, output the corresponding numbers_list entry
match <- test$numbers_list[match(test$new_row, test$row.number)]

# Drop the NAs (no match) and keep only values below 20
match_NA <- subset(match, !is.na(match) & as.numeric(as.character(match)) < 20)

match_NA <- as.data.frame(match_NA)

Something like this seems to work: match the numbers that are preceded by a blank and contain a period, convert them to numeric, and keep the ones that are less than 20.

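library(stringr)
library(purrr)

# `comments` is assumed to be a one-column data frame holding the comment text,
# e.g. comments <- raw_text["comments"]
# For each row, extract every number that is preceded by a blank and has a decimal point
temp <- apply(comments, 1, function(x) {
  str_extract_all(x, "[[:blank:]][0-9]+[.][0-9]")
})

# Flatten to one character vector per row, trim the leading blank, convert to numeric
temp <- lapply(flatten(temp), function(x) as.numeric(str_trim(x)))
# Keep only the values below 20
lapply(temp, function(x) x[x < 20])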

[[1]]
[1] 4.1 6.9 4.3

[[2]]
[1] 6.5

[[3]]
[1]  3.2  9.5  8.2 12.2

[[4]]
[1] 10.5
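
The list above has one element per input row, in row order, so the IDs can be attached by repeating each ID once per extracted value. A minimal sketch, with the values copied from the output above; `ids`, `res` and `long` are hypothetical names, and `ids` is assumed to hold the ID column in the same order as the comments:

```r
# Per-row results, copied from the list output above
res <- list(c(4.1, 6.9, 4.3), 6.5, c(3.2, 9.5, 8.2, 12.2), 10.5)

# Assumed ID column, in the same row order as the comments
ids <- 1:4

# Repeat each ID once per value found in its row, then flatten the values
long <- data.frame(ID = rep(ids, lengths(res)), value = unlist(res))
print(long)
```

Rows with no matches contribute a length of zero, so their IDs simply do not appear in `long`.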
