
Extract numerical data from statement in R?

I am working in R with openNLP to extract numerical data from given statements.

The statement is "The room temperature is 37 to 39 C. The Air flow is near 80 cfm".

The expected output is "Temperature -> 37 - 39 C", "Air flow -> 80 cfm".

Can you suggest a regex pattern over the POS tags that captures a noun (NN) and the next available number (CD)?

Is there an alternative approach for extracting this kind of data?

Data extraction from natural text is hard! I expect this solution to break very quickly, but here is a way to get you started. You did not supply the whole tagged sentence, so I inserted my own tags; you may need to change them for your tag set. Also, this code is neither efficient nor vectorized and will only work for a single string.

library(stringr)

text <- "The_DT room_NN temperature_NN is_VBZ 37_CD to_PRP 39_CD C_NNU. The_DT Air_NN flow_NN is_VBZ near_ADV 80_CD cfm_NNU"

# find the positions where a Number appears; it may be followed by prepositions, units and other numbers
matches <- gregexpr("(\\w+_CD)+(\\s+\\w+_(NNU|PRP|CD))*", text, perl=TRUE)

mapply(function(position, length) {
  # extract all NN sequences that precede the number
  # (\\b keeps tags such as NNU from being read as NN)
  nouns <- text %>% str_sub(start = 1, end = position - 1) %>%
      str_extract_all("\\w+_NN\\b(\\s+\\w+_NN\\b)*")
  # get the number phrase; the match ends at position + length - 1
  nums <- text %>% str_sub(start = position, end = position + length - 1)
  # format the output string using the noun sequence closest to the number
  result <- paste(tail(nouns[[1]], n = 1), nums, sep = " > ")
  # strip the POS tags
  gsub("_\\w+", "", result)
}, matches[[1]], attr(matches[[1]], "match.length"))
# output: [1] "room temperature > 37 to 39 C" "Air flow > 80 cfm"
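
Since the code above handles only one string, here is a minimal base-R sketch of how the same idea could be applied to several tagged strings. The function names extract_one and extract_all and the "(no noun)" placeholder are hypothetical, and the tags are the same hand-made ones used above, so treat this as a starting point rather than a robust solution.

```r
# Hypothetical helper: pair each number phrase in ONE tagged string
# with the closest noun sequence that precedes it.
extract_one <- function(tagged) {
  # a number, optionally followed by units, linking words or more numbers
  m <- gregexpr("\\w+_CD(\\s+\\w+_(NNU|PRP|CD))*", tagged, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))   # no number in this string
  lens <- attr(m, "match.length")
  out <- character(length(m))
  for (i in seq_along(m)) {
    # text before the number, searched for runs of NN-tagged words
    before <- substr(tagged, 1, m[i] - 1)
    nouns <- regmatches(
      before,
      gregexpr("\\w+_NN\\b(\\s+\\w+_NN\\b)*", before, perl = TRUE)
    )[[1]]
    noun <- if (length(nouns) > 0) nouns[length(nouns)] else "(no noun)"
    # the match covers characters m[i] .. m[i] + lens[i] - 1
    nums <- substr(tagged, m[i], m[i] + lens[i] - 1)
    # join both parts and strip the POS tags
    out[i] <- gsub("_\\w+", "", paste(noun, nums, sep = " > "))
  }
  out
}

# Apply the helper to every element of a character vector.
extract_all <- function(x) lapply(x, extract_one)

tagged <- c(
  "The_DT room_NN temperature_NN is_VBZ 37_CD to_PRP 39_CD C_NNU. The_DT Air_NN flow_NN is_VBZ near_ADV 80_CD cfm_NNU",
  "Nothing_NN to_PRP see_VB here_ADV"
)
res <- extract_all(tagged)
print(res)
```

The result is a list with one character vector per input string, so strings without any number simply yield an empty vector.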

You could start with the approach below, which tags the text with openNLP and then keeps NN tokens that are followed by CD tokens. Hope this helps!

library(NLP)
library(openNLP)
library(dplyr)

s <- "The room temperature is 37 to 39 C. The Air flow is near 80 cfm"
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
a3 <- annotate(s, pos_tag_annotator, a2)
# keep only the word-level annotations, which carry the POS tags
a3w <- subset(a3, type == "word")

# keep only the tokens tagged NN or CD
a3w_temp <- a3w[sapply(a3w$features, function(x) x$POS == "NN" | x$POS == "CD")]
a3w_temp_df <- as.data.frame(a3w_temp)
# add the next row's 'features' as a lead column, then select rows that start an (NN, CD) or (CD, CD) pair
a3w_temp_df$ahead_features = lead(a3w_temp_df$features,1)
a3w_temp_df$features_comb <- paste(a3w_temp_df$features,a3w_temp_df$ahead_features)
l <- row.names(subset(a3w_temp_df, features_comb == "list(POS = \"NN\") list(POS = \"CD\")" |
         features_comb == "list(POS = \"CD\") list(POS = \"CD\")"))
l_final <- sort(unique(c(as.numeric(l), as.numeric(l) +1)))
a3w_df <- a3w_temp_df[l_final,]

# also include the token immediately after each CD (for example the unit)
idx <- a3w_df[a3w_df$features=="list(POS = \"CD\")","id"]+1
idx <- sort(c(idx,a3w_df$id))
op = paste(strsplit(s, split = " ")[[1]][idx -1], collapse = " ")
op

Output is:

[1] "temperature 37 to 39 C. flow 80 cfm"
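
As an alternative to comparing deparsed feature strings, the same pairing can be sketched directly on parallel vectors of tokens and tags. The vectors below are written by hand in Penn Treebank style purely for illustration; the tokens and tags that openNLP actually produces for this text may differ, and pair_nouns_with_numbers is a hypothetical name.

```r
# Hand-written example input (an assumption, not real openNLP output)
tokens <- c("The", "room", "temperature", "is", "37", "to", "39", "C", ".",
            "The", "Air", "flow", "is", "near", "80", "cfm")
tags   <- c("DT", "NN", "NN", "VBZ", "CD", "TO", "CD", "NN", ".",
            "DT", "NN", "NN", "VBZ", "IN", "CD", "NN")

pair_nouns_with_numbers <- function(tokens, tags) {
  n <- length(tokens)
  out <- character(0)
  i <- 1
  while (i <= n) {
    if (tags[i] != "CD") { i <- i + 1; next }
    # extend the number phrase over "CD CD" and "CD TO CD" patterns
    end <- i
    repeat {
      if (end < n && tags[end + 1] == "CD") {
        end <- end + 1
      } else if (end + 1 < n && tags[end + 1] == "TO" && tags[end + 2] == "CD") {
        end <- end + 2
      } else {
        break
      }
    }
    # treat a noun directly after the number as its unit
    if (end < n && startsWith(tags[end + 1], "NN")) end <- end + 1
    # walk back to the nearest noun, stopping at a sentence boundary
    j <- i - 1
    while (j >= 1 && !startsWith(tags[j], "NN") && tags[j] != ".") j <- j - 1
    # extend that noun to the left over the whole noun run
    k <- j
    while (k > 1 && startsWith(tags[k - 1], "NN")) k <- k - 1
    noun <- if (j >= 1 && tags[j] != ".") {
      paste(tokens[k:j], collapse = " ")
    } else {
      "(no noun)"
    }
    out <- c(out, paste(noun, paste(tokens[i:end], collapse = " "), sep = " > "))
    i <- end + 1
  }
  out
}

pairs <- pair_nouns_with_numbers(tokens, tags)
print(pairs)
```

Working on vectors keeps multi-word nouns such as "room temperature" together and avoids depending on how a feature list happens to print.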


 