Extracting numeric data from sentences in R?

Question

I am working (in R with openNLP) on extracting numeric data from a given statement.

The statement is "The room temperature is 37 to 39 C. The Air flow is near 80 cfm".

The expected output here is "Temperature > 37 - 39c", "Air flow -> 80cfm".

Can you suggest a regex pattern over the POS tags that captures a noun (NN) and the next available cardinal number (CD)?

Is there any alternative method for extracting similar data?

Answer 1

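Extracting data from natural text is hard! I expect this solution to break quickly, but it is a way to get started. You did not provide the full tagged sentence, so I inserted my own tags; you may need to change them to match your tagset. Also, this code is neither efficient nor vectorized, and it only works on a single string.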

library(stringr)

text <- "The_DT room_NN temperature_NN is_VBZ 37_CD to_PRP 39_CD C_NNU. The_DT Air_NN flow_NN is_VBZ near_ADV 80_CD cfm_NNU"

# find the positions where a Number appears; it may be followed by prepositions, units and other numbers
matches <- gregexpr("(\\w+_CD)+(\\s+\\w+_(NNU|PRP|CD))*", text, perl=TRUE)

mapply(function(position, length) {
  # extract all NN sequences
  nouns <- text %>% str_sub(start = 1, end = position) %>% 
      str_extract_all("\\w+_NN(\\s+\\w+_NN)*")
  # get Numbers (str_sub's end is inclusive, so stop at position + length - 1)
  nums <- text %>% str_sub(start = position, end = position + length - 1)
  # format output string, using the noun sequence closest to the number
  result <- paste(tail(nouns[[1]], n=1), nums, sep = " > ")
  # clean tags
  gsub("_\\w+", "", result)
}, matches[[1]], attr(matches[[1]], "match.length"))
# output: [1] "room temperature > 37 to 39 C" "Air flow > 80 cfm"
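
Since the code above handles only one string, here is a minimal base-R sketch of how the same idea might be applied to a whole vector of tagged strings. The function name `extract_measurements` is hypothetical, the tags (`NN`, `CD`, `NNU`, `PRP`) are the hand-inserted ones from this answer, and the `\\b` after `_NN` is an added assumption so that unit tags such as `C_NNU` are not picked up as nouns.

```r
# Hypothetical helper (not part of the original answer): applies the
# noun/number pairing to every element of a character vector.
extract_measurements <- function(tagged) {
  num_pattern  <- "\\w+_CD(\\s+\\w+_(NNU|PRP|CD))*"   # number, then units/numbers
  noun_pattern <- "\\w+_NN\\b(\\s+\\w+_NN\\b)*"       # one or more consecutive nouns
  lapply(tagged, function(txt) {
    m <- gregexpr(num_pattern, txt, perl = TRUE)
    starts <- as.integer(m[[1]])
    if (starts[1] == -1L) return(character(0))        # no number in this string
    nums <- regmatches(txt, m)[[1]]
    # for each number, take the last noun sequence that appears before it
    labels <- vapply(starts, function(pos) {
      before <- substr(txt, 1, pos - 1)
      nouns <- regmatches(before, gregexpr(noun_pattern, before, perl = TRUE))[[1]]
      if (length(nouns) == 0) NA_character_ else nouns[length(nouns)]
    }, character(1))
    # join label and number span, then strip the _TAG suffixes
    gsub("_\\w+", "", paste(labels, nums, sep = " > "))
  })
}

text <- "The_DT room_NN temperature_NN is_VBZ 37_CD to_PRP 39_CD C_NNU. The_DT Air_NN flow_NN is_VBZ near_ADV 80_CD cfm_NNU"
res <- extract_measurements(c(text, "No_DT numbers_NNS here_RB"))
```

The result is a list with one character vector per input string. A string without any `_CD` token yields an empty vector, and a number with no preceding noun is labelled `NA`.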

Answer 2

Maybe you can start with the approach below. Hope this helps!

library(NLP)
library(openNLP)
library(dplyr)

s <- "The room temperature is 37 to 39 C. The Air flow is near 80 cfm"
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
a3 <- annotate(s, pos_tag_annotator, a2)
#distribution of POS tags for word tokens
a3w <- subset(a3, type == "word")

#select consecutive NN & CD POS
a3w_temp <- a3w[sapply(a3w$features, function(x) x$POS == "NN" | x$POS == "CD")]
a3w_temp_df <- as.data.frame(a3w_temp)
#add lead 'features' to dataframe and select rows having (NN, CD) or (NN, CD, CD) sequence
a3w_temp_df$ahead_features = lead(a3w_temp_df$features,1)
a3w_temp_df$features_comb <- paste(a3w_temp_df$features,a3w_temp_df$ahead_features)
l <- row.names(subset(a3w_temp_df, features_comb == "list(POS = \"NN\") list(POS = \"CD\")" |
         features_comb == "list(POS = \"CD\") list(POS = \"CD\")"))
l_final <- sort(unique(c(as.numeric(l), as.numeric(l) +1)))
a3w_df <- a3w_temp_df[l_final,]

#also include POS which is immediately after CD
idx <- a3w_df[a3w_df$features=="list(POS = \"CD\")","id"]+1
idx <- sort(c(idx,a3w_df$id))
op = paste(strsplit(s, split = " ")[[1]][idx -1], collapse = " ")
op

Output:

[1] "temperature 37 to 39 C. flow 80 cfm"
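
The output above is one flat string, whereas the question asks for one entry per measurement. As a rough sketch of how that could be done, the hypothetical function `pair_nouns_with_numbers` below takes parallel vectors of tokens and POS tags and pairs each run of `NN` tokens with the number span that follows it. The token and tag vectors here are written by hand in Penn Treebank style as an assumption; the tags that openNLP actually assigns to this sentence may differ, and treating the single token after the numbers as the unit is a simplification.

```r
# Hypothetical helper (not part of the original answer): pairs each run of
# NN tokens with the next span of CD tokens plus one trailing "unit" token.
pair_nouns_with_numbers <- function(tokens, tags) {
  stopifnot(length(tokens) == length(tags))
  out <- character(0)
  last_noun <- NA_character_
  n <- length(tokens)
  i <- 1L
  while (i <= n) {
    if (tags[i] == "NN") {
      # collect consecutive nouns, e.g. "room temperature"
      j <- i
      while (j < n && tags[j + 1L] == "NN") j <- j + 1L
      last_noun <- paste(tokens[i:j], collapse = " ")
      i <- j + 1L
    } else if (tags[i] == "CD") {
      # extend over further numbers and the range word "to" (tagged TO)
      j <- i
      while (j < n && tags[j + 1L] %in% c("CD", "TO")) j <- j + 1L
      # assumption: the next token, if any, is the unit
      if (j < n) j <- j + 1L
      value <- paste(tokens[i:j], collapse = " ")
      out <- c(out, paste(last_noun, value, sep = " -> "))
      i <- j + 1L
    } else {
      i <- i + 1L
    }
  }
  out
}

# Hand-written tokens and tags (assumed, not actual openNLP output)
tokens <- c("The", "room", "temperature", "is", "37", "to", "39", "C", ".",
            "The", "Air", "flow", "is", "near", "80", "cfm")
tags   <- c("DT", "NN", "NN", "VBZ", "CD", "TO", "CD", "NN", ".",
            "DT", "NN", "NN", "VBZ", "IN", "CD", "NN")
pairs <- pair_nouns_with_numbers(tokens, tags)
```

With these inputs, `pairs` holds one string per measurement. In practice the two vectors would come from the word tokens and POS features in `a3w`, so the exact grouping depends on how the tagger labels words such as "Air" and "C".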

Answer 1
0 2017-08-30 11:44:51

Answer 2
0 2017-08-30 20:02:30
