Extracting Proper Lake Names from Text in R

I am trying to extract the names of lakes from some text that I have in R. The lake names are proper nouns (capitalized), but extracting them requires capturing a few words on either side of the word "Lake".

I tried a few things, but nothing works quite the way I want it to. In some cases, a sentence or the article may begin with "Lake", so there is no text before it. In some cases, the proper name may be three words ("Lake St. Clair" or "Red Hawk Lake").

Example code to work with:

text <- paste("Lake Erie is located on the border of the United States and Canada.",
          "It is located nearby to Lake St. Clair and Lake Michigan.",
          "All three lakes have a history of high levels of Phosphorus.",
          "One lake that has not yet been impacted is Lake Ontario.")

This was maybe the closest I got, adapted from another Stack Overflow answer, but it still isn't working:

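context <- function(text) {
  # Split on single spaces (fixed = TRUE) and keep the resulting word vector
  splittedText <- strsplit(text, " ", fixed = TRUE)[[1]]
  print(splittedText)
  # Pair each word with the word immediately before and after it
  data.frame(before = head(c("", splittedText), -1),
             words  = splittedText,
             after  = tail(c(splittedText, ""), -1))
}

info <- context(text)
print(subset(info, words == "Lake"))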

I would like to get either: 1) the proper lake names extracted ("Lake Erie", "Lake St. Clair", etc.) OR 2) a data frame with the words before and after "Lake". Ideally the first, but I'm flexible at this point.

before <- c("","nearby to", "Clair and","impacted is")
Lake <- c("Lake","Lake","Lake","Lake")
after <- c("Erie is","St. Clair", "Michigan ","Ontario ")
output <- data.frame(cbind(before,Lake,after)); print(output)

Thanks in advance for the help!

You need to define some rules to extract words based on the data you have. Here I get the first word after the word "Lake".

stringr::str_extract_all(text, "Lake \\w+")[[1]]
#[1] "Lake Erie"     "Lake St"       "Lake Michigan" "Lake Ontario" 

Or similarly in base R:

regmatches(text, gregexpr("Lake \\w+", text))[[1]]

For the given text this almost works, except for "Lake St. Clair", where it misses the "Clair" part. To handle this we could define another rule: if there is a dot after the word that follows "Lake", extract two words. But this would fail for "Lake Michigan" and "Lake Ontario", since each of them is followed by a full stop that ends the sentence.
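One way around the ambiguity described above is to treat a dot as part of the name only when it follows a known abbreviation. The following is a minimal sketch of that idea in base R, not a general solution: the `abbrev` whitelist ("St", "Mt", "Ft") is an assumption made for illustration and would need to be extended for real data, and the pattern only captures a single capitalized word after the optional abbreviation.

```r
# Sample text from the question
text <- paste("Lake Erie is located on the border of the United States and Canada.",
              "It is located nearby to Lake St. Clair and Lake Michigan.",
              "All three lakes have a history of high levels of Phosphorus.",
              "One lake that has not yet been impacted is Lake Ontario.")

# Hypothetical whitelist of abbreviations that may follow "Lake" (an assumption)
abbrev <- c("St", "Mt", "Ft")

# "Lake", then an optional whitelisted abbreviation with its dot,
# then exactly one capitalized word
pattern <- paste0("Lake\\s+(?:(?:", paste(abbrev, collapse = "|"),
                  ")\\.\\s+)?[A-Z][a-z]+")

# perl = TRUE so the non-capturing groups are handled by PCRE
lakes <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
print(lakes)
#[1] "Lake Erie"      "Lake St. Clair" "Lake Michigan"  "Lake Ontario"
```

Because the dot is only consumed after a whitelisted abbreviation, the sentence-ending full stops after "Michigan" and "Ontario" are left out of the match.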

With stringi, we can use:

library(stringi)
stri_extract_all_regex(text, "Lake\\s+\\w+")[[1]]
#[1] "Lake Erie"     "Lake St"       "Lake Michigan" "Lake Ontario" 

Or using str_match_all:

library(stringr)
str_match_all(text, "Lake\\s+\\w+")[[1]][,1]
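The patterns above only cover names where "Lake" comes first. The question also mentions names such as "Red Hawk Lake", where the proper name precedes the word. A rough sketch of one way to cover both forms in base R follows; the sample sentence `sample_text` and the abbreviation list are made up for illustration, and the pattern will also pick up any capitalized word that happens to sit directly before "Lake" (for example at the start of a sentence), so treat it as a starting point rather than a finished rule.

```r
# Hypothetical sample sentence containing both name forms (not from the question)
sample_text <- paste("Red Hawk Lake drains toward Lake St. Clair,",
                     "while Lake Erie lies further east.")

# Alternative 1: one or more capitalized words followed by "Lake"
# Alternative 2: "Lake", an optional abbreviation (assumed list), one capitalized word
pattern_both <- paste0("(?:[A-Z][a-z]+\\s+)+Lake\\b",
                       "|",
                       "Lake\\s+(?:(?:St|Mt|Ft)\\.\\s+)?[A-Z][a-z]+")

lakes_both <- regmatches(sample_text,
                         gregexpr(pattern_both, sample_text, perl = TRUE))[[1]]
print(lakes_both)
#[1] "Red Hawk Lake"  "Lake St. Clair" "Lake Erie"
```

The prefix alternative is listed first so that "Red Hawk Lake" is taken as a whole before the engine considers "Lake" followed by the next word.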
