How to extract certain items from unstructured text?

Question

I have an extremely unstructured data frame (df) in R, which includes a text column.

An example of df$text looks like this:

John Smith 3.8 GPA johnsmith@gmail.com, https://link.com

I am trying to extract the GPA out of the field and save it to a new column called df$GPA, but I am unable to get it to work.

I have tried:

df$gpa <- sub('[0-9].[0-9] GPA',"\\1", df$text)

But that returns the whole block of text.

I am also trying to extract the URL but am unsure how to do that as well. Does anybody have any suggestions?

Answer 1

Here's a solution using a positive lookahead, (?=\\s*GPA), and str_extract from the stringr package. Putting the whitespace inside the lookahead keeps it out of the match, so the result is "3.8" rather than "3.8 ":

library(stringr)
df$GPA <- str_extract(df$text, "\\d+\\.\\d+(?=\\s*GPA)")

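A sub solution with a backreference would be this. The lazy .*? and the trailing GPA stop the leading wildcard from swallowing digits of a value such as 13.8; note that sub returns the input unchanged when nothing matches:

df$GPA <- sub(".*?(\\d+\\.\\d+)\\s*GPA.*", "\\1", df$text, perl = TRUE)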

Result:

df
                                                      text GPA
1 John Smith 3.8 GPA johnsmith@gmail.com, https://link.com 3.8

Data:

df <- data.frame(text = "John Smith 3.8 GPA johnsmith@gmail.com, https://link.com")

Answer 2

We can use a regex lookaround to extract the numeric part:

library(stringr)
df$GPA <- str_extract(df$text, "[0-9.]+(?=\\s*GPA)")
df$GPA
#[1] "3.8"

Or in base R with regmatches/regexpr:

regmatches(df$text, regexpr("[0-9.]+(?=\\s*GPA)", df$text, perl = TRUE))

Data:

df <- data.frame(text = 'John Smith 3.8 GPA johnsmith@gmail.com, https://link.com', stringsAsFactors = FALSE)
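
The question also asks about the URL, which the snippets above do not cover. The following is a minimal base R sketch, not a definitive implementation: extract_first is a hypothetical helper name, the second row of text is made up to show the no-match case, and the URL pattern (https?:// followed by anything up to whitespace or a comma) is an assumption that fits the sample line but will not cover every valid URL.

# Hypothetical helper: first match of `pattern` in each element of `x`,
# or NA where nothing matches (regmatches() alone drops those elements).
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[!is.na(m) & m != -1] <- regmatches(x, m)
  out
}

df <- data.frame(
  text = c("John Smith 3.8 GPA johnsmith@gmail.com, https://link.com",
           "no grade or link here"),  # made-up row without a GPA or URL
  stringsAsFactors = FALSE
)

# Decimal number directly followed by optional whitespace and "GPA"
df$GPA <- as.numeric(extract_first(df$text, "[0-9]+\\.[0-9]+(?=\\s*GPA)"))

# Assumed URL pattern: scheme, then everything up to whitespace or a comma
df$url <- extract_first(df$text, "https?://[^\\s,]+")

Returning NA for rows without a match keeps the result the same length as df$text, so it can be assigned to a column directly; the bare regmatches call would return a shorter vector in that case.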

2 solutions

Solution 1
2 2020-05-20 20:56:57

Solution 2
0 2020-05-20 20:56:55
