Extract the first word after a specific expression in R

Question

I have a column that contains thousands of descriptions like this (example):

Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA

I'd like to create a column with the first word after "city of", like this:

Description	City
Building a hospital in the city of LA, USA	LA
Building a school in the city of NYC, USA	NYC
Building shops in the city of Chicago, USA	Chicago

I tried the following code after seeing the topic "Extracting string after specific word", but my column is filled only with missing values:

library(stringr)

df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))

df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))

I took a look at the dput() output, and it is the same as the descriptions I see in the data frame directly.

Answer 1

Solution

This should do the trick for the data you showed:

df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")

df
#>                                  Description    city
#> 1 Building a hospital in the city of LA, USA      LA
#> 2  Building a school in the city of NYC, USA     NYC
#> 3 Building shops in the city of Chicago, USA Chicago
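
As a rough illustration of why the attempt in the question returned only missing values: its lookbehind requires a literal colon after "city of", and the sample descriptions do not contain one. The sketch below assumes a single sample string; the variable names are made up for the example. Note also that str_extract() already returns a character vector, so wrapping it in data.frame() before assigning to df$city should not be needed.

```r
library(stringr)

# One description taken from the question's sample data
desc <- "Building a hospital in the city of LA, USA"

# Pattern from the question: the lookbehind demands "city of:" (with a colon),
# which never occurs in the text, so nothing matches
colon_attempt <- str_extract(desc, "(?<=city of:\\s)[^;]+")

# Same idea without the colon, capturing only word characters
fixed <- str_extract(desc, "(?<=city of\\s)\\w+")

print(colon_attempt)
print(fixed)
```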

Alternative

However, if you want the whole string up to the comma (for example, for cities with a space in the name), you can go with:

df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")

Check out the following example:

df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
                                 "Building a school in the city of NYC, USA",
                                 "Building shops in the city of Chicago, USA",
                                 "Building a church in the city of Salt Lake City, USA"))

str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA"      "NYC"     "Chicago" "Salt"   

str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA"             "NYC"            "Chicago"        "Salt Lake City"
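
One caveat worth sketching: `.+` is greedy, so when a description has more than one comma after the city name, the pattern above runs up to the last comma rather than the first. A negated character class such as `[^,]+` is one possible way to stop at the first comma. The second description below is a hypothetical row, not part of the original data.

```r
library(stringr)

# The second string is a made-up example with two commas after the city name
desc <- c("Building a church in the city of Salt Lake City, USA",
          "Building a park in the city of Portland, Oregon, USA")

# Greedy: takes everything up to the LAST comma in the string
greedy <- str_extract(desc, "(?<=the city of )(.+)(?=,)")

# Negated class: takes everything up to the FIRST comma
up_to_first_comma <- str_extract(desc, "(?<=the city of )[^,]+")

print(greedy)
print(up_to_first_comma)
```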

Documentation

Check out ?regex:

Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
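
The lookarounds described in ?regex are available in base R when perl = TRUE is set, so the same extraction can be sketched without stringr. This is only an illustration using the sample descriptions; the variable names are assumptions. Unlike str_extract(), regmatches() drops elements that do not match instead of returning NA, so it is not a drop-in replacement when some rows lack "city of".

```r
# Sample descriptions from the question
desc <- c("Building a hospital in the city of LA, USA",
          "Building a school in the city of NYC, USA",
          "Building shops in the city of Chicago, USA")

# regexpr() finds the first match per element; perl = TRUE enables lookbehind
m <- regexpr("(?<=city of )\\w+", desc, perl = TRUE)

# regmatches() returns only the matched substrings (non-matches are dropped)
city <- regmatches(desc, m)

print(city)
```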
