Extracting first word after a specific expression in R
I have a column that contains thousands of descriptions like this (example):
Description |
---|
Building a hospital in the city of LA, USA |
Building a school in the city of NYC, USA |
Building shops in the city of Chicago, USA |
I'd like to create a column with the first word after "city of", like this:
Description | City |
---|---|
Building a hospital in the city of LA, USA | LA |
Building a school in the city of NYC, USA | NYC |
Building shops in the city of Chicago, USA | Chicago |
I tried the following code after seeing the topic "Extracting string after specific word", but my column is filled only with missing values:
library(stringr)
df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))
df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))
I took a look at the dput() output, and it is the same as the descriptions I see in the data frame directly.
This should do the trick for the data you showed:
df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")
df
#> Description city
#> 1 Building a hospital in the city of LA, USA LA
#> 2 Building a school in the city of NYC, USA NYC
#> 3 Building shops in the city of Chicago, USA Chicago
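As for why the original attempt produced only missing values: a lookbehind has to match the text immediately before the extracted part, and the patterns in the question expect a colon (`city of:` or `of:`) that never occurs in the descriptions. A minimal sketch of the difference, using a made-up vector `x` standing in for `df$Description`:

```r
library(stringr)

# Hypothetical sample standing in for df$Description
x <- c("Building a hospital in the city of LA, USA",
       "Building a school in the city of NYC, USA")

# Pattern from the question: the lookbehind requires "city of:" plus whitespace,
# but the text contains no colon, so nothing matches
original <- str_extract(x, "(?<=city of:\\s)[^;]+")
original
#> [1] NA NA

# Dropping the colon lets the lookbehind match; \\w+ then stops at the comma
fixed <- str_extract(x, "(?<=city of\\s)\\w+")
fixed
#> [1] "LA"  "NYC"
```

The `data.frame()` wrapper in the question is not what causes the missing values, but it is probably unnecessary: `str_extract()` already returns a character vector that can be assigned to `df$city` directly.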
However, if you want everything up to the comma (for example, for cities with a space in their name), you can use:
df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")
Check out the following example:
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
"Building a school in the city of NYC, USA",
"Building shops in the city of Chicago, USA",
"Building a church in the city of Salt Lake City, USA"))
str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA" "NYC" "Chicago" "Salt"
str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA" "NYC" "Chicago" "Salt Lake City"
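One caveat worth keeping in mind: `.+` is greedy, so when a description contains more than one comma, the lookahead version runs up to the last comma rather than the first. A sketch with a made-up description (the "Utah" example is not from the question), compared with a negated character class that stops at the first comma:

```r
library(stringr)

# Hypothetical description with two commas
x <- "Building a park in the city of Salt Lake City, Utah, USA"

# Greedy .+ backtracks only as far as the last comma
greedy <- str_extract(x, "(?<=city of )(.+)(?=,)")
greedy
#> [1] "Salt Lake City, Utah"

# [^,]+ cannot cross a comma, so it stops at the first one
first_comma <- str_extract(x, "(?<=city of )[^,]+")
first_comma
#> [1] "Salt Lake City"
```

Note that `[^,]+` also returns a result when there is no comma at all (everything up to the end of the string), whereas the lookahead version returns `NA` in that case. Which behavior is preferable depends on the data.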
Check out ?regex:
Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
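Since `?regex` documents base R's regular expressions, the same lookbehind can also be used without stringr by passing `perl = TRUE`. A minimal sketch, assuming a character vector `x` like the descriptions above (the third element is invented here to show the no-match case):

```r
x <- c("Building a hospital in the city of LA, USA",
       "Building shops in the city of Chicago, USA",
       "Renovating a bridge")  # hypothetical row without "city of"

# perl = TRUE enables lookaround; regexpr() returns -1 where nothing matches
m <- regexpr("(?<=city of )\\w+", x, perl = TRUE)

# regmatches() drops non-matches, so fill a vector of NAs to keep the length
city <- rep(NA_character_, length(x))
city[m > 0] <- regmatches(x, m)
city
#> [1] "LA"      "Chicago" NA
```

The extra step with `rep(NA_character_, ...)` keeps the result aligned with the input, which matters when assigning it back to a data frame column.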