Extract the first word after a specific expression in R

Question

I have a column containing thousands of descriptions like these (example):

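Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA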

I want to create a column containing the first word after "city of", like this:

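Description	city
Building a hospital in the city of LA, USA	LA
Building a school in the city of NYC, USA	NYC
Building shops in the city of Chicago, USA	Chicago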

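After seeing this topic, I tried the following code to extract the string after a specific word, but my column ends up filled with nothing but missing values: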

library(stringr)

df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))

df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))

I checked with dput(), and the output is identical to the descriptions I see directly in the data frame.

Answer 1

Solution

This should work for the data you have shown:

df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")

df
#>                                  Description    city
#> 1 Building a hospital in the city of LA, USA      LA
#> 2  Building a school in the city of NYC, USA     NYC
#> 3 Building shops in the city of Chicago, USA Chicago
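
As a rough illustration of why the attempt in the question returns only missing values: its lookbehind requires the literal text "city of:" (with a colon), which never occurs in the sample descriptions. The sketch below uses a single sample string; the variable names are made up for this example. It also shows that dropping the colon alone is not enough, because `[^;]+` keeps matching up to the end of the string when there is no semicolon.

```r
library(stringr)

x <- "Building a hospital in the city of LA, USA"

# Pattern from the question: the lookbehind needs "city of:" followed by
# whitespace, which is not present, so nothing matches
with_colon <- str_extract(x, "(?<=city of:\\s)[^;]+")
with_colon
#> [1] NA

# Without the colon it matches, but [^;]+ consumes the rest of the string
without_colon <- str_extract(x, "(?<=city of\\s)[^;]+")
without_colon
#> [1] "LA, USA"

# \\w+ stops at the first non-word character (here, the comma)
word_only <- str_extract(x, "(?<=city of )\\w+")
word_only
#> [1] "LA"
```

Note also that `str_extract()` already returns a character vector, so there is no need to wrap it in `data.frame()` before assigning it to `df$city`.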

Alternative

However, if you want the whole string up to the first comma (for example, for cities with spaces in their names), you can use:

df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")

See the following example:

df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
                                 "Building a school in the city of NYC, USA",
                                 "Building shops in the city of Chicago, USA",
                                 "Building a church in the city of Salt Lake City, USA"))

str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA"      "NYC"     "Chicago" "Salt"   

str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA"             "NYC"            "Chicago"        "Salt Lake City"
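
One caveat, assuming some descriptions could contain more than one comma after the city name (the sample data does not): `.+` is greedy, so `(.+)(?=,)` runs up to the last comma rather than the first. A possible variant is to match anything that is not a comma. The second input string below is invented for illustration.

```r
library(stringr)

x <- c("Building a church in the city of Salt Lake City, USA",
       "Building a church in the city of Salt Lake City, Utah, USA")  # hypothetical input

# Greedy .+ backtracks only as far as the LAST comma
greedy <- str_extract(x, "(?<=city of )(.+)(?=,)")
greedy
#> [1] "Salt Lake City"       "Salt Lake City, Utah"

# [^,]+ stops in front of the FIRST comma
first_comma <- str_extract(x, "(?<=city of )[^,]+")
first_comma
#> [1] "Salt Lake City" "Salt Lake City"
```

If a description has no comma at all, `[^,]+` matches through to the end of the string, whereas the lookahead version returns `NA`; which behaviour is preferable depends on the data.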

Documentation

Have a look at ?regex:

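Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....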
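
Since ?regex documents base R's regular expressions, here is a minimal sketch of the same extraction without stringr. In base R, lookaround assertions are only available with `perl = TRUE`. The second input string and the variable names are made up for this example; `regmatches()` drops non-matching elements, so the result is filled into a vector of `NA`s to keep it aligned with the input.

```r
x <- c("Building a hospital in the city of LA, USA",
       "No matching phrase here")  # hypothetical non-matching row

# regexpr() returns -1 for elements without a match
m <- regexpr("(?<=city of )\\w+", x, perl = TRUE)

# Start from all-NA and fill in only the rows that matched
city <- rep(NA_character_, length(x))
city[m != -1] <- regmatches(x, m)
city
#> [1] "LA" NA
```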

Solution 1
1 Accepted 2021-07-16 15:37:48
