Regex pattern with gsub in R: extracting a small pattern from the middle of a larger pattern in an XML file

Question

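Hi everyone. I am completely new to regular expressions in R, and I have run into a problem trying to retrieve a set of smaller patterns from the middle of a larger pattern in a tagged XML file.

Here I have a three-word sequence, "reinforce the advantages", tagged with the BNC (British National Corpus) Basic (C5) tagset. Specifically, I want to retrieve only the three lemmatized words that immediately follow each "hw=" in this long sequence.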

<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>

Can anyone suggest a possible solution using gsub or another function in R? Thanks in advance!

NF

Answer 1

vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

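# Lookbehind requires the PCRE engine, hence perl = TRUE
# (spelled out, because T is an ordinary variable that can be reassigned)
m <- gregexpr("(?<=hw=)\\S+", vec, perl = TRUE)
# regmatches() returns a list with one character vector per element of vec
regmatches(vec, m)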

# [[1]]
# [1] "reinforce" "the"       "advantage"
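
Since the question asks about gsub specifically, here is a minimal sketch of the same extraction done by replacement rather than matching. It assumes every element looks like the sample (a `<w ...>` tag containing an `hw=` attribute and no nested tags); the variable name `lemmas` is only illustrative.

```r
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# Replace each whole <w ...>...</w> element with the value of its hw attribute
# plus a trailing space; \\1 refers to the first capture group.
lemmas <- gsub("<w[^>]*\\bhw=([^\\s>]+)[^>]*>[^<]*</w>", "\\1 ", vec, perl = TRUE)

# Drop the space left after the last lemma
lemmas <- trimws(lemmas)
lemmas
# [1] "reinforce the advantage"
```

This gives one space-separated string directly; `strsplit(lemmas, " ")[[1]]` would turn it back into a character vector if separate words are needed.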

Explanation copied from regex101.com:

/
(?<=hw=)\S+
/

Positive Lookbehind (?<=hw=)

Assert that the Regex below matches
hw= matches the characters hw= literally (case sensitive)

\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])

+ Quantifier: matches between one and unlimited times, as many times as possible,
giving back as needed (greedy)
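
One consequence of the breakdown above: \S+ stops only at whitespace, so the pattern relies on hw= being followed by another attribute, as it is in the sample. The sketch below uses a hypothetical element (not from the original data) where hw is the last attribute, to show the difference when ">" is excluded as well.

```r
# Hypothetical element (not from the original data) with hw as the last attribute
x <- "<w c5=NN2 hw=advantage>advantages </w>"

# \S+ runs past the closing ">" because ">" is not whitespace
greedy <- regmatches(x, gregexpr("(?<=hw=)\\S+", x, perl = TRUE))[[1]]
greedy
# [1] "advantage>advantages"

# Excluding ">" too stops the match at the end of the attribute value
safe <- regmatches(x, gregexpr("(?<=hw=)[^\\s>]+", x, perl = TRUE))[[1]]
safe
# [1] "advantage"
```

For the data in the question both patterns return the same three lemmas, so the stricter character class is only a precaution.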

To get a single string, first flatten the list with unlist (see ?unlist), then collapse it with paste0 (see ?paste0):

paste0(unlist(
    regmatches(vec, m)
), collapse = " ")

# [1] "reinforce the advantage"
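
If the file holds many tagged sequences rather than one, calling unlist on the whole result would merge all lemmas into a single vector. A possible sketch that keeps one string per input element follows; `extract_lemmas` is a hypothetical helper name, and `seqs` is just the sample sequence split in two for illustration.

```r
# Hypothetical helper: one space-separated lemma string per input element
extract_lemmas <- function(x) {
  m <- gregexpr("(?<=hw=)\\S+", x, perl = TRUE)
  # regmatches() gives a list; collapse each list element separately
  vapply(regmatches(x, m), paste, character(1), collapse = " ")
}

# Illustrative input: the sample sequence split into two elements
seqs <- c(
  "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w>",
  "<w c5=NN2 hw=advantage pos=SUBST>advantages </w>"
)

extract_lemmas(seqs)
# [1] "reinforce the" "advantage"
```

An element with no hw= attribute comes back as an empty string, so the output always has the same length as the input.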

Solution 1
0 2018-11-12 11:14:50
