Regular expression pattern using gsub in R: get a small pattern in the middle of a larger pattern from an XML file
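Hi everyone. I am completely new to regular expressions in R, and I have run into a problem trying to retrieve a smaller set of patterns from the middle of a larger pattern in a tagged XML file.
Here I have a three-word sequence, "reinforce the advantages", tagged with the BNC (British National Corpus) Basic (C5) tagset. Specifically, I want to retrieve only the three lemmatized words that immediately follow each "hw=" in this long sequence.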
<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>
Can anyone offer a possible solution using gsub or another function in R? Thanks in advance!
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"
m <- gregexpr("(?<=hw=)\\S+", vec, perl = TRUE)
regmatches(vec, m)
# [[1]]
# [1] "reinforce" "the" "advantage"
Explanation copied from regex101.com:
/(?<=hw=)\S+/
Positive Lookbehind (?<=hw=)
Assert that the Regex below matches
hw= matches the characters hw= literally (case sensitive)
\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible,
giving back as needed (greedy)
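To see what the lookbehind contributes, here is a small sketch comparing the pattern with and without it. The strings x and y are made-up examples, not from the question, and the stricter class [^\s>]+ is only my suggestion for the case where hw= happens to be the last attribute before the closing ">".

```r
# Hypothetical single tag, for illustration only
x <- "<w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# Without the lookbehind, "hw=" is part of the returned match
with_prefix <- regmatches(x, regexpr("hw=\\S+", x, perl = TRUE))

# With the lookbehind, "hw=" must precede the match but is not returned
lemma <- regmatches(x, regexpr("(?<=hw=)\\S+", x, perl = TRUE))

# Assumed edge case: hw= is the last attribute, so \S+ runs past ">"
y <- "<w hw=the>the </w>"
greedy <- regmatches(y, regexpr("(?<=hw=)\\S+", y, perl = TRUE))

# Excluding ">" from the character class stops at the end of the tag
strict <- regmatches(y, regexpr("(?<=hw=)[^\\s>]+", y, perl = TRUE))
```

In the data from the question, hw= is always followed by a space and another attribute, so \S+ is enough there.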
If you want a single string, first flatten the list (see ?unlist), then collapse the result (see ?paste0):
paste0(unlist(regmatches(vec, m)), collapse = " ")
# [1] "reinforce the advantage"
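Since the question asked about gsub specifically, here is a rough sketch of how it could be done with a capture group instead of a lookbehind. It assumes every <w ...> element carries an hw= attribute and contains no nested tags; the names out and lemmas are just for illustration.

```r
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# Replace each whole <w ...>...</w> element with its hw= value plus a space
out <- gsub("<w [^>]*hw=([^ >]+)[^>]*>[^<]*</w>", "\\1 ", vec)

# Drop the trailing space left by the last replacement
out <- trimws(out)

# Split back into one lemma per element if a vector is needed
lemmas <- strsplit(out, " ", fixed = TRUE)[[1]]
```

The gregexpr/regmatches approach is usually simpler for pulling out several matches, because gsub has to consume everything around each match in order to discard it.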