Counting string occurrences with ArangoDB AQL
To count the number of objects that contain a particular attribute value, I can do something like this:
FOR t IN thing
  COLLECT other = t.name == "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount
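But how can I count three other occurrences in the same query, without resorting to subqueries that run through the same dataset multiple times?
I have tried something like this: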
FOR t IN thing
  COLLECT
    other = t.name == "Other",
    some = t.name == "Some",
    thing = t.name == "Thing"
  WITH COUNT INTO count
  RETURN {
    other, some, thing,
    count
  }
But I could not make sense of the results. Am I approaching this the wrong way?
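Split and count
You can split the string by the phrase and subtract 1 from the number of parts. This works for any substring, which on the other hand means that it does not respect word boundaries.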
LET things = [
  {name: "Here are SomeSome and Some Other Things, brOther!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]
FOR t IN things
  LET Some = LENGTH(SPLIT(t.name, "Some")) - 1
  LET Other = LENGTH(SPLIT(t.name, "Other")) - 1
  LET Thing = LENGTH(SPLIT(t.name, "Thing")) - 1
  RETURN {
    Some, Other, Thing
  }
Result:
[
  {
    "Some": 3,
    "Other": 2,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]
You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.
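As a minimal sketch of that case-insensitive variant, the query below lowercases the input once per document and then splits by the lowercased phrases. The sample data is taken from the query above; the variable name `lowered` is only illustrative.

```aql
LET things = [
  {name: "Here are SomeSome and Some Other Things, brOther!"},
  {name: "some-Other-here-though!"}
]
FOR t IN things
  // Lowercase the input once so that every SPLIT() sees the same string
  LET lowered = LOWER(t.name)
  // Number of parts minus 1 = number of (case-insensitive) substring matches
  LET Some = LENGTH(SPLIT(lowered, LOWER("Some"))) - 1
  LET Other = LENGTH(SPLIT(lowered, LOWER("Other"))) - 1
  LET Thing = LENGTH(SPLIT(lowered, LOWER("Thing"))) - 1
  RETURN {
    Some, Other, Thing
  }
```

Like the case-sensitive version, this still counts substrings, so "brOther" contributes to the count for "Other".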
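Collecting words
The TOKENS() function can be used to split the input into an array of words, which can then be grouped and counted. Note that I changed the input slightly. An input "SomeSome" will not be counted because "somesome" != "some" (this variant is word-based rather than substring-based).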
LET things = [
  {name: "Here are SOME some and Some Other Things. More Other!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")
FOR t IN things
  LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
  LET counts = MERGE(FOR w IN whitelisted
    COLLECT word = w WITH COUNT INTO count
    RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
    things: counts.things || 0
  }
Result:
[
  {
    "name": "Here are SOME some and Some Other Things. More Other!",
    "some": 3,
    "other": 2,
    "things": 0
  },
  {
    "name": "There are no such substrings in here.",
    "some": 0,
    "other": 0,
    "things": 0
  },
  {
    "name": "some-Other-here-though!",
    "some": 1,
    "other": 1,
    "things": 0
  }
]
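This does use a subquery for the COLLECT, otherwise it would count the total number of occurrences across the entire input.
The whitelist step is not strictly necessary; you could also let it count all words. For larger input strings, it might save some memory to not do this for words you are not interested in.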
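Without the whitelist, the per-document counting could look roughly like the sketch below. It reuses the sample data and the "text_en" Analyzer from the query above and returns the whole `counts` object instead of picking individual keys. The keys are whatever tokens the Analyzer emits, so they may be lowercased and stemmed.

```aql
LET things = [
  {name: "Here are SOME some and Some Other Things. More Other!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]
FOR t IN things
  // Group and count every token of this document, not just whitelisted ones
  LET counts = MERGE(FOR w IN TOKENS(t.name, "text_en")
    COLLECT word = w WITH COUNT INTO count
    RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    counts
  }
```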
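If you want to match the words exactly, you might need to create a separate Analyzer with stemming disabled for the language. You can also turn off normalization ("accent": true, "case": "none"). An alternative would be to use REGEX_SPLIT() on typical whitespace and punctuation characters for a simpler tokenization, but that depends on your use case.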
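A rough sketch of the REGEX_SPLIT() alternative is shown below. The character class used as separator (whitespace, period, comma, exclamation mark, question mark, hyphen) and the `wanted` list are assumptions made for illustration, not a complete tokenizer. Matching is exact and case-sensitive here, because no Analyzer is involved.

```aql
LET things = [
  {name: "Here are SOME some and Some Other Things. More Other!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]
// Hypothetical list of exact words to count
LET wanted = ["Some", "Other", "Things"]
FOR t IN things
  // Split on runs of whitespace and a few punctuation characters (assumed set)
  LET words = REGEX_SPLIT(t.name, "[\\s.,!?\\-]+")
  LET counts = MERGE(FOR w IN words
    FILTER w IN wanted
    COLLECT word = w WITH COUNT INTO count
    RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    Some: counts.Some || 0,
    Other: counts.Other || 0,
    Things: counts.Things || 0
  }
```

Empty strings that the split may produce at the start or end of the input are dropped by the FILTER, since they are not in `wanted`.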
Other solutions
I don't think it is possible to count per input object independently using COLLECT without a subquery, unless you want a total count.
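If a total across all documents is what you are after, one possible sketch uses COLLECT AGGREGATE with SUM() to count several exact attribute values in a single pass. The collection name `thing` and the compared values come from the question; the variable names `otherCount`, `someCount` and `thingCount` are made up for this example.

```aql
FOR t IN thing
  // Each comparison contributes 1 or 0, so SUM() yields the number of matches
  COLLECT AGGREGATE
    otherCount = SUM(t.name == "Other" ? 1 : 0),
    someCount = SUM(t.name == "Some" ? 1 : 0),
    thingCount = SUM(t.name == "Thing" ? 1 : 0)
  RETURN {
    otherCount, someCount, thingCount
  }
```

This iterates over the collection once and returns a single object with the three totals. It counts documents whose name equals the value exactly, not substring occurrences inside a name.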
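Splitting is a bit of a hack, but you could replace SPLIT() with REGEX_SPLIT() and wrap the search phrases in \\b so that they only match if there is a word boundary on both sides. Then it should only match whole words (more or less):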
LET things = [
  {name: "Here are SomeSome and Some Other Things, brOther!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]
FOR t IN things
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b")) - 1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b")) - 1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b")) - 1
  RETURN {
    Some, Other, Thing
  }
Result:
[
  {
    "Some": 1,
    "Other": 1,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]
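A more elegant solution would be to use ArangoSearch for the word counting, but it does not have a feature that lets you retrieve how often a word occurs. It might keep track of that internally (Analyzer feature "frequency"), but it is definitely not exposed at this point in time.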