Count string occurrences with ArangoDB AQL

Question

To count the number of objects that have a specific attribute value, I can do something like this:

FOR t IN thing
  COLLECT other = t.name == "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount

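But how can I count three other occurrences in the same query, without resorting to subqueries that run over the same data set multiple times?

I tried something like this: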

FOR t IN thing
  COLLECT
    other = t.name == "Other",
    some = t.name == "Some",
    thing = t.name == "Thing"
  WITH COUNT INTO count
  RETURN {
    other, some, thing,
    count
  }

But I can't make sense of the results: am I approaching this the wrong way?

Answer 1

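Split and count

You can split the string by the phrase and subtract 1 from the number of resulting parts. This works for any substring, which on the other hand means that it does not respect word boundaries.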

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(SPLIT(t.name, "Some"))-1
  LET Other = LENGTH(SPLIT(t.name, "Other"))-1
  LET Thing = LENGTH(SPLIT(t.name, "Thing"))-1
  RETURN {
    Some, Other, Thing
  }

Result:

[
  {
    "Some": 3,
    "Other": 2,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]

You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.
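
As a rough sketch of that case-insensitive variant, reusing two of the sample strings from above: the haystack is lowercased once per document and the search phrases are written in lowercase directly. The intermediate variable `lowered` is only there for illustration.

```aql
LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  // Lowercase the input once, then split by lowercase search phrases
  LET lowered = LOWER(t.name)
  RETURN {
    Some: LENGTH(SPLIT(lowered, "some")) - 1,
    Other: LENGTH(SPLIT(lowered, "other")) - 1,
    Thing: LENGTH(SPLIT(lowered, "thing")) - 1
  }
```

This is still substring-based, so "brOther" continues to count towards Other.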

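Collecting words

The TOKENS() function can be used to split the input into an array of words, which can then be grouped and counted. Note that I changed the input slightly. An input "SomeSome" would not be counted, because "somesome" != "some" (this variant is word-based, not substring-based).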

LET things = [
    {name: "Here are SOME some and Some Other Things. More Other!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")

FOR t IN things
  LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
  LET counts = MERGE(FOR w IN whitelisted
    COLLECT word = w WITH COUNT INTO count
    RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
    things: counts.things || 0
  }

Result:

[
  {
    "name": "Here are SOME some and Some Other Things. More Other!",
    "some": 3,
    "other": 2,
    "things": 0
  },
  {
    "name": "There are no such substrings in here.",
    "some": 0,
    "other": 0,
    "things": 0
  },
  {
    "name": "some-Other-here-though!",
    "some": 1,
    "other": 1,
    "things": 0
  }
]

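This does use a subquery for the COLLECT; otherwise it would count the total number of occurrences across the entire input.

The whitelist step is not strictly necessary; you could also let it count all words. For larger input strings, it may save some memory not to do this for words you are not interested in.

If you want to match words exactly, you may need to create a separate analyzer with stemming disabled for the language. Stemming is presumably why "things" is 0 in the result above: text_en reduces "Things" to "thing", so counts.things is never set. You can also turn off normalization ("accent": true, "case": "none"). An alternative is to use REGEX_SPLIT() on typical whitespace and punctuation characters for a simpler tokenization, but that depends on your use case.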
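
The REGEX_SPLIT() alternative could look roughly like the sketch below. The separator pattern (any run of characters that are not ASCII letters) and the `wanted` list are assumptions chosen for this sample input, not a general-purpose tokenizer. Matching is exact and case-sensitive here, so "SOME" and "some" are not counted as "Some".

```aql
LET things = [
    {name: "Here are SOME some and Some Other Things. More Other!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]
// Assumption: only these exact spellings are of interest
LET wanted = ["Some", "Other", "Things"]

FOR t IN things
  // Assumption: every run of non-letters acts as a word separator
  LET words = REGEX_SPLIT(t.name, "[^A-Za-z]+")
  // Group the wanted words per document and merge into one object
  LET counts = MERGE(
    FOR w IN words
      FILTER w IN wanted
      COLLECT word = w WITH COUNT INTO count
      RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    Some: counts.Some || 0,
    Other: counts.Other || 0,
    Things: counts.Things || 0
  }
```

Empty strings produced at the edges of the split are dropped by the FILTER, since they are not in the `wanted` list.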

Other solutions

I don't think it is possible to count each input object independently using COLLECT without a subquery, unless you want a total count.
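
For the total-count case, which is what the queries in the question are after, a single pass over the collection may be enough. The following is a minimal sketch, assuming a collection named thing as in the question and exact matches on t.name. The variable names otherCount, someCount and thingCount are made up for illustration.

```aql
FOR t IN thing
  // One pass: each aggregate adds 1 for every document whose name matches
  COLLECT AGGREGATE
    otherCount = SUM(t.name == "Other" ? 1 : 0),
    someCount = SUM(t.name == "Some" ? 1 : 0),
    thingCount = SUM(t.name == "Thing" ? 1 : 0)
  RETURN { otherCount, someCount, thingCount }
```

Unlike grouping by three boolean expressions, this returns a single row with one counter per value instead of one row per combination of true/false flags.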

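Splitting is a bit of a hack, but you can replace SPLIT() with REGEX_SPLIT() and wrap the search phrases in \\b so that they only match when there is a word boundary on both sides. It should then match only whole words (more or less):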

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b"))-1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b"))-1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b"))-1
  RETURN {
    Some, Other, Thing
  }

Result:

[
  {
    "Some": 1,
    "Other": 1,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]

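A more elegant solution would be to use ArangoSearch for word counting, but it has no feature that lets you retrieve how often a word occurs. It may keep track of that internally (analyzer feature "frequency"), but it is definitely not exposed at this point.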

Solution 1
2 Accepted 2020-01-16 00:48:58
