
Clojure: Idiomatic way of merging a map's keys based on the keys' similarity?

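I am trying to merge the keys of a map based on the similarity of the key values, producing a new map in which similar keys are merged into one. Here is my code to illustrate the idea:

Given a dataset: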

(def engineer-visits
  (incanter.core/dataset ["Engineer" "Credit" "Comments"]
                         [["Jonah" 1 "OK"]
                          ["Jonah" 2 "Very good"]
                          ["Joneh" 0 "Not very good"]
                          ["Joneh" 3 "Excellent"]
                          ["Esther" 2 "Missing comment"]
                          ["Esther" 4 "Extraordinary"]]))

Value:

| Engineer | Credit |        Comments |
|----------+--------+-----------------|
|    Jonah |      1 |              OK |
|    Jonah |      2 |       Very good |
|    Joneh |      0 |   Not very good |
|    Joneh |      3 |       Excellent |
|   Esther |      2 | Missing comment |
|   Esther |      4 |   Extraordinary |

Here is a map from each engineer to his or her records:

(def by-engineers (incanter.core/$group-by "Engineer" engineer-visits ))

Value:

{{"Engineer" "Jonah"} 
| Engineer | Credit |  Comments |
|----------+--------+-----------|
|    Jonah |      1 |        OK |
|    Jonah |      2 | Very good |
, {"Engineer" "Joneh"} 
| Engineer | Credit |      Comments |
|----------+--------+---------------|
|    Joneh |      0 | Not very good |
|    Joneh |      3 |     Excellent |
, {"Engineer" "Esther"} 
| Engineer | Credit |        Comments |
|----------+--------+-----------------|
|   Esther |      2 | Missing comment |
|   Esther |      4 |   Extraordinary |
}

With the following function, I would like to obtain:

(map-merged-by-key-value-similarity by-engineers 0.8)

{{"Engineer" "Jonah"} 
| Engineer | Credit |      Comments |
|----------+--------+---------------|
|    Jonah |      1 |            OK |
|    Jonah |      2 |     Very good |
|    Joneh |      0 | Not very good |
|    Joneh |      3 |     Excellent |
, {"Engineer" "Esther"} 
| Engineer | Credit |        Comments |
|----------+--------+-----------------|
|   Esther |      2 | Missing comment |
|   Esther |      4 |   Extraordinary |
}


(defn map-merged-by-key-value-similarity
  "From a map produced by $group-by on a dataset, produce a map of the same
  structure, with key column values merged by similarity."
  [a-map threshold]
  (let [column-keys (keys a-map)
        key-column-name (->> column-keys first keys first)
        ;; Extract the key column values from the keys of the map,
        ;; i.e. from the pairs of column name and column value:
        key-column-values (flatten (map vals column-keys))
        ;; Compute string clusters for the values:
        value-similarity-cluster (similarity-cluster key-column-values threshold)
        ;; Reconstruct the keys for the updated map from the clustered column values:
        reconstructed-column-value-key-cluster-list
        (map (fn [cluster]
               (map (fn [name] {key-column-name name}) cluster))
             value-similarity-cluster)
        ;; Pick a representative key out of a cluster:
        representative (fn [cluster] (first cluster))
        map-from-cluster-combined-fn
        (fn [cluster]
          ;; The cluster is a list of maps from key-column-name to the
          ;; column's string value.
          (if (< 1 (count cluster))
            ;; Combine the datasets of all keys in the cluster
            ;; (conj-rows is incanter.core/conj-rows):
            (apply merge-with conj-rows
                   (map (fn [key] {(representative cluster) (a-map key)})
                        cluster))
            ;; Single-element cluster: keep the entry as is.
            {(first cluster) (a-map (first cluster))}))]
    (apply merge (map map-from-cluster-combined-fn
                      reconstructed-column-value-key-cluster-list))))

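The function above does work as expected, but I am hoping there is a more idiomatic way to implement it. Deconstructing the keys and values of the map, processing the keys, and reconstructing a map of the same shape is a fairly symmetric process, so I think it could be expressed more elegantly. I vaguely recall that in Scala, some monadic operators can be useful for accessing and processing information buried deep inside a list structure.

Thanks for your comments or help!

Note: similarity-cluster converts a list of strings into a list of lists of strings, where similar strings are put into the same inner list. It is my own implementation, and its details are not relevant to my question.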
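For readers who want something runnable to experiment with, here is a minimal sketch of a function with the shape described in the note. It is not the implementation the note refers to: `cluster-by` and the `similar?` predicate are hypothetical names, and the greedy strategy (each string is compared only with the first member of each existing cluster) is an assumption made for brevity.

```clojure
(defn cluster-by
  "Greedy clustering sketch. Each string joins the first existing cluster
  whose first element satisfies (similar? first-element s); otherwise it
  starts a new cluster. Returns a vector of vectors of strings."
  [similar? strings]
  (reduce (fn [clusters s]
            ;; Index of the first cluster whose representative is similar to s.
            (let [idx (first (keep-indexed
                               (fn [i cluster]
                                 (when (similar? (first cluster) s) i))
                               clusters))]
              (if idx
                (update clusters idx conj s)   ; add s to the matching cluster
                (conj clusters [s]))))         ; no match: start a new cluster
          []
          strings))

;; Toy predicate for illustration only: same first letter.
(cluster-by (fn [a b] (= (first a) (first b)))
            ["Jonah" "Joneh" "Esther"])
;; => [["Jonah" "Joneh"] ["Esther"]]
```

A real predicate would presumably compare a string similarity score against the threshold; the toy predicate here only keeps the example self-contained.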

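Things become slightly easier when you just use a table (a vector of maps with the same keys) instead of an Incanter dataset. There are, however, a few Incanter functions for switching between the two.

Also, although you may consider the implementation of your similarity-cluster irrelevant, posting at least a working version of it would make it easier for people to answer your question with working code.

To measure the similarity between strings, I used this purely functional Levenshtein distance implementation as the levenshtein-distance function, treating names fewer than 3 edits apart as similar: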

(def engineer-visits
  [{:comments "OK", :engineer "Jonah", :credit 1}
   {:comments "Very good", :engineer "Jonah", :credit 2}
   {:comments "Not very good", :engineer "Joneh", :credit 0}
   {:comments "Excellent", :engineer "Joneh", :credit 3}
   {:comments "Missing comment", :engineer "Esther", :credit 2}
   {:comments "Extraordinary", :engineer "Esther", :credit 4}])

(defn similarity-matrix
  [coll]
  (into {} (for [x coll, y coll
                 :when (< (levenshtein-distance x y) 3)]
             [x y])))

(def similarity
  (similarity-matrix (distinct (map :engineer engineer-visits))))
=> {"Jonah" "Joneh", "Joneh" "Joneh", "Esther" "Esther"}    

(group-by #(get similarity (:engineer %)) engineer-visits)
=>
{"Joneh"
 [{:comments "OK", :engineer "Jonah", :credit 1}
  {:comments "Very good", :engineer "Jonah", :credit 2}
  {:comments "Not very good", :engineer "Joneh", :credit 0}
  {:comments "Excellent", :engineer "Joneh", :credit 3}],
 "Esther"
 [{:comments "Missing comment", :engineer "Esther", :credit 2}
  {:comments "Extraordinary", :engineer "Esther", :credit 4}]}

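It is worth noting that, because the elements of the similarity matrix are put into a hash map, the ["Jonah" "Jonah"] key-value pair is overwritten by the ["Jonah" "Joneh"] pair that follows it, and ["Joneh" "Jonah"] is likewise overwritten by ["Joneh" "Joneh"]. Both names therefore map to "Joneh", which is exactly what the grouping needs.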
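The levenshtein-distance function itself is not shown in this answer. The following is a minimal sketch of one possible purely functional version, not necessarily the one that was actually used; it builds the standard dynamic-programming table one row at a time.

```clojure
(defn levenshtein-distance
  "Number of single-character insertions, deletions, and substitutions
  needed to turn string a into string b."
  [a b]
  (let [next-row
        (fn [prev-row ca]
          ;; prev-row holds the distances from the previous prefix of a to
          ;; every prefix of b; compute the row for one more character of a.
          (reduce (fn [row [j cb]]
                    (conj row
                          (min (inc (peek row))              ; insertion
                               (inc (nth prev-row (inc j)))  ; deletion
                               (+ (nth prev-row j)           ; substitution
                                  (if (= ca cb) 0 1)))))
                  [(inc (first prev-row))]
                  (map-indexed vector b)))]
    ;; Start from the distances between the empty prefix of a and each
    ;; prefix of b, then fold in the characters of a one by one.
    (peek (reduce next-row (vec (range (inc (count b)))) a))))

(levenshtein-distance "Jonah" "Joneh")    ;; => 1
(levenshtein-distance "kitten" "sitting") ;; => 3
```

With this definition, "Jonah" and "Joneh" are 1 edit apart and so fall under the cut-off of 3, while "Esther" is further than that from both.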

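Inspired by Niels's answer, which helped make my question clearer, here is my question with the irrelevant parts left out:

Given a table, a way of clustering the values of the ":engineer" column, and a way of choosing a representative value from a cluster, what is the expression that constructs a map from those representatives to the corresponding rows of the table?

Here is the distilled solution. Thanks again to Niels for the answer.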

(def engineer-visits
  [{:comments "OK", :engineer "Jonah", :credit 1}
   {:comments "Very good", :engineer "Jonah", :credit 2}
   {:comments "Not very good", :engineer "Joneh", :credit 0}
   {:comments "Excellent", :engineer "Joneh", :credit 3}
   {:comments "Missing comment", :engineer "Esther", :credit 2}
   {:comments "Extraordinary", :engineer "Esther", :credit 4}])

(defn clusters [names] '(("Jonah" "Joneh") ("Esther")))
(defn representative [cluster] (first cluster))

(def representatives 
  (->> engineer-visits
       (map :engineer)
       distinct
       clusters
       (map (fn [cluster] (apply merge (map (fn [name] {name (representative cluster)}) cluster))))
       (apply merge)
       ))

(group-by #(get representatives (:engineer %)) engineer-visits)

Result =>

{"Jonah" 
 [{:comments "OK", :engineer "Jonah", :credit 1} 
  {:comments "Very good", :engineer "Jonah", :credit 2} 
  {:comments "Not very good", :engineer "Joneh", :credit 0} 
  {:comments "Excellent", :engineer "Joneh", :credit 3}], 
 "Esther" 
 [{:comments "Missing comment", :engineer "Esther", :credit 2} 
  {:comments "Extraordinary", :engineer "Esther", :credit 4}]}
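As a possible variation on the solution above, the name-to-representative lookup can also be written as a single `for` comprehension poured into a map, which avoids the nested `apply merge` calls. This is only a sketch: `representatives-of` and `group-by-representative` are hypothetical names, and `clusters` is the same hard-coded stub as above rather than a real clustering function.

```clojure
(defn clusters
  "Stub standing in for a real clustering function, as in the solution above."
  [names]
  '(("Jonah" "Joneh") ("Esther")))

(defn representative [cluster] (first cluster))

(defn representatives-of
  "Map from every name to the representative of the cluster it belongs to."
  [names]
  (into {}
        (for [cluster (clusters names)
              name    cluster]
          [name (representative cluster)])))

(defn group-by-representative
  "Group table rows by the representative of each row's :engineer value."
  [rows]
  (let [lookup (representatives-of (distinct (map :engineer rows)))]
    (group-by (comp lookup :engineer) rows)))

(representatives-of ["Jonah" "Joneh" "Esther"])
;; => {"Jonah" "Jonah", "Joneh" "Jonah", "Esther" "Esther"}
```

Calling `(group-by-representative engineer-visits)` on the table above should give the same grouping as the result shown, keyed by "Jonah" and "Esther".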
