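Clojure: Idiomatic way to merge a map's keys based on the keys' similarity?
I am trying to merge the keys of a map based on the similarity of their values, producing a new map in which similar keys are merged into one. Here is my code to illustrate the idea:
Given a dataset: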
(def engineer-visits
  (incanter.core/dataset
    ["Engineer" "Credit" "Comments"]
    [["Jonah" 1 "OK"]
     ["Jonah" 2 "Very good"]
     ["Joneh" 0 "Not very good"]
     ["Joneh" 3 "Excellent"]
     ["Esther" 2 "Missing comment"]
     ["Esther" 4 "Extraordinary"]]))
Its value:
| Engineer | Credit | Comments |
|----------+--------+-----------------|
| Jonah | 1 | OK |
| Jonah | 2 | Very good |
| Joneh | 0 | Not very good |
| Joneh | 3 | Excellent |
| Esther | 2 | Missing comment |
| Esther | 4 | Extraordinary |
Here is a map from each engineer to his or her records:
(def by-engineers (incanter.core/$group-by "Engineer" engineer-visits ))
Its value:
{{"Engineer" "Jonah"}
| Engineer | Credit | Comments |
|----------+--------+-----------|
| Jonah | 1 | OK |
| Jonah | 2 | Very good |
, {"Engineer" "Joneh"}
| Engineer | Credit | Comments |
|----------+--------+---------------|
| Joneh | 0 | Not very good |
| Joneh | 3 | Excellent |
, {"Engineer" "Esther"}
| Engineer | Credit | Comments |
|----------+--------+-----------------|
| Esther | 2 | Missing comment |
| Esther | 4 | Extraordinary |
}
With the function below, I would like to obtain:
(map-merged-by-key-value-similarity by-engineers 0.8)
{{"Engineer" "Jonah"}
| Engineer | Credit | Comments |
|----------+--------+---------------|
| Jonah | 1 | OK |
| Jonah | 2 | Very good |
| Joneh | 0 | Not very good |
| Joneh | 3 | Excellent |
, {"Engineer" "Esther"}
| Engineer | Credit | Comments |
|----------+--------+-----------------|
| Esther | 2 | Missing comment |
| Esther | 4 | Extraordinary |
}
(defn map-merged-by-key-value-similarity
  "From a map produced by $group-by on a dataset, produce a map of the same
  structure, with key column values merged by similarity."
  [a-map threshold]
  (let [column-keys (keys a-map)
        key-column-name (->> column-keys first keys first)
        ;; Extract the key column values from the keys of the map,
        ;; i.e. from the pairs of column name and column value:
        key-column-values (flatten (map vals column-keys))
        ;; Compute string clusters for the values:
        value-similarity-cluster (similarity-cluster key-column-values threshold)
        ;; Reconstruct the keys for the updated map from the clustered column values:
        reconstructed-column-value-key-cluster-list
        (map (fn [cluster]
               (map (fn [name] {key-column-name name}) cluster))
             value-similarity-cluster)
        ;; Pick a representative out of a cluster:
        representative (fn [cluster] (first cluster))
        ;; Each cluster is a list of maps from key-column-name to the column's string value:
        map-from-cluster-combined-fn
        (fn [cluster]
          (if (< 1 (count cluster))
            ;; combine
            (apply merge-with conj-rows
                   (map (fn [key] {(representative cluster) (a-map key)})
                        cluster))
            ;; as is
            {(first cluster) (a-map (first cluster))}))]
    (apply merge (map map-from-cluster-combined-fn
                      reconstructed-column-value-key-cluster-list))))
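The function above does work as intended, but I was hoping there is a more idiomatic way to implement it. Deconstructing the map's keys and values, processing the keys, and reconstructing a map of the same shape is a rather symmetric process, so I believe it can be expressed more elegantly. I vaguely remember that in Scala, some monadic operators can be useful for accessing and processing information buried deep inside a list structure.
Thanks for your comments or help!
Note: similarity-cluster turns a list of strings into a list of lists of strings, in which similar strings are put into the same inner list. It is my own implementation. Its details are not relevant to my question.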
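Since the implementation of similarity-cluster is not shown, the following is only a minimal sketch of a function with the described input and output shape. Both names, `hamming-similar?` and `cluster-by`, are hypothetical, the similarity test is a toy assumption (equal length and at most one differing character), and it takes a predicate rather than the numeric threshold used in the question.

```clojure
;; Hypothetical stand-in, NOT the author's similarity-cluster.
(defn hamming-similar?
  "Toy predicate (an assumption): equal length and at most one differing character."
  [a b]
  (and (= (count a) (count b))
       (<= (count (remove true? (map = a b))) 1)))

(defn cluster-by
  "Greedily puts each string into the first cluster whose first element is
  similar to it; otherwise the string starts a new cluster."
  [similar? strings]
  (reduce (fn [clusters s]
            (let [idx (first (keep-indexed
                               (fn [i cluster]
                                 (when (similar? (first cluster) s) i))
                               clusters))]
              (if idx
                (update clusters idx conj s)
                (conj clusters [s]))))
          []
          strings))

(cluster-by hamming-similar? ["Jonah" "Joneh" "Esther"])
;; => [["Jonah" "Joneh"] ["Esther"]]
```

Comparing only against the first element of each cluster keeps the sketch short; a real implementation might compare against every member or use a proper string distance.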
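Things get slightly easier when you just use a table (a vector of maps with the same keys) instead of an Incanter dataset. There are, however, several Incanter functions for converting between the two.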
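As a rough illustration of what "a vector of maps with the same keys" means, the raw rows from the question can be turned into such a table with plain Clojure. This sketch does not use Incanter's own conversion functions; the names `column-names`, `rows` and `table` are chosen here for illustration only.

```clojure
;; Column names as keywords, plus a few of the rows from the question.
(def column-names [:engineer :credit :comments])

(def rows
  [["Jonah" 1 "OK"]
   ["Joneh" 0 "Not very good"]
   ["Esther" 2 "Missing comment"]])

;; Pair every row with the column names to get one map per row.
(def table (mapv #(zipmap column-names %) rows))

(first table)
;; => {:engineer "Jonah", :credit 1, :comments "OK"}
```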
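Also, while you may think that your implementation of similarity-cluster is not relevant, posting at least some of it would make it easier for people to answer your question with working code.
To measure the similarity between strings, I used a purely functional Levenshtein distance implementation as the levenshtein-distance function, with a cutoff of 3 edits: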
(def engineer-visits
  [{:comments "OK", :engineer "Jonah", :credit 1}
   {:comments "Very good", :engineer "Jonah", :credit 2}
   {:comments "Not very good", :engineer "Joneh", :credit 0}
   {:comments "Excellent", :engineer "Joneh", :credit 3}
   {:comments "Missing comment", :engineer "Esther", :credit 2}
   {:comments "Extraordinary", :engineer "Esther", :credit 4}])
(defn similarity-matrix
  [coll]
  (into {} (for [x coll, y coll
                 :when (< (levenshtein-distance x y) 3)]
             [x y])))
(def similarity
  (similarity-matrix (distinct (map :engineer engineer-visits))))
=> {"Jonah" "Joneh", "Joneh" "Joneh", "Esther" "Esther"}
(group-by #(get similarity (:engineer %)) engineer-visits)
=>
{"Joneh"
[{:comments "OK", :engineer "Jonah", :credit 1}
{:comments "Very good", :engineer "Jonah", :credit 2}
{:comments "Not very good", :engineer "Joneh", :credit 0}
{:comments "Excellent", :engineer "Joneh", :credit 3}],
"Esther"
[{:comments "Missing comment", :engineer "Esther", :credit 2}
{:comments "Extraordinary", :engineer "Esther", :credit 4}]}
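The levenshtein-distance function used above is not reproduced in this post. For readers who want to run the example, here is a minimal sketch of one way to write it. It is a stand-in under that assumption, not the implementation the answer refers to, and it computes the standard edit distance one row at a time.

```clojure
;; Stand-in sketch, NOT the implementation referenced by the answer.
(defn levenshtein-distance
  "Edit distance between strings a and b, computed row by row."
  [a b]
  (let [step (fn [prev-row ch-a]
               ;; Build the next row from the previous one, left to right.
               (reduce (fn [row [diag above ch-b]]
                         (conj row
                               (min (inc (peek row))                   ; insertion
                                    (inc above)                        ; deletion
                                    (+ diag (if (= ch-a ch-b) 0 1))))) ; substitution
                       [(inc (first prev-row))]
                       (map vector prev-row (rest prev-row) b)))]
    ;; The first row is the distance from the empty string to each prefix of b.
    (peek (reduce step (vec (range (inc (count b)))) a))))

(levenshtein-distance "Jonah" "Joneh")    ;; => 1
(levenshtein-distance "kitten" "sitting") ;; => 3
```

With the cutoff of 3 edits, "Jonah" and "Joneh" count as similar, while "Esther" is far enough from both to stay on its own.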
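It is worth noting that, because the elements of the similarity matrix are put into a hash map, the ["Jonah","Jonah"] key-value pair is overwritten by the ["Jonah","Joneh"] pair that follows it, and ["Joneh","Jonah"] is likewise overwritten by ["Joneh","Joneh"]. As a result both names map to "Joneh", which is very helpful for the result.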
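The overwriting behaviour can be checked in isolation. The sketch below spells out, by hand, the pairs that the `for` expression yields for the three names (the var name `pairs` is chosen here for illustration) and shows that, for a repeated key, the pair added last is the one that survives.

```clojure
;; The pairs in the order the `for` expression produces them.
(def pairs
  [["Jonah" "Jonah"] ["Jonah" "Joneh"]
   ["Joneh" "Jonah"] ["Joneh" "Joneh"]
   ["Esther" "Esther"]])

;; Later pairs replace earlier ones that share a key.
(into {} pairs)
;; => {"Jonah" "Joneh", "Joneh" "Joneh", "Esther" "Esther"}

;; Feeding the pairs in reverse order picks "Jonah" as the shared value instead.
(into {} (reverse pairs))
;; => {"Esther" "Esther", "Joneh" "Jonah", "Jonah" "Jonah"}
```

So the choice of representative depends on the order of the input names. Note also that this trick relies on similarity being transitive within a group; if A is similar to B and B to C, but A is not similar to C, the names may not all end up under one key.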
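Inspired by Niels's answer, and to make my question clearer, here is the question with the irrelevant parts left out:
Given a table, a way to cluster the values of the :engineer column, and a way to pick a representative value from a cluster, what is an expression that constructs a map from those representatives to the corresponding rows of the table?
Here is the distilled solution. Thanks again to Niels for the answer.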
(def engineer-visits
[{:comments "OK", :engineer "Jonah", :credit 1}
{:comments "Very good", :engineer "Jonah", :credit 2}
{:comments "Not very good", :engineer "Joneh", :credit 0}
{:comments "Excellent", :engineer "Joneh", :credit 3}
{:comments "Missing comment", :engineer "Esther", :credit 2}
{:comments "Extraordinary", :engineer "Esther", :credit 4}])
;; Stub: stands in for a real clustering function and ignores its argument.
(defn clusters [names] '(("Jonah" "Joneh") ("Esther")))
(defn representative [cluster] (first cluster))
(def representatives
  (->> engineer-visits
       (map :engineer)
       distinct
       clusters
       (map (fn [cluster]
              (apply merge (map (fn [name] {name (representative cluster)}) cluster))))
       (apply merge)))
(group-by #(get representatives (:engineer %)) engineer-visits)
Result =>
{"Jonah"
[{:comments "OK", :engineer "Jonah", :credit 1}
{:comments "Very good", :engineer "Jonah", :credit 2}
{:comments "Not very good", :engineer "Joneh", :credit 0}
{:comments "Excellent", :engineer "Joneh", :credit 3}],
"Esther"
[{:comments "Missing comment", :engineer "Esther", :credit 2}
{:comments "Extraordinary", :engineer "Esther", :credit 4}]}
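As one possible variation on the distilled solution, the name-to-representative map can be built with a single `for` comprehension poured into a map, which avoids the nested `apply merge` calls. This is only a sketch: `clusters` is the same hard-coded stub as above, and the table is shortened to three rows to keep the example small.

```clojure
(def engineer-visits
  [{:comments "OK", :engineer "Jonah", :credit 1}
   {:comments "Excellent", :engineer "Joneh", :credit 3}
   {:comments "Extraordinary", :engineer "Esther", :credit 4}])

;; Stub clustering function, hard-coded as in the solution above.
(defn clusters [names] '(("Jonah" "Joneh") ("Esther")))
(defn representative [cluster] (first cluster))

;; One [name representative] pair per name, collected into a map.
(def representatives
  (into {}
        (for [cluster (clusters (distinct (map :engineer engineer-visits)))
              engineer cluster]
          [engineer (representative cluster)])))

representatives
;; => {"Jonah" "Jonah", "Joneh" "Jonah", "Esther" "Esther"}

;; A map is a function of its keys, so it composes directly with :engineer.
(group-by (comp representatives :engineer) engineer-visits)
```

The grouping step is unchanged in spirit; `(comp representatives :engineer)` is just a shorter way of writing the anonymous function used above.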