Creating a matrix from lists of attributes

Question

I have a CSV with a list of items, each with a series of attributes attached:

"5","coffee|peaty|sweet|cereal|cream|barley|malt|creosote|sherry|sherry|manuka|honey|peaty|peppercorn|chipotle|chilli|salt|caramel|coffee|demerara|sugar|molasses|spicy|peaty"
"6","oil|lemon|apple|butter|toffee|treacle|sweet|cola|oak|cereal|cinnamon|salt|toffee"

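"5" and "6" are both item IDs and are unique within the file.

Ultimately, I'd like to create a matrix showing how many times each attribute in the document is mentioned in the same row as every other attribute. For example: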

        peaty    sweet    cereal    cream    barley ...
coffee    1       2         2         1        1
oil       0       1         0         0        0

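Note that I'd prefer to reduce duplication: i.e. "peaty" should not appear as both a column and a row.

The original database is essentially a key-value store (a table with the columns "itemId" and "value"). I can reformat the data if that helps.

Any idea how to do this in Python, PHP or Ruby (whichever is easiest)? I get the feeling Python could do this most easily, but I'm missing something fairly basic and/or crucial (I'm just starting out with data analysis in Python).

Thanks!

EDIT: In response to the (somewhat unhelpful) "what have you tried" comment, here's what I'm working with at the moment (don't laugh, my Python is horrible):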

#!/usr/bin/python
import csv

matrix = {}

with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        for attrib in attribs:
            if attrib not in matrix:
                matrix[attrib] = {}
            for attrib2 in attribs:
                if attrib2 in matrix[attrib]:
                    matrix[attrib][attrib2] = matrix[attrib][attrib2] + 1 
                else:
                    matrix[attrib][attrib2] = 1
print matrix

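The output is a big, unsorted dictionary of terms, probably with a lot of duplication between rows and columns. If I use pandas and replace the "print matrix" line with the following...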

from pandas import *
df = DataFrame(matrix).T.fillna(0)
print df

I get:

<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, acacia to zesty
Columns: 195 entries, acacia to zesty
dtypes: float64(195)

...which makes me think I'm doing something wrong.

Answer 1

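I'd do this with an undirected graph, where the frequency is the edge weight. You can then generate the matrix quite easily by looping over each vertex, where each edge weight represents the number of times one element occurred with another.

Graph documentation: http://networkx.github.io/documentation/latest/reference/classes.graph.html

Starter code: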

import csv
import itertools
import networkx as nx

G = nx.Graph()

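# Binary mode ("rb") is what the Python 2 csv module expects;
# on Python 3, use open('field.csv', newline='') instead.
with open('field.csv', "rb") as csvfile:
    for row in csv.reader(csvfile):
        row_elements = row[1].split("|")
        # Every unordered pair of attributes in this row becomes an edge.
        for a, b in itertools.combinations(row_elements, 2):
            if G.has_edge(a, b):
                G[a][b]['weight'] += 1
            else:
                G.add_edge(a, b, weight=1)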

print(G.edges(data=True))

Edit: Wow, see if this does it all for you: http://networkx.github.io/documentation/latest/reference/linalg.html#module-networkx.linalg.graphmatrix
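
For what it's worth, the "loop over each vertex" step can be sketched without networkx at all. The version below is only an illustration written for Python 3: it uses a plain dict of dicts as a stand-in for the graph (so `weights[a][b]` plays the role of `G[a][b]['weight']`), the two rows are shortened sample data rather than the real file, and the names `weights`, `vertices` and `matrix_rows` are made up for this sketch. It also assumes repeated attributes within a row should count once, which the starter code above does not do.

```python
from collections import defaultdict
from itertools import combinations

# Shortened sample rows (hypothetical stand-ins for the CSV's second column).
rows = [
    "coffee|peaty|sweet|cereal",
    "oil|sweet|cereal|salt",
]

# weights[a][b] stands in for G[a][b]['weight'] in the networkx version.
weights = defaultdict(lambda: defaultdict(int))
for row in rows:
    # Assumption: an attribute repeated within one row counts only once.
    attribs = sorted(set(row.split("|")))
    for a, b in combinations(attribs, 2):
        weights[a][b] += 1
        weights[b][a] += 1  # undirected graph: store both directions

vertices = sorted(weights)


def matrix_rows(weights, vertices):
    """Return one list of edge weights per vertex, in vertex order."""
    return [[weights[a].get(b, 0) for b in vertices] for a in vertices]


matrix = matrix_rows(weights, vertices)
for name, counts in zip(vertices, matrix):
    print(name.ljust(8), counts)
```

The result is symmetric, so if the duplication bothers you, you can print only the part above the diagonal.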

Answer 2

I would use a Counter with tuples of 2 strings as keys. Of course you'll have every combination doubled, but so far I don't see how to avoid this:

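import csv
from collections import Counter
from itertools import combinations

counter = Counter()
# Binary mode ("rb") is what the Python 2 csv module expects;
# on Python 3, use open("field.csv", newline="") instead.
with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        # combinations is imported directly, so call it without the itertools prefix
        for cmb in combinations(attribs, 2):
            counter[cmb] += 1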
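
One possible way around the doubling, offered as a sketch rather than a definitive fix: sort the attributes in each row before pairing them, so that every pair is stored under a single canonical key. The code below is written for Python 3 and reads from an inline `io.StringIO` with made-up sample rows instead of field.csv. The helper names `count_pairs` and `cooccurrence` are hypothetical. It also assumes an attribute repeated within a row should count once, hence the `set()`.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Inline stand-in for field.csv (hypothetical sample rows, shortened).
sample = io.StringIO(
    '"5","coffee|peaty|sweet|peaty"\n'
    '"6","oil|sweet|coffee"\n'
)


def count_pairs(csvfile):
    """Count how many rows mention each unordered pair of attributes."""
    counter = Counter()
    for row in csv.reader(csvfile):
        # set() ignores repeats within a row; sorted() gives every pair
        # one canonical order, so (a, b) and (b, a) share a single key.
        attribs = sorted(set(row[1].split("|")))
        counter.update(combinations(attribs, 2))
    return counter


pair_counts = count_pairs(sample)


def cooccurrence(a, b):
    """Look up a pair regardless of the order the arguments are given in."""
    return pair_counts[tuple(sorted((a, b)))]


print(cooccurrence("sweet", "coffee"))
```

Because a Counter returns 0 for missing keys, pairs that never appear together need no special handling.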

Solution 1
1 2013-05-28 16:43:59

Solution 2
1 2013-05-29 08:16:46