
Creating a matrix from a list of attributes

I have a CSV with a list of items, each with a series of attributes attached:

"5","coffee|peaty|sweet|cereal|cream|barley|malt|creosote|sherry|sherry|manuka|honey|peaty|peppercorn|chipotle|chilli|salt|caramel|coffee|demerara|sugar|molasses|spicy|peaty"
"6","oil|lemon|apple|butter|toffee|treacle|sweet|cola|oak|cereal|cinnamon|salt|toffee"

"5" and "6" are both item IDs and are unique in the file.

Ultimately, I want to create a matrix showing how many times in the document each attribute was mentioned in the same row as every other attribute. E.g.:

        peaty    sweet    cereal    cream    barley ...
coffee    1       2         2         1        1
oil       0       1         0         0        0 

Note that I'd prefer to reduce duplicates: i.e., "peaty" shouldn't be both a column and a row.

The original database is essentially a key-value store (a table with columns "itemId" and "value"), so I can reformat the data if it helps.

Any idea how I'd do this with Python, PHP or Ruby (whichever is easiest)? I get the feeling Python can probably do this most easily, but I'm missing something fairly basic and/or crucial (I'm just starting to do data analysis with Python).

Thanks!

Edit: In response to the (somewhat unhelpful) "What have you tried" comment, here's what I'm currently working with (don't laugh, my Python is terrible):

#!/usr/bin/python
import csv

matrix = {}

with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        for attrib in attribs:
            if attrib not in matrix:
                matrix[attrib] = {}
            for attrib2 in attribs:
                if attrib2 in matrix[attrib]:
                    matrix[attrib][attrib2] = matrix[attrib][attrib2] + 1 
                else:
                    matrix[attrib][attrib2] = 1
print matrix 

The output is a big, unsorted dictionary of terms, likely with a lot of duplication between the rows and columns. If I use pandas and replace the "print matrix" line with the following...

from pandas import *
df = DataFrame(matrix).T.fillna(0)
print df

I get:

<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, acacia to zesty
Columns: 195 entries, acacia to zesty
dtypes: float64(195)

...which leads me to think I'm doing something rather wrong.

I'd do this with an undirected graph, where each attribute is a vertex and the co-occurrence frequency is the edge weight. Then you can generate the matrix quite easily by looping through each vertex: the weight of each edge is the number of times its two attributes occurred together.

Graph docs: http://networkx.github.io/documentation/latest/reference/classes.graph.html

Starter code:

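import csv
import itertools
import networkx as nx

G = nx.Graph()

# "rb" is the Python 2 csv idiom; on Python 3 use open('field.csv', newline='').
with open('field.csv', "rb") as csvfile:
    for row in csv.reader(csvfile):
        # Column 1 holds the pipe-separated attributes for one item.
        row_elements = row[1].split("|")
        # Every pair of attributes in the row counts as one co-occurrence.
        for a, b in itertools.combinations(row_elements, 2):
            if G.has_edge(a, b):
                G[a][b]['weight'] += 1
            else:
                G.add_edge(a, b, weight=1)

print(G.edges(data=True))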

Edit: whoa, see if this does everything for you: http://networkx.github.io/documentation/latest/reference/linalg.html#module-networkx.linalg.graphmatrix
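The "loop through each vertex" step can be sketched without networkx, using a plain dict of dicts as a stand-in for the weighted undirected graph. The two sample rows are made up for illustration, and dropping repeated attributes within a row (so "peaty" listed three times counts once) is an assumption about what the question wants, not something the answer specifies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows in the question's "a|b|c" format.
rows = [
    "coffee|peaty|sweet|peaty",
    "oil|sweet|coffee",
]

# adjacency[a][b] is the weight of the undirected edge a-b.
adjacency = defaultdict(lambda: defaultdict(int))
for row in rows:
    # sorted(set(...)) drops repeats within a row, so no self-loops appear.
    for a, b in combinations(sorted(set(row.split("|"))), 2):
        adjacency[a][b] += 1
        adjacency[b][a] += 1  # undirected: store both directions

vertices = sorted(adjacency)
# Loop over each vertex to emit one matrix row per attribute.
matrix = [[adjacency[u].get(v, 0) for v in vertices] for u in vertices]

# Print a header line followed by one labelled line per attribute.
print(" " * 8 + "".join(f"{v:>8}" for v in vertices))
for u, line in zip(vertices, matrix):
    print(f"{u:<8}" + "".join(f"{n:>8}" for n in line))
```

The result is symmetric, so either triangle alone carries all the information if you want each pair to show up only once.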

I would use a Counter keyed by a tuple of the two strings. Of course you'll get every combination twice, but so far I don't see how to avoid this:

import csv
from collections import Counter
from itertools import combinations

counter = Counter()
# "rb" is the Python 2 csv idiom; on Python 3 use open("field.csv", newline="").
with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        # Count each pair of attributes that shares a row.
        for cmb in combinations(attribs, 2):
            counter[cmb] += 1
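One possible way around the doubled keys mentioned above is to sort the attributes before pairing them, so that ("coffee", "sweet") and ("sweet", "coffee") collapse into a single key. This is only a sketch: the sample rows and the `cooccurrence` helper are hypothetical, and de-duplicating attributes within a row is an assumption about the desired counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample rows in the question's "a|b|c" format.
rows = [
    "coffee|peaty|sweet|peaty",
    "sweet|coffee|oil",
]

counter = Counter()
for row in rows:
    # set() removes repeats within a row; sorted() fixes the order of each pair.
    attribs = sorted(set(row.split("|")))
    counter.update(combinations(attribs, 2))

def cooccurrence(a, b):
    """Hypothetical helper: look up a pair regardless of argument order."""
    return counter[tuple(sorted((a, b)))]

print(cooccurrence("sweet", "coffee"))  # 2
```

Because every key is stored in sorted order, each pair appears once, which corresponds to keeping only one triangle of the matrix.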
