
Creating a matrix from a list of attributes

I have a CSV with a list of items, and each has a series of attributes attached:

"5","coffee|peaty|sweet|cereal|cream|barley|malt|creosote|sherry|sherry|manuka|honey|peaty|peppercorn|chipotle|chilli|salt|caramel|coffee|demerara|sugar|molasses|spicy|peaty"
"6","oil|lemon|apple|butter|toffee|treacle|sweet|cola|oak|cereal|cinnamon|salt|toffee"

"5" and "6" are both item IDs and unique in the file.

Ultimately, I want to create a matrix showing how many times in the document each attribute was mentioned in the same row as every other attribute. E.g.:

        peaty    sweet    cereal    cream    barley ...
coffee    1       2         2         1        1
oil       0       1         0         0        0 

Note that I'd prefer to reduce duplicates, i.e. "peaty" shouldn't be both a column and a row.

The original database is essentially a key-value store (a table with columns "itemId" and "value") -- I can reformat the data if it helps.

Any idea how I'd do this with Python, PHP or Ruby (whichever is easiest)? I get the feeling Python can probably do this most easily of the three, but I'm missing something fairly basic and/or crucial (I'm just starting to do data analysis with Python).

Thanks!

Edit: In response to the (somewhat unhelpful) "What have you tried" comment, here's what I'm currently working with (Don't laugh, my Python is terrible):

#!/usr/bin/python
import csv

matrix = {}

with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        for attrib in attribs:
            if attrib not in matrix:
                matrix[attrib] = {}
            for attrib2 in attribs:
                if attrib2 in matrix[attrib]:
                    matrix[attrib][attrib2] = matrix[attrib][attrib2] + 1 
                else:
                    matrix[attrib][attrib2] = 1
print matrix 

The output is a big, unsorted dictionary of terms, likely with a lot of duplication between the rows and columns. If I use pandas and replace the "print matrix" line with the following...

from pandas import *
df = DataFrame(matrix).T.fillna(0)
print df

I get:

<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, acacia to zesty
Columns: 195 entries, acacia to zesty
dtypes: float64(195)

...Which leads me to think I'm doing something rather wrong.

I'd do this with an undirected graph, where each attribute is a vertex and the edge weight between two vertices is the number of times those two attributes occurred in the same row. Then you can generate the matrix quite easily by looping through each vertex and reading off the weights of its edges.

Graph docs: http://networkx.github.io/documentation/latest/reference/classes.graph.html

Starter code:

import csv
import itertools
import networkx as nx

G = nx.Graph()

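# "rb" is the Python 2 idiom for csv; on Python 3 use open('field.csv', newline='')
with open('field.csv', "rb") as csvfile:
    for row in csv.reader(csvfile):
        row_elements = row[1].split("|")
        # Every unordered pair of attributes in this row. An attribute that
        # repeats within a row also gets paired with itself (a self-loop).
        for a, b in itertools.combinations(row_elements, 2):
            if G.has_edge(a, b):
                G[a][b]['weight'] += 1
            else:
                G.add_edge(a, b, weight=1)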

print(G.edges(data=True))

Edit: whoa, see if the networkx.linalg.graphmatrix module already does everything for you: http://networkx.github.io/documentation/latest/reference/linalg.html#module-networkx.linalg.graphmatrix
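
If you'd rather see the "loop through the edges and fill in the matrix" step spelled out, here is a minimal sketch in plain Python 3. It assumes the edges are available as (a, b, weight) triples, which is the shape recent networkx versions yield from G.edges(data='weight'). The function name edges_to_matrix and the sample edge list are made up for illustration and are not part of networkx.

```python
def edges_to_matrix(weighted_edges):
    """Build a symmetric co-occurrence matrix from a list of (a, b, weight) triples."""
    # Collect every vertex that appears in at least one edge, in sorted order.
    labels = sorted({v for a, b, _ in weighted_edges for v in (a, b)})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, b, weight in weighted_edges:
        i, j = index[a], index[b]
        matrix[i][j] = weight
        matrix[j][i] = weight  # undirected graph, so mirror the weight
    return labels, matrix


# Hypothetical edge list, standing in for list(G.edges(data='weight')).
sample_edges = [("coffee", "peaty", 3), ("coffee", "sweet", 1), ("oil", "sweet", 1)]
labels, matrix = edges_to_matrix(sample_edges)
for label, row in zip(labels, matrix):
    print(label, row)
```

The result is the full symmetric matrix, so every pair shows up twice (once on each side of the diagonal). If you only want each pair once, read just the cells above the diagonal.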

I would use a Counter keyed by a tuple of the two strings. Of course you'll get every combination twice, but so far I don't see how to avoid this:

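import csv
from collections import Counter
from itertools import combinations

counter = Counter()
# "rb" is the Python 2 idiom for csv; on Python 3 use open("field.csv", newline="")
with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        # Count every pair of attributes that appears in this row
        for cmb in combinations(attribs, 2):
            counter[cmb] += 1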
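
One possible way around the doubled pairs is to sort the attributes of each row before combining them, so that ("coffee", "sweet") and ("sweet", "coffee") always end up under the same key. The sketch below is written for Python 3 and makes two assumptions of its own: that an attribute repeated within a row should count only once for that row (hence the set()), and a tiny made-up sample in place of field.csv. The helper name count_pairs is hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Made-up sample in the same layout as the question's CSV (not the real data).
SAMPLE = '"5","coffee|peaty|sweet|peaty"\n"6","oil|sweet|coffee"\n'


def count_pairs(csvfile):
    """Count how often each unordered pair of attributes shares a row."""
    counter = Counter()
    for row in csv.reader(csvfile):
        # set() drops repeats within a row (an assumption about what is wanted);
        # sorted() gives every pair one canonical order.
        attribs = sorted(set(row[1].split("|")))
        for pair in combinations(attribs, 2):
            counter[pair] += 1
    return counter


counts = count_pairs(io.StringIO(SAMPLE))

# Print only the upper triangle, so no pair is listed twice.
labels = sorted({name for pair in counts for name in pair})
for i, a in enumerate(labels):
    cells = ["%s=%d" % (b, counts[(a, b)]) for b in labels[i + 1:]]
    print(a, cells)
```

To run it on the real file, pass an open file object instead of the StringIO, e.g. count_pairs(open("field.csv", newline="")). If repeats within a row should count multiple times, drop the set() call and keep the sorted().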
