
How to cluster users based on tags

I'd like to cluster users based on the categories or tags of shows they watch. What's the easiest/best algorithm to do this?

Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using, say, Pig/Hadoop/Mortar, or perhaps on neo4j?

In terms of data I have users, programs they've watched, and the tags that a program has (usually around 10 tags per program).

At the end I would like to have k clusters (maybe a dozen?), i.e. broad buckets that I can use to classify and group my users. Each cluster should be represented by a set of tags, so that I also gain some insight into how the users are divided.

I've seen some posts suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would that be a distance between two users, or between a user and a set of tags, etc.?

You basically want to cluster the users according to their tags.

To keep it simple, assume that you have only 10 tags (instead of 20,000). Assume that a user, say user_34, has the 2nd and 7th tags. For this clustering task, user_34 can be represented as a point in 10-dimensional space with coordinates [0,1,0,0,0,0,1,0,0,0].
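As a minimal sketch of that encoding in plain Python (the helper name `tags_to_vector` is made up for illustration and is not part of any library), a user's tag positions can be turned into a 0/1 vector like this:

```python
NUM_TAGS = 10  # toy vocabulary size from the example, not the real 20,000


def tags_to_vector(tag_positions, num_tags=NUM_TAGS):
    """Return a 0/1 list with a 1 at each (1-based) tag position."""
    vector = [0] * num_tags
    for position in tag_positions:
        vector[position - 1] = 1  # the text counts tags from 1, lists from 0
    return vector


# user_34 has the 2nd and 7th tag
user_34 = tags_to_vector([2, 7])
print(user_34)  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```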

In your own case, each user can be represented in the same way as a point in a 20,000-dimensional space. You can use Apache Mahout, which contains many effective clustering algorithms, such as K-means.
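To make the K-means idea concrete, here is a small plain-Python sketch. It is not Mahout's implementation; the function names, the toy 4-tag users, and the choice of starting centroids are all assumptions made for illustration. It alternates between assigning each user to the nearest centroid and moving each centroid to the mean of its members.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Return (assignments, centroids) after running plain K-means."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical users over a 4-tag vocabulary.
users = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
assignments, centroids = k_means(users, initial_centroids=[users[0], users[2]])
print(assignments)  # [0, 0, 1, 1]
```

With 20,000 tags and millions of users, a distributed or sparse-aware implementation would be the practical choice; this sketch only shows the two alternating steps.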

Since every user is now a point in a well-defined coordinate system, computing the distance between any two users is easy. Any distance function can be used; Euclidean distance is the usual default for K-means, while cosine or Jaccard distance are also common choices for sparse 0/1 tag vectors.
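For example, the Euclidean distance between two users can be computed as follows. This is a sketch using only the standard library; `user_51` is a hypothetical second user invented for the comparison.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length tag vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


user_34 = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]  # 2nd and 7th tag
user_51 = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # hypothetical: 2nd and 4th tag

# The users differ in two coordinates (4th and 7th), so the distance is sqrt(2).
distance = euclidean_distance(user_34, user_51)
print(distance)
```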

Note: Mahout and many other data-mining programs support formats suited to SPARSE features, i.e. you do not need to write ...,0,0,0,0,... in the file; you only need to specify which tags are selected. (See RandomAccessSparseVector in Mahout.)
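The same idea can be sketched in plain Python by storing each user as the set of tag positions they have. This is not Mahout's RandomAccessSparseVector, just an illustration with made-up names. For 0/1 vectors, the squared Euclidean distance equals the number of tags that exactly one of the two users has, so the zeros never need to be stored.

```python
import math


def sparse_euclidean_distance(tags_a, tags_b):
    """Euclidean distance between two 0/1 vectors stored as sets of tag ids.

    Each tag present in exactly one of the sets contributes 1 to the
    squared distance; shared and absent tags contribute 0.
    """
    return math.sqrt(len(tags_a ^ tags_b))  # ^ is the symmetric difference


user_34_tags = {2, 7}  # only the selected tags are stored
user_51_tags = {2, 4}  # hypothetical second user

sparse_distance = sparse_euclidean_distance(user_34_tags, user_51_tags)
print(sparse_distance)
```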

Note: I assumed you only want to cluster your users. Extracting representative information from the clusters is somewhat tricky. For example, for each cluster you may select the tags that are most common among the users of that cluster. Alternatively, you may use concepts from information theory, such as information gain, to find out which tags carry the most information about the cluster.
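The first option, picking the most common tags per cluster, might look like the sketch below. The user names, tags, and cluster assignments are invented sample data, and `top_tags_per_cluster` is a hypothetical helper; ties are broken alphabetically so the result is deterministic.

```python
from collections import Counter


def top_tags_per_cluster(user_tags, assignments, top_n=2):
    """Map each cluster id to its top_n most frequent tags."""
    counts = {}
    for user, tags in user_tags.items():
        cluster = assignments[user]
        counts.setdefault(cluster, Counter()).update(tags)
    return {
        cluster: [
            tag
            # Sort by descending count, then alphabetically to break ties.
            for tag, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        ]
        for cluster, counter in counts.items()
    }


# Hypothetical users, tags, and cluster assignments.
user_tags = {
    "u1": {"comedy", "sitcom"},
    "u2": {"comedy", "drama"},
    "u3": {"sports", "news"},
    "u4": {"sports"},
}
assignments = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}

representative = top_tags_per_cluster(user_tags, assignments, top_n=1)
print(representative)  # {0: ['comedy'], 1: ['sports']}
```

Raw frequency tends to favour tags that are popular everywhere, which is one reason a measure such as information gain can be more informative.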

You should consider using neo4j. You can model your data using the following node labels and relationship types.

If you are not familiar with neo4j's Cypher language notation, (:Foo) represents a node with the label Foo, and [:BAR] represents a relationship with the type BAR. The arrows around a relationship indicate its directionality. neo4j efficiently traverses relationships in both directions.

(:Cluster) -[:INCLUDES_TAG]-> (:Tag) <-[:HAS_TAG]- (:Program) <-[:WATCHED]- (:User)

You'd have k Cluster nodes, 20K Tag nodes, and several million WATCHED relationships.

With this model, starting with any given Cluster node, you can efficiently find all its related tags, programs, and users.
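To show what such a traversal visits, here is an in-memory Python stand-in that uses dictionaries in place of the graph. It does not use neo4j or Cypher, and all names and sample data are hypothetical; it follows the same path from a cluster to its tags, then to programs, then to users.

```python
# (:Cluster) -[:INCLUDES_TAG]-> (:Tag)
cluster_tags = {"cluster_1": {"comedy", "sitcom"}}
# (:Program) -[:HAS_TAG]-> (:Tag)
program_tags = {
    "show_a": {"comedy"},
    "show_b": {"news"},
    "show_c": {"sitcom", "drama"},
}
# (:User) -[:WATCHED]-> (:Program)
watched = {
    "alice": {"show_a"},
    "bob": {"show_b"},
    "carol": {"show_b", "show_c"},
}


def programs_for_cluster(cluster):
    """Programs that share at least one tag with the cluster."""
    tags = cluster_tags[cluster]
    return {program for program, ptags in program_tags.items() if ptags & tags}


def users_for_cluster(cluster):
    """Users who watched at least one program related to the cluster."""
    programs = programs_for_cluster(cluster)
    return {user for user, seen in watched.items() if seen & programs}


print(sorted(programs_for_cluster("cluster_1")))  # ['show_a', 'show_c']
print(sorted(users_for_cluster("cluster_1")))     # ['alice', 'carol']
```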

