Which data mining algorithm is suggested for this situation?

This is not directly a programming question; it is about selecting the right data mining algorithm.

I have a number of folders, say 100, whose contents are images and text documents. For each folder I have one Excel table (100 tables in total), laid out as follows:

The column headers list the contents of the folder, and the rows list the files I want to check (my test files). The values in the table are 0s and 1s: 1 if the file is found in that folder, 0 otherwise. The test file names are the same for all folders.

Q: Which data mining algorithm works best on these Excel tables and can cluster the folders based on which test files they contain? For example, cluster 1 would include the folders that contain file 1, file 20, and file 25, and so on. Assume I am using Matlab.

Thanks.

The English here is a little confusing, so I'll interpret the question as best I can. What you want to do doesn't seem to require any complicated algorithm. Export your Excel data as CSV so that you can work with it in Matlab.

Right now you have data as follows:

Folder -> [ Files ]

You probably want to build an index this way:

File -> [ Folders ]

This way, when you ask the question "What folders contain files 1, 20 and 25?", you can look up three things (each in constant time, if the index is a hash map):

  1. Folders that contain file 1
  2. Folders that contain file 20
  3. Folders that contain file 25

And then take the intersection of those sets.
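
The lookup described above can be sketched roughly as follows. This is only an illustration in Python, not Matlab, and the folder names and file ids are made up; the same logic should port to Matlab with a map from file id to a list of folders. It assumes each folder's table has already been reduced to the set of test files marked 1.

```python
from collections import defaultdict

# Hypothetical data: folder name -> set of test-file ids marked 1
# in that folder's table. Names and ids are invented for illustration.
folder_to_files = {
    "folder_A": {1, 20, 25},
    "folder_B": {1, 20},
    "folder_C": {1, 20, 25, 30},
    "folder_D": {25},
}

def build_index(folder_to_files):
    """Invert Folder -> [Files] into File -> {Folders}."""
    index = defaultdict(set)
    for folder, files in folder_to_files.items():
        for file_id in files:
            index[file_id].add(folder)
    return index

def folders_containing_all(index, wanted_files):
    """Look up each requested file and intersect the resulting folder sets."""
    folder_sets = [index.get(file_id, set()) for file_id in wanted_files]
    if not folder_sets:
        return set()
    return set.intersection(*folder_sets)

file_to_folders = build_index(folder_to_files)
result = folders_containing_all(file_to_folders, [1, 20, 25])
print(sorted(result))
```

With the sample data, only folder_A and folder_C contain all three files, so those are the two names printed.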

===================================================

The other thing you might be interested in doing is "clustering". For that, take each folder's descriptor (its ones and zeroes) and treat it as a feature vector. Then run any clustering algorithm on those vectors [1]. K-means clustering is an easy one to implement in Matlab.
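
To make the clustering idea concrete, here is a minimal k-means sketch on 0/1 folder vectors. It is written in plain Python rather than Matlab, the six folders and five test files are invented, and the deterministic "farthest point" initialisation is my own simplification so that the example is reproducible; standard k-means usually starts from random centroids. In Matlab, the `kmeans` function from the Statistics and Machine Learning Toolbox would likely be the more practical route.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, k, iterations=100):
    # Assumption: deterministic farthest-point initialisation instead of
    # the usual random start, so the example gives the same result every run.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids

# Hypothetical data: one row per folder, one column per test file (1 = present).
folders = ["f1", "f2", "f3", "f4", "f5", "f6"]
vectors = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]

labels, centroids = kmeans(vectors, k=2)
for name, label in zip(folders, labels):
    print(name, "-> cluster", label)
```

On this toy data the first three folders end up in one cluster and the last three in the other, because each group shares most of its test files. The number of clusters `k` has to be chosen by you; the data does not supply it.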

[1] https://en.wikipedia.org/wiki/Cluster_analysis
