
Which data mining algorithm is suggested for this situation?

Original post: 2015-09-06 22:15:54 | Tags: excel, algorithm, data-mining, text-mining

This is not directly a programming question; it is about selecting the right data mining algorithm.

I have a number of folders, say 100, which contain images and text documents. I also have Excel tables for these folders (100 tables), one table per folder. Each Excel table is laid out as follows:

The column headers list the contents of the folder, and the rows list the files I want to check (my test files). The values in the table are 0s and 1s: 1 if the test file is found in that folder, 0 otherwise. The test file names are the same for all folders.

Q: Which data mining algorithm works best on Excel tables like these and can cluster the folders based on which test files they contain? For example, cluster 1 would include the folders that contain file 1, file 20 and file 25, and so on. Assume I am using Matlab.

Thanks.

1 Answer

The English here is a little confusing, so I'll interpret the question as best I can. What you want to do here doesn't seem to require any complicated algorithm. Take your Excel data and export it as CSV so that you can work with it in Matlab.
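As a rough illustration of the export step, here is a minimal sketch of turning one folder's CSV export into a 0/1 descriptor per test file. It is written in Python rather than Matlab purely for illustration. The CSV layout (a header row naming the folder's contents, one row per test file, test file name in the first column) and the names `SAMPLE_CSV` and `folder_descriptor` are assumptions based on the question's description, not something the asker confirmed.

```python
import csv
import io

# Hypothetical CSV export of one folder's table (the layout is an assumption):
# the header row names the folder's contents, each later row is a test file,
# and a cell is 1 when that test file matches that item in the folder.
SAMPLE_CSV = """test_file,item_a,item_b,item_c
file1,1,0,0
file2,0,0,0
file3,0,0,1
"""


def folder_descriptor(csv_text):
    """Return {test_file: 0 or 1}, where 1 means the file was found in the folder."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row that lists the folder's contents
    descriptor = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name, cells = row[0], row[1:]
        # The test file counts as present if any cell in its row is 1.
        descriptor[name] = int(any(cell.strip() == "1" for cell in cells))
    return descriptor


descriptor = folder_descriptor(SAMPLE_CSV)
print(descriptor)
```

Running this once per folder would give one 0/1 vector per folder, which is the form the rest of the answer works with.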

Right now you have data as follows:

Folder -> [ Files ]

You probably want to build an index this way:

File -> [ Folders ]

This way, when you ask the question "What folders contain files 1, 20 and 25?", you can look up 3 things (each in constant time if the index is a hash map):

Folders that contain file 1
Folders that contain file 20
Folders that contain file 25

And then take the intersection of those sets.
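The index and the intersection step could be sketched as follows. This is an illustration in Python rather than Matlab; the folder names, the file numbers, and the helpers `build_index` and `folders_containing` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical data: each folder mapped to the test files found in it.
folder_to_files = {
    "folder_a": {1, 20, 25},
    "folder_b": {1, 20},
    "folder_c": {1, 20, 25, 30},
    "folder_d": {25},
}


def build_index(folder_to_files):
    """Invert Folder -> [Files] into File -> {Folders}."""
    index = defaultdict(set)
    for folder, files in folder_to_files.items():
        for file_id in files:
            index[file_id].add(folder)
    return index


def folders_containing(index, wanted_files):
    """Look up each wanted file, then intersect the resulting folder sets."""
    sets = [index.get(file_id, set()) for file_id in wanted_files]
    return set.intersection(*sets) if sets else set()


file_to_folders = build_index(folder_to_files)
result = folders_containing(file_to_folders, [1, 20, 25])
print(sorted(result))  # ['folder_a', 'folder_c']
```

Each lookup in the dictionary is a constant-time operation on average; the cost of the query is dominated by intersecting the folder sets.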

====================================================

The other thing you might be interested in doing is "clustering" [1]. For that, take each folder's descriptor (its row of ones and zeroes, one entry per test file) and treat it as a feature vector. Then run any clustering algorithm on those vectors. K-means clustering is an easy one to implement in Matlab.
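To make the clustering idea concrete, here is a small, self-contained sketch of plain (Lloyd's) k-means on 0/1 folder descriptors, again in Python for illustration. The folder names and vectors are invented. The farthest-first seeding in `farthest_first` is a simplification chosen only to keep the example deterministic; it is not necessarily what a library implementation does.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(vectors, k):
    """Deterministic seeding (an assumption): start from the first vector,
    then repeatedly add the vector farthest from the chosen centroids."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iter=100):
    """Plain Lloyd's algorithm: assign to nearest centroid, then re-average."""
    centroids = farthest_first(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical descriptors: one 0/1 entry per test file, one row per folder.
folders = {
    "folder_a": [1, 1, 1, 0, 0],
    "folder_b": [1, 1, 0, 0, 0],
    "folder_c": [0, 0, 0, 1, 1],
    "folder_d": [0, 0, 1, 1, 1],
}

names = list(folders)
labels, centroids = kmeans([folders[n] for n in names], k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

With this toy data, folder_a and folder_b end up in one cluster and folder_c and folder_d in the other. In Matlab, the `kmeans` function from the Statistics and Machine Learning Toolbox can be applied to the same kind of folder-by-file matrix, so a hand-written loop like this is not required there.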

[1] https://en.wikipedia.org/wiki/Cluster_analysis