
Storing text mining data

I am looking to track topic popularity across a very large number of documents. Furthermore, I would like to give recommendations to users based on topics, rather than the usual bag-of-words model. To extract the topics I use natural language processing techniques that are beyond the scope of this post.

My question is how I should persist this data so that:

I) I can quickly fetch trending data for each topic (in principle, every time a user opens a document, the topics in that document should go up in popularity)
II) I can quickly compare documents to provide recommendations (here I am thinking of using clustering techniques)

More specifically, my questions are:

1) Should I go with the usual way of storing text mining data, meaning a topic occurrence vector for each document, so that I can later measure the Euclidean distance between different documents?
2) Or is there some other way?
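To make option 1 concrete, here is a minimal sketch of a sparse topic occurrence vector and the Euclidean distance between two documents. The dict-based representation, the function name `euclidean_distance`, and the sample topics are assumptions for illustration, not part of the original question. Storing only the topics that actually occur means new topics can appear without resizing any existing vector.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two sparse topic-count vectors.

    Each vector is a dict mapping topic name -> occurrence count;
    a topic missing from a dict is treated as a count of 0.
    """
    topics = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in topics))

# Hypothetical documents with made-up topic counts.
doc_a = {"databases": 3, "python": 1}
doc_b = {"databases": 1, "clustering": 2}

dist = euclidean_distance(doc_a, doc_b)
print(dist)
```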

I am looking for specific Python ways to do this. I have looked into SQL and NoSQL databases, and also into pytables and h5py, but I am not sure how I would go about implementing a system like this. One of my concerns is how to deal with an ever-growing vocabulary of topics.

Thank you very much.

I would suggest that you do this work in a SQL database. You may not want to store the documents themselves there, but the topics are a good fit.

You want one table just for the topics:

create table Topics (
    TopicId int identity(1,1), -- SQL Server syntax for an auto-increment column
    TopicName varchar(255),
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
);

You want another table for the topics assigned to documents, assuming that you have some sort of document id to identify documents:

create table DocumentTopics (
    DocumentTopicId int identity(1,1), -- SQL Server syntax for an auto-increment column
    TopicId int,
    DocumentId int,
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
);

And another table for document views:

create table DocumentView (
    DocumentViewId int identity(1,1), -- SQL Server syntax for an auto-increment column
    DocumentId int,
    ViewedAt datetime,
    ViewedBy int, -- some sort of user id
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
);

Now you can get the topics by popularity for a given date range using a query such as:

select t.TopicId, t.TopicName, count(*) as cnt
from DocumentView dv join
     DocumentTopics dt
     on dv.DocumentId = dt.DocumentId join
     Topics t
     on dt.TopicId = t.TopicId
where dv.ViewedAt between <date1> and <date2>
group by t.TopicId, t.TopicName
order by cnt desc
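Since the question asks for Python, here is a rough sketch of the same three-table layout and popularity query driven from Python's standard `sqlite3` module. It assumes SQLite rather than SQL Server, so the column types and auto-increment syntax differ from the definitions above and the audit columns are left out. The sample rows and the helper name `topic_popularity` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
conn.executescript("""
create table Topics (
    TopicId integer primary key,   -- SQLite assigns ids automatically
    TopicName text
);
create table DocumentTopics (
    DocumentTopicId integer primary key,
    TopicId integer references Topics(TopicId),
    DocumentId integer
);
create table DocumentView (
    DocumentViewId integer primary key,
    DocumentId integer,
    ViewedAt text,                 -- ISO-8601 date string, so string order matches date order
    ViewedBy integer
);
""")

# Hypothetical sample data.
conn.executemany("insert into Topics (TopicId, TopicName) values (?, ?)",
                 [(1, "databases"), (2, "clustering")])
conn.executemany("insert into DocumentTopics (TopicId, DocumentId) values (?, ?)",
                 [(1, 10), (2, 10), (1, 11)])
conn.executemany(
    "insert into DocumentView (DocumentId, ViewedAt, ViewedBy) values (?, ?, ?)",
    [(10, "2024-01-01", 7), (11, "2024-01-02", 7),
     (11, "2024-01-03", 8), (10, "2024-02-15", 8)])

def topic_popularity(conn, start, end):
    """Return (TopicId, TopicName, view count) rows for views in [start, end]."""
    return conn.execute("""
        select t.TopicId, t.TopicName, count(*) as cnt
        from DocumentView dv
        join DocumentTopics dt on dv.DocumentId = dt.DocumentId
        join Topics t on dt.TopicId = t.TopicId
        where dv.ViewedAt between ? and ?
        group by t.TopicId, t.TopicName
        order by cnt desc
    """, (start, end)).fetchall()

popular = topic_popularity(conn, "2024-01-01", "2024-01-31")
print(popular)
```

Recording a document view is then a single insert into `DocumentView`, and every topic attached to that document gains a count the next time the query runs.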

You can also get information about users, changes over time, and so on. You could have a users table, which could provide weights for the topics (more reliable users, less reliable users). This aspect of the system should be done in SQL.

Why not use simple SQL tables?

Tables:

  • documents, with a primary key of id or file name or something similar
  • observations, with a foreign key into documents plus the term (indexed on both fields, probably with a unique constraint)

The array approach you mentioned seems like a slow way to get at terms. With SQL you can easily add new terms to the observations table.

It is also easy to aggregate, and even to do trending analysis by aggregating by date, if the documents table includes a timestamp.
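As a rough illustration of this two-table layout, the sketch below uses Python's standard `sqlite3` module. The column names (`created_at`, `document_id`, `term`), the helper `add_observation`, and the sample rows are assumptions, not something the answer specifies. The unique constraint on (document, term) lets a new term be added with a plain insert, with no schema change as the vocabulary grows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
conn.executescript("""
create table documents (
    id integer primary key,
    created_at text                 -- ISO-8601 date string used as the timestamp
);
create table observations (
    document_id integer not null references documents(id),
    term text not null,
    unique (document_id, term)      -- one row per (document, term) pair
);
""")

def add_observation(conn, document_id, term):
    """Record that a term occurs in a document; duplicates are silently ignored."""
    conn.execute(
        "insert or ignore into observations (document_id, term) values (?, ?)",
        (document_id, term))

# Hypothetical sample data.
conn.executemany("insert into documents (id, created_at) values (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-01"), (3, "2024-01-02")])
for doc_id, term in [(1, "python"), (2, "python"), (3, "python"),
                     (3, "hdf5"), (3, "hdf5")]:
    add_observation(conn, doc_id, term)

# Number of documents mentioning each term, per document date.
trend = conn.execute("""
    select d.created_at, o.term, count(*) as cnt
    from observations o
    join documents d on o.document_id = d.id
    group by d.created_at, o.term
    order by d.created_at, o.term
""").fetchall()
print(trend)
```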

