Is this an efficient MySQL database design?

Question

I am working on a project in which I have a set of keywords [abc, xyz, klm]. I also have a bunch of text files with content [1.txt, 2.txt, 3.txt].

What I am doing is matching the keywords against the text files to find the lines where each keyword occurs, which can happen multiple times. So I want to store the ID (text file name without .txt), Extracted_Data, Line_Number, and Spawned_Across (a keyword may be spread across 2 lines) for each occurrence.
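The matching step described above might look roughly like the following. This is only a sketch: `find_occurrences` is a hypothetical helper, and it assumes that "spread across 2 lines" means a multi-word keyword wrapped over a line break, which the question does not spell out. It records at most one hit per keyword per line.

```python
def find_occurrences(lines, keywords):
    """Return (keyword, line_number, extracted_data, spawned_across) tuples."""
    hits = []
    for i, line in enumerate(lines):
        for kw in keywords:
            if kw in line:
                # Keyword sits entirely on one line.
                hits.append((kw, i + 1, line, 1))
            elif i + 1 < len(lines):
                # Assumption: a wrapped keyword is rejoined with a single space.
                joined = line.rstrip() + " " + lines[i + 1].lstrip()
                if kw in joined and kw not in lines[i + 1]:
                    # Keyword only appears once the two lines are joined.
                    hits.append((kw, i + 1, joined, 2))
    return hits


sample_lines = ["MySQL is wonderful. What is 'abc'", "a multi", "word keyword here"]
sample_hits = find_occurrences(sample_lines, ["abc", "multi word"])
```

Each tuple maps directly onto the four columns listed in the question, with the file ID supplied by whatever loop reads the files.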

I decided to create a table for each keyword to store this data.

Tables: abc, xyz, klm

Table abc sample data:

ID Extracted_Data                         Line_Number Spawned_Across
12 MySQL is wonderful. What is 'abc'      34          1

So I end up with a table for each keyword. In my project there are about 150 keywords, and that number can grow. So 150 tables.

Why did I choose to do it this way?

For now I am only required to find whether a keyword exists in a file, but I am sure that in the future I will be asked to show where or how it occurred in the file. I am planning on creating a table automatically for each new keyword; this way I don't have to manually create each one of them or maintain a giant table with hundreds of columns.

Did I make the right decision? Your input is highly appreciated.

Answer 1

Don't do that. No database library is optimized for dynamic table names, and you'll end up having to build your query from scratch each time you want to access a table. Also, how would you answer questions like "what data did I find on line 34 of file 12"?
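To make the point about dynamic table names concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in (an assumption; the question is about MySQL, but placeholders in most drivers are likewise for values, not identifiers). The helper `rows_for_keyword_table` and the sample row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abc (id INTEGER, extracted_data TEXT)")
conn.execute("INSERT INTO abc VALUES (12, 'What is abc')")

# A table name cannot be passed as a bound parameter.
try:
    conn.execute("SELECT * FROM ?", ("abc",))
    placeholder_worked = True
except sqlite3.OperationalError:
    placeholder_worked = False


def rows_for_keyword_table(conn, table_name):
    # With one table per keyword, the SQL text has to be assembled by hand,
    # so the name must be validated before it is spliced into the query.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    return conn.execute(
        f'SELECT id, extracted_data FROM "{table_name}"'
    ).fetchall()


abc_rows = rows_for_keyword_table(conn, "abc")
```

Every query against a per-keyword table needs this kind of string building, and a question that spans keywords would have to repeat it for all 150 tables.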

You'll want three tables. In PostgreSQL syntax [*], that'd be:

-- One row per text file.
CREATE TABLE source (sourceid SERIAL PRIMARY KEY, filename VARCHAR NOT NULL);
-- One row per distinct keyword.
CREATE TABLE keyword (keywordid SERIAL PRIMARY KEY, keyword VARCHAR NOT NULL);
-- One row per occurrence of a keyword in a file.
CREATE TABLE location (locationid SERIAL PRIMARY KEY,
    sourceid INTEGER NOT NULL REFERENCES source(sourceid),
    keywordid INTEGER NOT NULL REFERENCES keyword(keywordid),
    data VARCHAR NOT NULL,
    line INTEGER NOT NULL,
    span INTEGER NOT NULL);

When you start processing a new text file, create a new source tuple and remember its sourceid. When you encounter a keyword, either insert a new record for it and remember its keywordid, or look up the existing record. Then insert that sourceid, keywordid, and the other relevant data into location.
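The steps just described could be sketched as follows. This uses Python's sqlite3 module with SQLite column types rather than PostgreSQL or MySQL, purely so the example runs without a server. The helper names (`add_source`, `keyword_id`, `add_location`), the UNIQUE constraints, and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (sourceid INTEGER PRIMARY KEY, filename TEXT NOT NULL UNIQUE);
CREATE TABLE keyword (keywordid INTEGER PRIMARY KEY, keyword TEXT NOT NULL UNIQUE);
CREATE TABLE location (
    locationid INTEGER PRIMARY KEY,
    sourceid INTEGER NOT NULL REFERENCES source(sourceid),
    keywordid INTEGER NOT NULL REFERENCES keyword(keywordid),
    data TEXT NOT NULL,
    line INTEGER NOT NULL,
    span INTEGER NOT NULL
);
""")


def add_source(conn, filename):
    # New text file: create its source row and remember the generated id.
    cur = conn.execute("INSERT INTO source (filename) VALUES (?)", (filename,))
    return cur.lastrowid


def keyword_id(conn, word):
    # Look up the existing keyword row, or insert one if it is new.
    row = conn.execute(
        "SELECT keywordid FROM keyword WHERE keyword = ?", (word,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO keyword (keyword) VALUES (?)", (word,))
    return cur.lastrowid


def add_location(conn, sourceid, word, data, line, span):
    conn.execute(
        "INSERT INTO location (sourceid, keywordid, data, line, span) "
        "VALUES (?, ?, ?, ?, ?)",
        (sourceid, keyword_id(conn, word), data, line, span),
    )


sid = add_source(conn, "12.txt")
add_location(conn, sid, "abc", "MySQL is wonderful. What is 'abc'", 34, 1)
add_location(conn, sid, "abc", "abc again", 40, 1)
add_location(conn, sid, "xyz", "abc or xyz?", 41, 1)

keyword_count = conn.execute("SELECT COUNT(*) FROM keyword").fetchone()[0]
location_count = conn.execute("SELECT COUNT(*) FROM location").fetchone()[0]
```

Adding a new keyword is then an ordinary INSERT into keyword; no schema change is involved.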

To answer the question I posed earlier:

SELECT * FROM
    location JOIN source ON location.sourceid = source.sourceid
    JOIN keyword ON location.keywordid = keyword.keywordid
WHERE
    source.filename = '12.txt' AND
    location.line = 34;

Yes, it's more work up front to do it the "right" way, but you'll be paid back a million times over in performance, ease of maintenance, and ease of using the results.

[*] The MySQL syntax will be similar, but I don't remember it off the top of my head, and you can figure out the differences pretty easily.

Answer 2

I can't see why you can't just store the keyword alongside the data in one table.

ID  Keyword  Extracted_Data  Line_Number Spawned_Across
12  abc      Abc or xyz?..   31337       1
12  xyz      Abc or xyz?..   31337       1
12  xyz      just xyz here   66666       1
13  xyz      xyz travels!    123         1

So you can query by keyword, by file, or by both; all the data is present. To normalize further, you can store the keywords separately in a "keywords" table and keep only the foreign key in the "occurrences" table.
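As a rough illustration of the single-table layout, the sketch below loads the sample rows from this answer into SQLite through Python's sqlite3 module and queries them by keyword, by file, or by both. The table name `occurrences`, the lowercase column names (including `file_id` in place of ID), and the `find` helper are assumptions, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE occurrences (
    file_id INTEGER NOT NULL,
    keyword TEXT NOT NULL,
    extracted_data TEXT NOT NULL,
    line_number INTEGER NOT NULL,
    spawned_across INTEGER NOT NULL
)
""")
conn.executemany(
    "INSERT INTO occurrences VALUES (?, ?, ?, ?, ?)",
    [
        (12, "abc", "Abc or xyz?..", 31337, 1),
        (12, "xyz", "Abc or xyz?..", 31337, 1),
        (12, "xyz", "just xyz here", 66666, 1),
        (13, "xyz", "xyz travels!", 123, 1),
    ],
)


def find(conn, keyword=None, file_id=None):
    # Build the WHERE clause from whichever filters were supplied;
    # the values themselves are always passed as bound parameters.
    clauses, params = [], []
    if keyword is not None:
        clauses.append("keyword = ?")
        params.append(keyword)
    if file_id is not None:
        clauses.append("file_id = ?")
        params.append(file_id)
    sql = "SELECT file_id, keyword, line_number FROM occurrences"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY file_id, line_number, keyword"
    return conn.execute(sql, params).fetchall()


xyz_hits = find(conn, keyword="xyz")
file_13_hits = find(conn, file_id=13)
abc_in_12 = find(conn, keyword="abc", file_id=12)
```

An index on (keyword, file_id) would be the natural next step if the table grows large.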

Also, it's not very common to name anything other than the primary key "ID".

Answer 3

This is definitely a very bad decision.

Millions of rows are better than millions of tables.

Create 2 tables with suitable foreign keys and you will be fine.

"I will be asked to show where or how it occurred in the file."

This can still be done in 2 tables.
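The answer does not say which two tables it has in mind. One plausible reading, sketched here with Python's sqlite3 module, is a `keywords` table plus an `occurrences` table that references it. All table and column names and the sample row are assumptions. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE keywords (
    keyword_id INTEGER PRIMARY KEY,
    keyword TEXT NOT NULL UNIQUE
);
CREATE TABLE occurrences (
    occurrence_id INTEGER PRIMARY KEY,
    keyword_id INTEGER NOT NULL REFERENCES keywords(keyword_id),
    file_id INTEGER NOT NULL,
    extracted_data TEXT NOT NULL,
    line_number INTEGER NOT NULL,
    spawned_across INTEGER NOT NULL
);
INSERT INTO keywords (keyword_id, keyword) VALUES (1, 'abc');
INSERT INTO occurrences
    (keyword_id, file_id, extracted_data, line_number, spawned_across)
VALUES (1, 12, 'MySQL is wonderful. What is ''abc''', 34, 1);
""")

# "Where did it occur?" is a join plus a filter on the keyword text.
where_abc = conn.execute(
    "SELECT o.file_id, o.line_number, o.spawned_across "
    "FROM occurrences o JOIN keywords k ON k.keyword_id = o.keyword_id "
    "WHERE k.keyword = ?",
    ("abc",),
).fetchall()

# The foreign key rejects an occurrence that points at an unknown keyword.
try:
    conn.execute(
        "INSERT INTO occurrences "
        "(keyword_id, file_id, extracted_data, line_number, spawned_across) "
        "VALUES (999, 12, 'orphan row', 1, 1)"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```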

Answer 4

I don't think this is efficient. I'm not even sure that a relational database is the right tool for the job.

New keywords will mean more tables. That's not scalable.

Keywords and files make me think of indexing and unstructured search. I'd be thinking about Lucene before a relational database.
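To give a feel for the indexing idea, here is a toy inverted index in plain Python. It is not Lucene and does not use Lucene's API; it only illustrates the underlying structure, a map from each keyword to the places it occurs. The `build_index` function and the sample file contents are made up.

```python
from collections import defaultdict


def build_index(files, keywords):
    """Map each keyword to a list of (file_id, line_number) postings."""
    index = defaultdict(list)
    for file_id, text in files.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            for kw in keywords:
                if kw in line:
                    index[kw].append((file_id, line_number))
    return index


sample_files = {
    "12": "nothing here\nWhat is abc\nabc or xyz?",
    "13": "xyz travels!",
}
index = build_index(sample_files, ["abc", "xyz", "klm"])
```

A real search library adds tokenization, ranking, and on-disk storage on top of this, and it indexes every term rather than a fixed keyword list.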

4 answers

Answer 1: 6, accepted, 2011-08-02 15:21:52
Answer 2: 5, 2011-08-02 15:02:36
Answer 3: 2, 2011-08-02 14:57:43
Answer 4: 1, 2011-08-02 15:03:33
