
Working with large (tens of millions of rows) datasets

For a simple web application, the main requirement is to process around 30 million records (10 million rows * 3 tables) as fast as possible. I haven't worked with this amount of data before, so I would like some suggestions/advice from experienced people.

The database will hold details of businesses. Around 25 attributes will describe a single business: name, address, etc. The table structure is as follows.

CREATE TABLE IF NOT EXISTS `businesses` (
    `id` bigint(20) NOT NULL AUTO_INCREMENT,
    `type` int(2) NOT NULL,
    `organisation` varchar(40) NOT NULL,
    `title` varchar(12) NOT NULL,
    `given_name` varchar(40) NOT NULL,
    `other_name` varchar(40) NOT NULL,
    `family_name` varchar(40) NOT NULL,
    `suffix` varchar(5) NOT NULL,
    `reg_date` date NOT NULL,
    `main_trade_name` varchar(150) NOT NULL,
    `son_address_l1` varchar(50) NOT NULL,
    `son_address_l2` varchar(50) NOT NULL,
    `son_address_suburb` int(3) NOT NULL,
    `son_address_state` int(2) NOT NULL,
    `son_address_postcode` varchar(10) NOT NULL,
    `son_address_country` int(3) NOT NULL,
    `bus_address_l1` varchar(50) NOT NULL,
    `bus_address_l2` varchar(50) NOT NULL,
    `bus_address_suburb` int(3) NOT NULL,
    `bus_address_state` int(2) NOT NULL,
    `bus_address_postcode` varchar(10) NOT NULL,
    `bus_address_country` int(3) NOT NULL,
    `email` varchar(165) DEFAULT NULL,
    `phone` varchar(12) NOT NULL,
    `website` varchar(80) NOT NULL,
    `employee_size` int(4) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `type` (`type`),
    KEY `phone` (`phone`),
    KEY `reg_date` (`reg_date`),
    KEY `son_address_state` (`son_address_state`),
    KEY `bus_address_state` (`bus_address_state`),
    KEY `son_address_country` (`son_address_country`),
    KEY `bus_address_country` (`bus_address_country`),
    FULLTEXT KEY `title` (`title`),
    FULLTEXT KEY `son_address_l1` (`son_address_l1`),
    FULLTEXT KEY `son_address_l2` (`son_address_l2`),
    FULLTEXT KEY `bus_address_l1` (`bus_address_l1`),
    FULLTEXT KEY `bus_address_l2` (`bus_address_l2`)
) ENGINE=MyISAM;

There are going to be 2 other tables like this, the reason being that each business's details will be present in 3 sources (for comparison purposes). Only one table is going to receive writes.
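The cross-source comparison could, for example, be expressed as a join on `id`. A minimal sketch (the second table's name, `businesses_src2`, is hypothetical; only the phone attribute is compared here):

```sql
-- Find rows whose phone number disagrees between source 1 and source 2.
SELECT a.id,
       a.phone AS src1_phone,
       b.phone AS src2_phone
  FROM businesses     AS a
  JOIN businesses_src2 AS b ON b.id = a.id
 WHERE a.phone <> b.phone;
```

The same pattern extends to the other attributes, or to a three-way join against the third table.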

About the app usage:

  1. Few writes, loads of reads.
  2. The 10 million * 3 rows of data will not be inserted over time; they are going to be inserted initially.
  3. The app is not going to have lots of requests, <10 requests per second.
  4. After the initial data load, users will be updating these details: comparing one table's data with the other 2 and updating the data in the first table.
  5. There will be lots of searches, mainly by name, by address, by phone and by state. A single search will go through all 3 tables. Searching needs to be fast.
  6. Planning to build it using PHP.
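A search like the one in point 5 might be shaped as below, using only indexes that exist in the schema above (note that the schema has FULLTEXT keys on `title` and the address-line columns but not on `main_trade_name`, so a name search would need an additional FULLTEXT index; the second table's name, `businesses_src2`, is hypothetical):

```sql
-- Example: search by street address plus state across two of the source tables.
-- MATCH ... AGAINST uses the FULLTEXT KEY on bus_address_l1;
-- the equality filter can use the B-tree KEY on bus_address_state.
SELECT id, main_trade_name, phone
  FROM businesses
 WHERE MATCH(bus_address_l1) AGAINST('george street')
   AND bus_address_state = 2

UNION ALL

-- Repeat the same predicate per source table (or merge in application code).
SELECT id, main_trade_name, phone
  FROM businesses_src2
 WHERE MATCH(bus_address_l1) AGAINST('george street')
   AND bus_address_state = 2;
```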

My questions are:

  1. Is it worth handling the 3 sources within one table rather than having 3 tables?
  2. Can MySQL provide a good solution?
  3. Will MongoDB be able to handle the same scenario using fewer hardware resources?
  4. What's the best way to set up a sample database for testing? I purchased an Amazon RDS (large) instance, inserted 10,000 records and doubled them until I got 10 million records.
  5. Any good reading about this subject?
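The doubling approach in point 4 can be done entirely in SQL; each run below doubles the row count (10k → 20k → ... → ~10M), re-inserting every existing row with a fresh AUTO_INCREMENT id. Column list abbreviated with the leading columns of the schema above; the full list would name every column except `id`:

```sql
-- Run repeatedly until the target row count is reached.
INSERT INTO businesses
      (type, organisation, title, given_name, other_name, family_name,
       suffix, reg_date, main_trade_name /* ..., remaining columns except id */)
SELECT type, organisation, title, given_name, other_name, family_name,
       suffix, reg_date, main_trade_name /* ..., same columns */
  FROM businesses;
```

One caveat: duplicated rows contain very few distinct terms, so FULLTEXT searches against such a dataset may look faster than they would against 10 million genuinely distinct businesses.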

Thank you.

I cannot answer your direct question, but I have experience of working with large datasets.

The first thing I would work out is what the majority use-case operations (in your case, search) would be, and then consider data storage/partitioning based on that.

The next thing is: measure, measure, and measure again. Some database systems work well with one kind of operation, others with other kinds. As the amount of data and the operational complexity increase, things that worked well may start to degrade. This is why you measure - don't try to design this without good evidence of how the db systems you're using behave under these loads.
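In MySQL specifically, a cheap way to gather that evidence is `EXPLAIN` plus the slow query log. A sketch (the query and the 0.5 s threshold are illustrative):

```sql
-- Check whether a candidate query actually uses the intended indexes
-- (the "key" column of the output shows the index chosen).
EXPLAIN
SELECT id, main_trade_name
  FROM businesses
 WHERE bus_address_state = 2
   AND phone = '0299998888';

-- Log every statement slower than 0.5 seconds so the real workload
-- can be measured over time.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.5;
```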

Then work iteratively to add more operations.

Don't try to design a best fit for all. As your design and research are distilled, you'll see places where optimisations may be needed or available. You may also find, as we have in the past, that different types of caching and indexing may be needed at different times.

Good luck - sounds like an interesting project.
