Database table with millions of rows

For example, I have some GPS devices that send data to my database every second.

Each device creates one row per second in a MySQL table with these 8 columns:

id = 12341, date = 22.02.2018, time = 22:40, latitude = 22.236558789, longitude = 78.9654582, deviceID = 24, name = device-name, someinfo = asdadadasd

So in 1 minute it creates 60 rows, in 24 hours 86400 rows, and in 1 month (31 days) 2678400 rows.

So 1 device creates about 2.6 million rows per month in my table (records are deleted every month). With more devices, that becomes 2.6 million × the number of devices.
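As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes exactly one row per second per device; the function names and the 100-device figure are illustrative only and not part of the original setup.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # one row per second per device (assumption)


def rows_per_device(days: int) -> int:
    """Rows written by a single device over `days` days."""
    return days * SECONDS_PER_DAY


def rows_total(devices: int, days: int) -> int:
    """Rows written by `devices` devices over `days` days."""
    return devices * rows_per_device(days)


print(rows_per_device(1))    # rows per device per day
print(rows_per_device(31))   # rows per device per 31-day month
print(rows_total(100, 31))   # hypothetical fleet of 100 devices
```

The per-day figure is the upper bound on what a single-device, single-day query can return.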

So my questions are:

Question 1: If I run a search like this from PHP (just for the current day and for 1 device):

SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'

the maximum possible result is 86400 rows.
Will that overload my server too much?

Question 2: If I limit it to 5 hours (18000 rows), will that be a problem for the database? Will it load the server as much as the first example, or less?

  SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000

Question 3: If I fetch just 1 result from the database, will it overload the server?

 SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1

Does that mean a table with millions of rows and a table with 1000 rows will load the server the same if I fetch just 1 result?

Millions of rows are not a problem. This is what SQL databases are designed to handle, provided you have a well-designed schema and good indexes.

Use proper types

Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See the indexing section below for more about which one to use. This is more compact, allows indexing and faster sorting, and makes date and time functions available without conversions.

Similarly, be sure to use an appropriate numeric type for latitude and longitude. You'll probably want to use numeric to ensure precision.

Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular signed int can only go up to about 2 billion.
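To put the int limit in perspective, here is a rough Python sketch of how long an auto-increment key lasts. It assumes one row per second per device and that ids are never reused after the monthly delete; the 100-device count is a made-up example.

```python
INT_MAX = 2**31 - 1      # largest signed 32-bit int: 2,147,483,647
BIGINT_MAX = 2**63 - 1   # largest signed 64-bit bigint
SECONDS_PER_DAY = 86_400


def days_until_exhausted(max_id: int, devices: int) -> float:
    """Days until an auto-increment key reaches max_id at one row per second per device."""
    return max_id / (devices * SECONDS_PER_DAY)


years_one_device = days_until_exhausted(INT_MAX, 1) / 365
days_hundred_devices = days_until_exhausted(INT_MAX, 100)

print(f"int, 1 device: about {years_one_device:.1f} years")
print(f"int, 100 devices: about {days_hundred_devices:.0f} days")
print(f"bigint, 100 devices: about {days_until_exhausted(BIGINT_MAX, 100) / 365:.2e} years")
```

A single device would take decades to exhaust an int, but a modest fleet gets there in under a year, which is why bigint is the safer default here.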

Move repeated data into another table.

Instead of storing information about the device in every row, store that in a separate table. Then store only the device's ID in your log. This will cut down on your storage size and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key; this provides referential integrity and, in MySQL's InnoDB, an index on that column.

Add indexes

Indexes are what allow a database to search through millions or billions of rows very efficiently. Be sure there are indexes on the columns you filter on frequently, such as your timestamp.

A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, which is known as a full table scan.

You can check whether your queries are using indexes with explain.
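The effect of an index on the query plan can be seen in a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in, since it needs no server; SQLite's `explain query plan` plays the role of MySQL's `explain`, and its output format differs. The table, column, and index names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table gps_logs (
        id integer primary key,
        device_id integer not null,
        created_at text not null,   -- ISO-8601 text; SQLite has no datetime type
        latitude real not null,
        longitude real not null
    )
""")

QUERY = """
    select latitude, longitude
    from gps_logs
    where device_id = 24
      and created_at >= '2018-02-22 00:00:00'
      and created_at <  '2018-02-23 00:00:00'
"""


def plan(sql: str) -> str:
    """Return the plan SQLite reports for sql, one step per line."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return "\n".join(row[3] for row in rows)  # column 3 holds the readable detail


plan_without_index = plan(QUERY)  # no usable index yet: a full table scan
conn.execute("create index device_and_time on gps_logs(device_id, created_at)")
plan_with_index = plan(QUERY)     # same query, now an index search

print(plan_without_index)
print(plan_with_index)
```

In MySQL the equivalent check is to prefix the query with `explain` and look at the reported access type and chosen key.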

datetime or time + date?

Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date() to get just the date part, like so.

select *
from gps_logs
where date(created_at) = '2018-07-14'

There's a problem, and it lies in how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index on created_at. MySQL will run date(created_at) on every single row, which means a performance-killing full table scan.

You can work around this by comparing against the bare datetime column. This will use an index and be efficient.

select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
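The difference between the two forms of the where clause can be reproduced in a small sketch. It runs against SQLite through Python's sqlite3 module rather than MySQL, so the plan text differs, but the behavior shown is the same idea: wrapping the indexed column in a function forces a scan, while a bare range comparison can use the index. The index name and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table gps_logs (
        id integer primary key,
        created_at text not null,
        latitude real,
        longitude real
    )
""")
conn.execute("create index logs_created_at on gps_logs(created_at)")
conn.executemany(
    "insert into gps_logs (created_at) values (?)",
    [("2018-07-13 23:59:59",), ("2018-07-14 00:00:00",),
     ("2018-07-14 12:30:00",), ("2018-07-15 00:00:00",)],
)

WRAPPED = "select * from gps_logs where date(created_at) = '2018-07-14'"
BARE = ("select * from gps_logs "
        "where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'")


def plan(sql: str) -> str:
    """Return the plan SQLite reports for sql, one step per line."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return "\n".join(row[3] for row in rows)


wrapped_plan = plan(WRAPPED)  # function call on the column: full scan
bare_plan = plan(BARE)        # bare column range: index search

# Both forms select the same rows; only the access path differs.
wrapped_rows = conn.execute(WRAPPED).fetchall()
bare_rows = conn.execute(BARE).fetchall()

print(wrapped_plan)
print(bare_plan)
```

Note the half-open range (`<=` on the start, `<` on the end), which includes midnight of the 14th and excludes midnight of the 15th.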

Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges that cross a day boundary becomes difficult, for example when you want a day in a different time zone. It's easy with a single column.

select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'

But it's more involved with separate date and time columns.

select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
  or  (created_date = '2018-07-13' and created_time < '10:00:00');

Or you can switch to a database with expression indexes, like PostgreSQL. An expression index, sometimes loosely called a partial index, lets you index the result of a function or expression rather than the raw column value. (Strictly speaking, a PostgreSQL partial index is a different feature that indexes only the rows matching a condition.) And PostgreSQL does a lot of things better than MySQL. This is what I recommend.

Do as much work in SQL as possible.

For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and counting them yourself, you'd use group by to group them by device and day.

select gps_device_id, count(id) as num_entries, created_at::date as day 
from gps_logs
group by gps_device_id, day;

 gps_device_id | num_entries |    day     
---------------+-------------+------------
             1 |       29310 | 2018-07-12
             2 |       23923 | 2018-07-11
             2 |       23988 | 2018-07-12

With this much data, you will want to rely heavily on group by and the associated aggregate functions such as sum, count, max, and min.
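Here is a small runnable sketch of the same per-device, per-day count, using Python's sqlite3 module so it works without a server. SQLite has no `::date` cast, so `date(created_at)` stands in for it; the five sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table gps_logs (
        id integer primary key,
        gps_device_id integer not null,
        created_at text not null
    )
""")
conn.executemany(
    "insert into gps_logs (gps_device_id, created_at) values (?, ?)",
    [
        (1, "2018-07-12 10:00:00"),
        (1, "2018-07-12 10:00:01"),
        (2, "2018-07-11 23:59:59"),
        (2, "2018-07-12 00:00:00"),
        (2, "2018-07-12 00:00:01"),
    ],
)

# The database does the counting; only one small row per (device, day) comes back.
rows = conn.execute("""
    select gps_device_id, date(created_at) as day, count(*) as num_entries
    from gps_logs
    group by gps_device_id, date(created_at)
    order by gps_device_id, day
""").fetchall()

for device_id, day, num_entries in rows:
    print(device_id, day, num_entries)
```

The application receives three summary rows here instead of five raw ones; with a real day of data, that is the difference between a handful of rows and tens of thousands.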

Avoid select *

If you must retrieve 86400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by fetching only the columns you need. This means using select only, the, specific, columns, you, need rather than select *.

Putting it all together.

In PostgreSQL

Your schema in PostgreSQL should look something like this.

create table gps_devices (
    id serial primary key,
    name text not null

    -- any other columns about the devices
);

create table gps_logs (
    id bigserial primary key,
    gps_device_id int references gps_devices(id),
    created_at timestamp not null default current_timestamp,
    latitude numeric(12,9) not null,
    longitude numeric(12,9) not null
);

create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);

A query can generally only use one index per table. Since you'll often be searching on the timestamp and device ID together, timestamp_and_device covers both columns in a single index.

date_and_device is the same thing, but it's an expression index (loosely, a partial index) on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
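The idea of indexing an expression can be tried out in a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL, and `substr(created_at, 1, 10)` as a stand-in for `created_at::date`; both substitutions are assumptions made so the example runs without a server. The point it illustrates is that the where clause must use the same expression as the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table gps_logs (
        id integer primary key,
        gps_device_id integer not null,
        created_at text not null,
        latitude real not null,
        longitude real not null
    )
""")
# Index the 'YYYY-MM-DD' prefix of the timestamp together with the device ID.
conn.execute(
    "create index date_and_device "
    "on gps_logs (substr(created_at, 1, 10), gps_device_id)"
)


def plan(sql: str) -> str:
    """Return the plan SQLite reports for sql, one step per line."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return "\n".join(row[3] for row in rows)


# Same expression as the index: the index can be searched.
same_expr = plan(
    "select latitude, longitude from gps_logs "
    "where substr(created_at, 1, 10) = '2018-07-12' and gps_device_id = 42"
)
# A different expression for the same date: the index does not apply.
other_expr = plan(
    "select latitude, longitude from gps_logs "
    "where date(created_at) = '2018-07-12' and gps_device_id = 42"
)

print(same_expr)
print(other_expr)
```

In PostgreSQL the same rule applies to the date_and_device index: queries need to filter on created_at::date for the planner to consider it.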

In MySQL

create table gps_devices (
    id int primary key auto_increment,
    name text not null

    -- any other columns about the devices
);

create table gps_logs (
    id bigint primary key auto_increment,
    gps_device_id int,
    created_at timestamp not null default current_timestamp,
    latitude numeric(12,9) not null,
    longitude numeric(12,9) not null,
    -- MySQL parses but ignores an inline "references" on a column,
    -- so the foreign key is declared as a separate constraint.
    foreign key (gps_device_id) references gps_devices(id)
);

create index timestamp_and_device on gps_logs(created_at, gps_device_id);

Very similar, but without the expression (partial) index on the date. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time columns. (MySQL 8.0.13 and later can also index an expression directly through functional key parts.)

I just read your question. For me, the answer is:

Just create a separate table for latitude and longitude, make your ID a foreign key, and save it there.

Without knowing the exact queries you want to run, I can only guess at the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.

For example, you could use the structure below:

create table device (
  id int primary key not null,
  name varchar(20),
  someinfo varchar(100)
);

create table location (
  device_id int not null,
  recorded_at timestamp not null,
  latitude double not null, -- instead of varchar; maybe float?
  longitude double not null, -- instead of varchar; maybe float?
  foreign key (device_id) references device (id)
);

create index ix_loc_dev on location (device_id, recorded_at);

If you include the exact queries (naming the columns), we can create better indexes for them.

Since your query selectivity is probably bad, your queries may run full table scans. For this case I took it a step further and used the smallest possible data types for the columns, so it will be faster:

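create table location (
  device_id tinyint not null, -- device.id must also be tinyint; at most 127 devices (255 if unsigned)
  recorded_at timestamp not null,
  latitude float not null, -- 4 bytes, but only about 7 significant digits
  longitude float not null,
  foreign key (device_id) references device (id)
);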

I can't really think of anything smaller than this.

The best I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally and put a little more resources into developing its access methods, or use a specialized database for telematics data like this.

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.