
What is an efficient way to handle large sets of data using python?

I have a large set of data consisting of sensor measurements over time, stored in multiple Excel and CSV files. A batch of files was created each day. I need to be able to extract the data for a particular day, a particular file, a particular time and a particular sensor.

The data is laid out like this in each file:

[DD/MM/YYYY hh:mm:ss, "Sensor1", "Sensor2", ..., "SensorX"]

I already have code to extract this data, but only for a specified day; it uses the pandas library to build a DataFrame from an Excel file and then performs calculations on it.

What I want is a way to rapidly extract part of this data. For example: I want the measurements from sensors 1 and 34 between 01/01/2020 00:00:30 and 05/01/2020 06:00:00.
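To make the goal concrete, here is roughly the kind of selection I have in mind on a single file. This is only a sketch: the toy DataFrame below stands in for what `pd.read_csv(..., parse_dates=[0], dayfirst=True, index_col=0)` would give me, and the sensor names are placeholders.

```python
import pandas as pd

# Toy stand-in for one day's data: a DatetimeIndex plus one column per sensor.
idx = pd.date_range("2020-01-01 00:00:00", periods=10, freq="30s")
df = pd.DataFrame({"Sensor1": range(10), "Sensor34": range(10, 20)}, index=idx)

# With a sorted DatetimeIndex, a time-range + sensor selection is one .loc call,
# inclusive of both endpoints:
subset = df.loc["2020-01-01 00:00:30":"2020-01-01 00:02:00", ["Sensor1", "Sensor34"]]
print(subset.shape)  # (4, 2)
```

This works on one file, but I don't know how to extend it cleanly across many days.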

I was thinking about creating a 3D DataFrame with the dimensions (sensor name, time of day, day), but I don't really know how to do it. Is there a way to add a day index to navigate through the DataFrame? Something like this:

(Day 1 [DD/MM/YYYY hh:mm:ss, "Sensor1", "Sensor2", ..., "SensorX"], Day 2 [DD/MM/YYYY hh:mm:ss, "Sensor1", "Sensor2", ..., "SensorX"], ... Day X [DD/MM/YYYY hh:mm:ss, "Sensor1", "Sensor2", ..., "SensorX"])
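From what I have read, pandas expresses this kind of "3D" layout with a MultiIndex rather than a true third dimension; if I understand it correctly, something like the following sketch (level and column names are my own invention) would give a day level and a time level:

```python
import pandas as pd

# MultiIndex: level 0 = day, level 1 = time of day.
idx = pd.MultiIndex.from_product(
    [["2020-01-01", "2020-01-02"], ["00:00:00", "00:00:30"]],
    names=["day", "time"],
)
df = pd.DataFrame({"Sensor1": [1, 2, 3, 4]}, index=idx)

# .loc on the outer level gives all rows for one day:
day1 = df.loc["2020-01-01"]
# .xs slices on the inner level across all days:
first_sample = df.xs("00:00:00", level="time")
print(day1.shape, first_sample.shape)  # (2, 1) (2, 1)
```

I am not sure whether this is the idiomatic approach, or whether a single flat DatetimeIndex over all days would be simpler.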

But I am afraid that even when filtering some of the data (I don't need the data from every sensor at once), the code will be very slow.

Also, the files don't all contain the same amount of data, e.g. day1.xlsx is 110x8400 while day2.xlsx is 110x5419, and so on. Is this an issue? Do I need to be particularly careful?
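My assumption is that the differing row counts would not matter if the files are simply stacked, e.g. with `pd.concat` and a key per file (toy frames below stand in for my real day1.xlsx / day2.xlsx), but please correct me if this is naive:

```python
import pandas as pd

# Toy stand-ins for two daily files with different numbers of rows.
day1 = pd.DataFrame({"Sensor1": [1.0, 2.0, 3.0]})
day2 = pd.DataFrame({"Sensor1": [4.0, 5.0]})

# concat stacks them regardless of length; keys= adds a "file" index level.
combined = pd.concat([day1, day2], keys=["day1", "day2"], names=["file", "row"])
print(combined.shape)  # (5, 1)
```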

Finally, do you have any recommendations for dealing with large amounts of data? For context, I will have to compare this experimental data with results from physical models. Is there a better way than using pandas DataFrames?

I know that many answers are somewhere on the internet, but I have a lot to do besides this, and even though I'm used to programming I have never used many Python libraries, so I already spend a lot of time reading documentation, etc.

That's why I'm asking here for insights, so thank you in advance. I hope it all makes sense; English is not my first language, so please indulge my mistakes.

Have a nice day!

Edit: I need to use Python since I can't install third-party programs, and I will later run models in the Python environment.

I need to be able to extract the data from a particular day, a particular file, a particular time and a particular sensor.

To me this has "database" written all over it. There are several time-series-specific DBs, such as InfluxDB, TimescaleDB or IoTDB. I would recommend you look at IoTDB, as it seems to target exactly your use case, i.e. sensor measurements, as they state on their front page:

IoTDB is a specialized database management system for time series data generated by a network of IoT devices with low computational power.

Alternatively you can use a regular SQL database, for example SQLite if you think you will not have huge amounts of data and will not need other people to access it from a remote server. If you want something more powerful, PostgreSQL is always a good choice too.

With an SQL database you will be able to query your sensor measurements in many different ways, combining different filters and grouping/aggregating the data if you need that too.
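As a minimal sketch of what this buys you, using only Python's built-in `sqlite3` module: store everything in one "long" table of (timestamp, sensor, value), and every filter you mentioned becomes a single WHERE clause. The table and column names here are my own choice, and I'm assuming timestamps are stored in ISO `YYYY-MM-DD hh:mm:ss` form so that string comparison matches chronological order.

```python
import sqlite3

# In-memory DB for the example; use a file path for a persistent database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (ts TEXT, sensor TEXT, value REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2020-01-01 00:00:30", "Sensor1", 1.5),
        ("2020-01-02 12:00:00", "Sensor34", 2.5),
        ("2020-01-06 00:00:00", "Sensor1", 3.5),
    ],
)

# Measurements from Sensor1 and Sensor34 between two timestamps:
rows = con.execute(
    "SELECT ts, sensor, value FROM readings "
    "WHERE sensor IN ('Sensor1', 'Sensor34') "
    "AND ts BETWEEN '2020-01-01 00:00:30' AND '2020-01-05 06:00:00'"
).fetchall()
print(len(rows))  # 2
```

An index on `(sensor, ts)` would keep such queries fast as the table grows, and `pandas.read_sql_query` can load any query result straight into a DataFrame for your model comparisons.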
