
What is the fastest way to process large .asc files?

I currently have .asc log files that have been generated from CANoe. I am using Python to analyze these files. These files are pretty big (anywhere from 0.5 GB to 2 GB). To read/analyze the data I am converting it to a dataframe, using the following lines of code:

    import can
    import pandas as pd

    log = can.ASCReader(filePath)
    log = [*log]  # materialize the iterator into a list
    df_data = [{'timestamp': m.timestamp, 'data': m.data} for m in log]
    df = pd.DataFrame(df_data)

Through my analysis, the part that is taking the longest is converting the iterator to a list. I am wondering if there is a more efficient way of doing that. I am also open to doing the entire process a whole new way if it is faster. Currently a 0.6 GB .asc file takes about 19 minutes to run. Any help/suggestions would be appreciated!

The most time-consuming part is most likely reading from disk. This cannot be avoided.

However, you can make sure that you do not put unnecessary data into memory or copy it around.

Try the following:

    import operator

    log = can.ASCReader(filePath)
    # Name the columns explicitly; attrgetter yields plain tuples
    df = pd.DataFrame(data=map(operator.attrgetter('timestamp', 'data'), log),
                      columns=['timestamp', 'data'])

ASCReader will return an iterator, i.e., it does not read any data until you actually iterate over log.

As you are only interested in the values of timestamp and data, we declare an attrgetter for these two attributes. That is a function that takes an object and returns just those two attributes of that object.
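To illustrate, operator.attrgetter called with several attribute names returns a function that yields those attributes as a tuple. A minimal stdlib-only sketch, using a hypothetical namedtuple stand-in (Msg) for a CAN message rather than a real python-can object:

```python
import operator
from collections import namedtuple

# Hypothetical stand-in exposing the same two attributes as a CAN message
Msg = namedtuple('Msg', ['timestamp', 'data'])
m = Msg(timestamp=1.5, data=b'\x01\x02')

# attrgetter with two names returns both attributes as one tuple
get_ts_data = operator.attrgetter('timestamp', 'data')
print(get_ts_data(m))  # -> (1.5, b'\x01\x02')
```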

To apply this attrgetter to the log, we use map. map applies the attrgetter to each element of log. map also returns an iterator, i.e., it will not read or store any data until consumed.

Finally, we pass the map to pandas as the data source for constructing a DataFrame.

Done this way, the approach copies the least data around and avoids handling unnecessary data. YMMV.
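The whole streaming pattern can be sketched end to end with stand-in objects. This is only a demonstration of the shape of the pipeline: a generator (here built from a hypothetical FakeMsg namedtuple) plays the role of the lazy ASCReader, since the data source is the only part that differs:

```python
import operator
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the messages ASCReader yields lazily
FakeMsg = namedtuple('FakeMsg', ['timestamp', 'data'])
log = (FakeMsg(timestamp=float(i), data=bytes([i])) for i in range(3))

# One pass over the iterator, no intermediate list in between
df = pd.DataFrame(data=map(operator.attrgetter('timestamp', 'data'), log),
                  columns=['timestamp', 'data'])
print(df)
```

Nothing is materialized until pandas consumes the map, so memory usage stays proportional to the final DataFrame rather than to extra intermediate copies.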
