
Reading and graphing data read from huge files

We have pretty large files, on the order of 1-1.5 GB combined (mostly log files), with raw data that is easily parseable to a csv, which is subsequently supposed to be graphed to generate a set of graph images.

Currently, we are using bash scripts to turn the raw data into a csv file containing just the numbers that need to be graphed, and then feeding it into a gnuplot script. But this process is extremely slow. I tried to speed up the bash scripts by replacing some piped cut, tr, etc. commands with a single awk command; although this improved the speed, the whole thing is still very slow.

So, I am starting to believe there are better tools for this process. I am currently looking to rewrite this process in python+numpy or R. A friend of mine suggested using the JVM; if I am to do that, I will use clojure, but I am not sure how the JVM will perform.

I don't have much experience in dealing with these kinds of problems, so any advice on how to proceed would be great. Thanks.

Edit: Also, I will want to store (to disk) the generated intermediate data, i.e., the csv, so I don't have to re-generate it should I want a different-looking graph.

Edit 2: The raw data files have one record per line, with fields separated by a delimiter (|). Not all fields are numbers. Each field I need in the output csv is obtained by applying a certain formula to the input records, which may use multiple fields from the input data. The output csv will have 3-4 fields per line, and I need graphs that plot fields 1-2, 1-3, and 1-4 in a (maybe) bar chart. I hope that gives a better picture; a sketch of that kind of transform follows.
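For illustration only, a minimal sketch of such a per-record transform; the field indices, the formula, and the file names are hypothetical placeholders, since the actual ones aren't given above:

# hypothetical per-record transform: split on the | delimiter, apply a
# formula over several input fields, emit a couple of csv fields
def transform(line):
    fields = line.rstrip("\n").split("|")
    key = fields[0]
    value = float(fields[3]) / float(fields[5])  # made-up formula
    return "%s,%.6f" % (key, value)

with open("input.log") as src, open("out.csv", "w") as dst:
    for line in src:
        dst.write(transform(line) + "\n")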

Edit 3: I have modified @adirau's script a little and it seems to be working pretty well. I have come far enough that I am reading data, sending it to a pool of processor threads (pseudo-processing: appending the thread name to the data), and aggregating it into an output file through another collector thread; a sketch of that arrangement is below.
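For reference, a minimal sketch of that arrangement using the standard library queue module: a pool of workers doing the pseudo-processing, plus a single collector thread that owns the output file. The file names and pool size are placeholders:

import threading
import queue

work_q = queue.Queue(maxsize=1024)   # raw lines to process
out_q = queue.Queue(maxsize=1024)    # processed csv rows

def worker():
    while True:
        line = work_q.get()
        if line is None:
            break
        # pseudo-processing: tag the line with the thread name
        out_q.put("%s,%s" % (threading.current_thread().name, line.strip()))

def collector(path):
    with open(path, "w") as out:
        while True:
            row = out_q.get()
            if row is None:
                break
            out.write(row + "\n")

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
coll = threading.Thread(target=collector, args=("out.csv",))
coll.start()

with open("input.log") as f:
    for line in f:
        work_q.put(line)
for _ in workers:
    work_q.put(None)     # stop the workers
for w in workers:
    w.join()
out_q.put(None)          # then stop the collector
coll.join()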

PS: I am not sure about the tagging of this question; feel free to correct it.

python sounds like a good choice because it has a good threading API (the implementation is questionable though), matplotlib and pylab. I'm missing some more specs from your end, but maybe this could be a good starting point for you: matplotlib: async plotting with threads.

I would go for a single thread handling the bulk disk i/o reads, sync-queueing to a pool of threads for the data processing (if you have fixed record lengths, things may get faster by precomputing the read offsets and passing just the offsets to the threadpool). With the diskio thread I would mmap the datasource files and read a predefined number of bytes plus one more read, to eventually grab the last bytes up to the end of the current datasource input line; that number of bytes should be chosen somewhere near your average input line length. Next comes pool feeding via the queue, and the data processing / plotting that takes place in the threadpool. I don't have a good picture here (of what you are plotting exactly), but I hope this helps.

EDIT: there's file.readlines([sizehint]) to grab multiple lines at once; it may not be so speedy though, because the docs say it uses readline() internally.
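For example, batched reading with a sizehint looks like this; readlines returns whole lines totalling roughly that many bytes per call (the file name and batch size are placeholders):

with open("input.log") as f:
    while True:
        batch = f.readlines(65536)  # ~64 KB of whole lines per call
        if not batch:
            break
        for line in batch:
            pass  # hand the line (or the whole batch) to a worker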

EDIT: a quick code skeleton

import sys
import mmap
import threading
from threading import Thread
from collections import deque


class processor(Thread):
    """
    processor gets a batch of lines at a time from the diskio thread
    """
    def __init__(self, q):
        Thread.__init__(self, name="plotter")
        self._queue = q

    def run(self):
        # consume batches until the producer sends the None sentinel
        while True:
            batch = self._queue.get()
            if batch is None:
                break
            # plot each parsed record, one at a time
            for data in self.feed(batch):
                self.plot(data)
            # sanitizer exceptions following, maybe

    def parseline(self, line):
        """return a data struct ready for plotting; note that lines
        come out of the mmap as bytes, so decode before splitting"""
        raise NotImplementedError

    def feed(self, databuf):
        # yield one-at-a-time datastructs ready to go for plotting
        for line in databuf:
            yield self.parseline(line)

    def plot(self, data):
        """integrate
        https://www.esclab.tw/wiki/index.php/Matplotlib#Asynchronous_plotting_with_threads
        maybe
        """


class sharedq(object):
    """a simple bounded blocking queue; the standard library's
    queue.Queue(maxsize=...) does the same job and may be preferable"""
    def __init__(self, maxsize=8192):
        self.queue = deque()
        self.barrier = threading.RLock()
        self.read_c = threading.Condition(self.barrier)
        self.write_c = threading.Condition(self.barrier)
        self.msz = maxsize

    def put(self, item):
        with self.barrier:
            while len(self.queue) >= self.msz:
                self.write_c.wait()
            self.queue.append(item)
            self.read_c.notify()

    def get(self):
        with self.barrier:
            while not self.queue:
                self.read_c.wait()
            item = self.queue.popleft()
            self.write_c.notify()
            return item


q = sharedq()
numbytes = 1024  # rough batch size in bytes; pick something near your average line length
workers = [processor(q) for _ in range(8)]
for p in workers:
    p.start()

for fn in sys.argv[1:]:
    with open(fn, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # mmap objects have readline() but no readlines(), so batch whole
        # lines by hand until roughly numbytes have been collected
        batch, batched = [], 0
        for line in iter(mm.readline, b""):
            batch.append(line)
            batched += len(line)
            if batched >= numbytes:
                q.put(batch)
                batch, batched = [], 0
        if batch:
            q.put(batch)
        mm.close()

# cleanup: one sentinel per worker, then wait for the queue to drain
for _ in workers:
    q.put(None)
for p in workers:
    p.join()

I think python+Numpy would be the most efficient way, in terms of both speed and ease of implementation. Numpy is highly optimized, so the performance is decent, and python would ease up the algorithm implementation part.

This combo should work well for your case, provided you optimize the loading of the file into memory. Try to find the middle ground: a data block that isn't too large, but large enough to minimize the read and write cycles, because that is what will slow the program down. A rough sketch of that block-wise loading follows.
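As a rough sketch of that block-wise approach (the | delimiter comes from the question; the file name and column indices are assumptions for illustration):

import numpy as np

# parse |-delimited records in blocks and accumulate the numeric
# columns as numpy arrays, avoiding one giant read
blocks = []
with open("input.log") as f:
    while True:
        lines = f.readlines(1 << 20)   # ~1 MB of whole lines per block
        if not lines:
            break
        rows = [ln.split("|") for ln in lines]
        # assumed: columns 3 and 5 hold the numbers the formula needs
        blocks.append(np.array([(float(r[3]), float(r[5])) for r in rows]))
data = np.concatenate(blocks)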

If you feel that this needs more speeding up (which I sincerely doubt), you could use Cython to speed up the sluggish parts.
