
Reading CSV files from RAM

Situation: I have a CVD (ClamAV Virus Database) file loaded into RAM using mmap. Every line in the CVD file has the same format as a line in a CSV file, except that the fields are delimited by ':'. Below is a snippet of the code:

def mapping():
    with open("main.cvd", 'rt') as f:
        global mapper
        mapper = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)

def compare(hashed):
    for row in csv.reader(mapper, dialect='delimit'):
        if row[1] == hashed:
            print('Found!')

Problem: When run, it raises the error _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Question: How do I read a CSV file as text once it has been loaded into memory?

Additional information 1: I have tried using StringIO, but it raises TypeError: initial_value must be str or None, not mmap.mmap

Additional information 2: I need the file to be in RAM for faster access, and I cannot afford the time it takes to read it line by line using functions such as readline().

The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its next() method is called".
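As a quick illustration of that requirement, here is a minimal sketch showing csv.reader accepting both a plain list of strings and a generator expression that decodes bytes on the fly. The sample rows are made up for the example and only imitate the ':'-delimited layout described in the question.

```python
import csv

# Hypothetical sample data in a ':'-delimited layout (not real signatures).
lines = ["name1:hash1:10", "name2:hash2:20"]

# A plain list of str works, because iterating over it yields strings.
rows_from_list = list(csv.reader(lines, delimiter=":", quoting=csv.QUOTE_NONE))

# A generator expression that decodes bytes to str on the fly works too.
raw = [b"name1:hash1:10\n", b"name2:hash2:20\n"]
rows_from_gen = list(
    csv.reader((b.decode("utf-8") for b in raw),
               delimiter=":", quoting=csv.QUOTE_NONE)
)

print(rows_from_list)
print(rows_from_gen)
```

Passing the undecoded bytes objects directly is what triggers the "iterator should return strings, not bytes" error from the question.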

In other words, the "object" can be a generator function or a generator expression. In the code below I've implemented a generator function called mmap_file_reader(), which converts the bytes in the memory map into character strings and yields each line it detects.

I made the mmap.mmap constructor call conditional so it would work on Windows, too. This shouldn't be necessary if you use the access= keyword instead of the prot= keyword, but I couldn't test that, so I did it as shown.

import csv
import mmap
import sys

def mapping():
    with open("main.cvd", 'rt') as f:
        global mapper
        if sys.platform.startswith('win32'):
            mmf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # windows
        else:
            mmf = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # unix
        mapper = mmap_file_reader(mmf)
        csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)

def mmap_file_reader(mmf):
    '''Yield successive lines of the given memory-mapped file as strings.

    Generator function which reads and converts the bytes of the given mmapped file
    to strings and yields them one line at a time.
    '''
    while True:
        line = mmf.readline()
        if not line:  # EOF?
            return
        yield str(line, encoding='utf-8')  # decode the bytes of the line just read into a string

def compare(hashed):
    for row in csv.reader(mapper, dialect='delimit'):
        if row[1] == hashed:
            print('Found!')
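
One thing to watch for with the listing above: a generator can only be consumed once, so a second call to compare() would find the memory map already at EOF. The following is a sketch of one possible variant, not the original answer's code. It uses the access=mmap.ACCESS_READ keyword, which the mmap module accepts on both Windows and Unix, and rewinds the map before every search. The helper names (open_mapping, iter_rows, contains_hash) and the temporary demo file are assumptions made for illustration, and UTF-8 is assumed as the encoding.

```python
import csv
import mmap
import os
import tempfile

def open_mapping(path):
    """Map the whole file read-only; access= works on Windows and Unix."""
    with open(path, "rb") as f:
        # The mapping remains usable after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def iter_rows(mmf, encoding="utf-8"):
    """Return a csv.reader that starts at the beginning of the mapping."""
    mmf.seek(0)  # rewind so repeated searches see the whole file
    # iter(callable, sentinel) keeps calling readline() until it returns b"".
    lines = (line.decode(encoding) for line in iter(mmf.readline, b""))
    return csv.reader(lines, delimiter=":", quoting=csv.QUOTE_NONE)

def contains_hash(mmf, hashed):
    """Return True if any row has `hashed` in its second column."""
    return any(len(row) > 1 and row[1] == hashed for row in iter_rows(mmf))

# Demo on a small temporary file standing in for the real database.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"sigA:aaa111:10\nsigB:bbb222:20\n")
    demo_path = tmp.name

mmf = open_mapping(demo_path)
first = contains_hash(mmf, "bbb222")
second = contains_hash(mmf, "bbb222")  # still found, thanks to seek(0)
missing = contains_hash(mmf, "nope")
mmf.close()
os.remove(demo_path)

print(first, second, missing)
```

Returning a boolean instead of printing inside the loop makes the search easier to reuse, and any() stops at the first matching row.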
