Python read huge file line by line with utf-8 encoding

I want to read some quite huge files (to be precise: the Google Ngram 1-gram dataset) and count how many times each character occurs. Now I have written this script:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)

which works fine until it reaches approximately line 700,000 of the first file, where I get this error:

../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
  File "charactercounter.py", line 5, in <module>
    for line in fileinput.input(files):
  File "C:\Python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: character maps to <undefined>
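
(For context: byte 0x8D has no assigned character in Windows code page 1252, the locale default that fileinput fell back to here, so the decode fails regardless of the surrounding data:

>>> b'\x8d'.decode('cp1252')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>

)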

To solve this I searched the web a bit and came up with this code:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files, False, '', 0, 'r', fileinput.hook_encoded('utf-8')):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)

but the hook I now use tries to read the entire 990 MB file into memory at once, which more or less crashes my PC. Does anyone know how to rewrite this code so that it actually works?

PS: the code hasn't even run all the way through yet, so I don't even know whether it does what it is supposed to do, but for that to happen I first need to fix this bug.

Oh, and I use Python 3.2.

I do not know why fileinput does not work as expected.

I suggest you use the open function instead. The return value can be iterated over and will return lines, just like fileinput.

The code will then be something like:

for filename in files:
    print(filename)
    with open(filename, encoding="utf-8") as f:
        for filelineno, line in enumerate(f, start=1):  # 1-based, like fileinput.filelineno()
            line = line.strip()
            data = line.split('\t')
            # ...

Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).

The problem is that fileinput doesn't use file.xreadlines(), which reads line by line, but file.readlines(bufsize), which reads bufsize bytes at once (and turns that into a list of lines). You are providing 0 for the bufsize parameter of fileinput.input() (which is also the default value). A bufsize of 0 means that the whole file is buffered at once.

Solution: provide a reasonable bufsize.
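
For example (a minimal sketch against Python 3.2's signature fileinput.input(files, inplace, backup, bufsize, mode, openhook); note that later Python versions deprecated and eventually removed the bufsize parameter):

import fileinput

files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]

# Buffer roughly 1 MB of lines at a time instead of the whole file.
for line in fileinput.input(files, False, '', 1024 * 1024, 'r', fileinput.hook_encoded('utf-8')):
    pass  # process each decoded line here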

This works for me: you can use "utf-8" in the hook definition. I used it on a 50 GB file with 200M lines with no problem.

fi = fileinput.FileInput(openhook=fileinput.hook_encoded("iso-8859-1"))
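
For instance, combined with the question's file list (a sketch; on Python 3.2 you would still want a sensible bufsize here, since FileInput buffers the same way as fileinput.input(), and the bufsize parameter was removed in later Python versions):

import fileinput

files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]

fi = fileinput.FileInput(files, bufsize=1024 * 1024, openhook=fileinput.hook_encoded('utf-8'))
for line in fi:
    pass  # each line arrives already decoded as UTF-8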

Could you try not to read the whole file at once, but read a part of it as binary, then decode(), then process it, and then call the function again to read another part?
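
That approach can work; here is a minimal sketch of it (iter_lines is a hypothetical helper, and UTF-8 input is assumed; the incremental decoder from codecs handles multi-byte characters split across chunk boundaries):

import codecs

def iter_lines(path, chunk_size=1024 * 1024):
    # Hypothetical helper: read binary chunks, decode incrementally,
    # and yield complete lines without loading the whole file.
    decoder = codecs.getincrementaldecoder('utf-8')()
    pending = ''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            pending += decoder.decode(chunk, final=not chunk)
            lines = pending.split('\n')
            pending = lines.pop()  # the last element may be a partial line
            for line in lines:
                yield line
            if not chunk:
                if pending:
                    yield pending  # trailing line without a newline
                return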

I don't know if the one I have is the latest version (and I don't remember how I read them), but...

$ file -i googlebooks-eng-1M-1gram-20090715-0.csv 
googlebooks-eng-1M-1gram-20090715-0.csv: text/plain; charset=us-ascii

Have you tried fileinput.hook_encoded('ascii') or fileinput.hook_encoded('latin_1')? Not sure why this would make a difference, since I think these are just subsets of Unicode with the same mapping, but worth a try.

EDIT: I think this might be a bug in fileinput; neither of these works.

If you are worried about memory usage, why not read line by line using readline()? This will get rid of the memory issues you are running into. Currently you are reading the full file before performing any actions on the file object; with readline() you are not saving the data, merely searching it on a per-line basis.

def charCount1(_file, _char):
    # Caches the entire file in memory with a single read() call.
    result = []
    f = open(_file, encoding="utf-8")
    data = f.read()
    f.close()
    for index, line in enumerate(data.split("\n")):
        if _char in line:
            result.append(index)
    return result

def charCount2(_file, _char):
    # Reads one line at a time, so memory use stays roughly constant.
    result = []
    count = 0
    f = open(_file, encoding="utf-8")
    while True:
        line = f.readline()
        if not line:
            break
        if _char in line:
            result.append(count)
        count += 1
    f.close()
    return result

I didn't have a chance to really look over your code, but the above samples should give you an idea of how to make the appropriate changes to your structure. charCount1() demonstrates your method, which caches the entire file in a single call to read(). I tested your method on a 400+ MB text file and the python.exe process went as high as 900+ MB. When you run charCount2(), the python.exe process shouldn't exceed more than a few MB (provided you haven't bulked up the size with other code). ;)
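
For example (a hypothetical call against one of the question's files):

matches = charCount2('../../datasets/googlebooks-eng-all-1gram-20090715-0.csv', 'a')
print(len(matches), 'lines contain "a"')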
