
Opening a large JSON file in Python with no newlines for CSV conversion (Python 2.6.6)

I am attempting to convert a very large JSON file to CSV. I have been able to convert a small file of this type to a 10-record (for example) CSV file. However, when trying to convert a large file (on the order of 50000 rows in the CSV file) it does not work. The data was created by a curl command with -o pointing to the JSON file to be created. The file that is output does not have newline characters in it. The CSV file will be written with csv.DictWriter() and (where data is the JSON file input) has the form

rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])

I then loop through the range of the rows and columns to get the CSV dictionary entries

csvkey = data['MainKey'][recno]['Fields'][colno]['name']
csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']

I attempted to use the answers from other questions, but they did not work with a big file (du -m bigfile.json = 157) and the files that I want to handle are even larger.

An attempt to get the size of each line

myfile = open('file.json', 'r')
line = myfile.readline()
print len(line)

shows that this reads the entire file as one full string. Thus, a small file shows a length of 67744, while a larger file shows 163815116.

An attempt to read the data directly with

data=json.load(infile)

gives the error that other questions have discussed for large files.

An attempt to use the

def json_parse(self, fileobj, decoder=JSONDecoder(), buffersize=2048):
  # ... (body abridged in the question) ...
  yield results

as shown in another answer, works with a 72 kb file (10 rows, 22 columns) but seems to either lock up or take an interminable amount of time for an intermediate-sized file of 157 MB (from du -m bigfile.json).
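For reference, here is a minimal sketch of that chunked generator, reconstructed from the raw_decode() pattern that answer describes (only the signature comes from the question; the body, the buffer handling, and the standalone form without self are assumptions):

from functools import partial
from json import JSONDecoder

def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
  # Read the file in fixed-size chunks; raw_decode() reports how much of the
  # buffer it consumed, so any trailing partial object is kept for the next pass.
  buffer = ''
  for chunk in iter(partial(fileobj.read, buffersize), ''):
    buffer += chunk
    while buffer:
      try:
        result, index = decoder.raw_decode(buffer)
        yield result
        buffer = buffer[index:].lstrip()
      except ValueError:
        # Not enough data yet for a complete object; read another chunk
        break

The question wraps a function of this shape as a method on its own class, which is why the snippet above carries a self argument.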

Note that a debug print shows that each chunk is 2048 bytes, as specified by the default input argument. It appears that it is trying to go through the entire 163815116-byte string (shown by the len above) in 2048-byte chunks. If I change the chunk size to 32768, simple math shows that it would take about 5,000 passes through the loop to process the file.

A change to a chunk size of 524288 exits the loop approximately every 11 chunks, but it should still take approximately 312 chunks to process the entire file.
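For reference, the arithmetic behind those chunk counts, using the len() figure above as the file size (the variable name is just for illustration):

filesize = 163815116       # total length reported by len() above
print filesize // 32768    # 4999 -> roughly the 5,000 passes mentioned above
print filesize // 524288   # 312  -> roughly 312 chunks at a 512 KB buffer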

If I can get it to stop at the end of each row item, I would be able to process that row and send it to the CSV file based on the form shown below.

vi on the small file shows that it is of the form

{"MainKey":[{"Fields":[{"Value": {'value':val}, 'name':'valname'}, {'Value': {'value':val}, 'name':'valname'}}], (other keys)},{'Fields' ... }] (other keys on MainKey level) }

I cannot use ijson, as I must set this up for systems for which I cannot import additional software.

I wound up using a chunk size of 8388608 (0x800000 hex) in order to process the files. I then processed the lines that had been read in as part of the loop, keeping count of rows processed and rows discarded. At each process function, I added the number to the totals so that I could keep track of the total records processed.

This appears to be the way that it needs to go.

Next time a question like this is asked, please emphasize that a large chunk size must be specified, not the 2048 shown in the original answer.

The loop goes

first = True
for data in self.json_parse(inf):
  records = len(data['MainKey'])
  columns = len(data['MainKey'][0]['Fields'])
  if first:
    # Initialize output as DictWriter
    ofile, outf, fields = self.init_csv(csvname, data, records, columns)
    first = False
  reccount, errcount = self.parse_records(outf, data, fields, records)
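For context, here is a minimal sketch of what init_csv might look like, assuming it builds the field names from the first record and returns the open file, the csv.DictWriter and the field list that the loop above expects (the name init_csv and its arguments come from the code above; the body, written as a plain function rather than a method, is an assumption):

import csv

def init_csv(csvname, data, records, columns):
  # Header names come from the 'name' entry of each column in the first record
  fields = [data['MainKey'][0]['Fields'][col]['name'] for col in range(columns)]
  ofile = open(csvname, 'wb')                     # binary mode for the csv module on Python 2
  outf = csv.DictWriter(ofile, fieldnames=fields)
  outf.writerow(dict((f, f) for f in fields))     # header row; writeheader() needs Python 2.7+
  return ofile, outf, fields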

Within the parsing routine

for rec in range(records):
  currec = data['MainKey'][rec]
  # If each column count can be different
  columns = len(currec['Fields'])
  retval, valrec = self.build_csv_row(currec, columns, fields)

To parse the columns, use

for col in range(columns):
  dataname = currec['Fields'][col]['name']
  dataval = currec['Fields'][col]['Values']['value']
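
A minimal sketch of how build_csv_row might use those two lookups to assemble one output row for the DictWriter (the name build_csv_row and the (retval, valrec) return shape come from the code above; the body is an assumption):

def build_csv_row(currec, columns, fields):
  valrec = {}
  for col in range(columns):
    dataname = currec['Fields'][col]['name']
    dataval = currec['Fields'][col]['Values']['value']
    if dataname not in fields:
      # Column name was not in the header built by init_csv; discard this row
      return False, valrec
    valrec[dataname] = dataval
  return True, valrec

parse_records would then call outf.writerow(valrec) when retval is True and count the row as processed, otherwise count it as discarded, matching the reccount and errcount totals kept in the main loop.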

Thus the references now work and the processing is handled correctly. The large chunk size apparently allows the processing to be fast enough to handle the data while being small enough not to overload the system.
