
Processing a big file in python (>60gb)

I have a text file (>= 60 GB) whose records look like this:

{"index": {"_type": "_doc", "_id": "bLcy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2135,\"id\":816704468,\"access_hash\":\"788468819702098896\",\"first_name\":\"a\",\"last_name\":\"b\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusOffline\",\"was_online\":132}}","phone":"12","@version":"1","typ":"telegram_contacts","access_hash":"123","id":816704468,"@timestamp":"2020-01-26T13:53:29.467Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","type":"redis","flags":2135,"host":"ubuntu","imported_from":"telegram_contacts"}

{"index": {"_type": "_doc", "_id": "Z7cy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2143,\"id\":323586643,\"access_hash\":\"8315858910992970114\",\"first_name\":\"bv\",\"last_name\":\"nj\",\"username\":\"kj\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusRecently\"}}","phone":"123","@version":"1","typ":"telegram_contacts","access_hash":"8315858910992970114","id":323586643,"@timestamp":"2020-01-26T13:53:29.469Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","username":"mbnab","type":"redis","flags":2143,"host":"ubuntu","imported_from":"telegram_contacts"}

I have a few questions about this:

  1. Is this a valid JSON file?
  2. Can Python handle a file of this size? Or should I somehow convert it to an Access or Excel file?

These are some SO posts I found useful:

But I still need help.

You can process the file line by line and extract the information you need.

with open('largefile.txt','r') as f:
    for line in f:
        # Extract what you need from that line of text here
        print(line)
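
Iterating over a file object reads one line at a time, so memory use depends on the longest line rather than on the 60 GB total. As a minimal sketch of that idea, the helper below (the name `iter_nonblank_lines` is made up for illustration) lazily skips the blank separator lines seen in the sample. It accepts any iterable of lines, so the in-memory `io.StringIO` here only stands in for the real file.

```python
import io


def iter_nonblank_lines(lines):
    """Lazily yield non-blank lines with surrounding whitespace removed."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            yield stripped


# Small in-memory stand-in for the real file. With the real data you would
# pass the object returned by open('largefile.txt', 'r') instead.
sample = io.StringIO('{"a": 1}\n\n{"b": 2}\n')
result = list(iter_nonblank_lines(sample))
print(result)
```

Because the helper is a generator, nothing is read until you loop over it, and only one line is held at a time (calling `list()` as above is fine for a tiny sample but would defeat the purpose on the full file).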

For example, to interpret each line as JSON while reading the file line by line:

import json

with open('largefile.txt', 'r') as f:
    for line in f:
        # For example, to interpret the string as json, and read
        # it in as a dictionary, do
        if line.strip():  # check there is something on the line
            data = json.loads(line)
            # in your case, "message" holds JSON encoded as a string,
            # so decode it a second time to get a nested dictionary
            if 'message' in data:
                data['message'] = json.loads(data['message'])
            # extract information you need here
I expect there is more work to do to extract the information you need, but I hope this gets you started. Good luck!
