
Processing a big file in Python (>60 GB)

I have a text file (>= 60 GB) whose records look like this:

{"index": {"_type": "_doc", "_id": "bLcy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2135,\"id\":816704468,\"access_hash\":\"788468819702098896\",\"first_name\":\"a\",\"last_name\":\"b\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusOffline\",\"was_online\":132}}","phone":"12","@version":"1","typ":"telegram_contacts","access_hash":"123","id":816704468,"@timestamp":"2020-01-26T13:53:29.467Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","type":"redis","flags":2135,"host":"ubuntu","imported_from":"telegram_contacts"}

{"index": {"_type": "_doc", "_id": "Z7cy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2143,\"id\":323586643,\"access_hash\":\"8315858910992970114\",\"first_name\":\"bv\",\"last_name\":\"nj\",\"username\":\"kj\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusRecently\"}}","phone":"123","@version":"1","typ":"telegram_contacts","access_hash":"8315858910992970114","id":323586643,"@timestamp":"2020-01-26T13:53:29.469Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","username":"mbnab","type":"redis","flags":2143,"host":"ubuntu","imported_from":"telegram_contacts"}

I have a few questions about this:

  1. Is this a valid JSON file?
  2. Can Python handle a file of this size, or should I somehow convert it to an Access or Excel file?

I found some SO posts useful, but I still need help.

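You can process the file line by line and extract the information you need. Iterating over the file object reads one line at a time, so the whole file never has to fit in memory.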

with open('largefile.txt','r') as f:
    for line in f:
        # Extract what you need from that line of text here
        print(line)
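
On the question of validity, a quick check on a small sample suggests why reading line by line is the right fit: the file as a whole does not parse as one JSON document, but each non-blank line does (a layout often called newline-delimited JSON). The sketch below uses a shortened, made-up sample in the same shape as the records in the question; `is_single_json_document` is a hypothetical helper name.

```python
import json

# Hypothetical two-record sample in the same shape as the file (values shortened).
sample = (
    '{"index": {"_type": "_doc", "_id": "a1"}}\n'
    '{"message": "{\\"id\\": 1}", "phone": "12"}\n'
    '\n'
    '{"index": {"_type": "_doc", "_id": "a2"}}\n'
    '{"message": "{\\"id\\": 2}", "phone": "123"}\n'
)

def is_single_json_document(text):
    """Return True if the whole text parses as exactly one JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# The whole text holds several top-level objects, so it is not one JSON document.
whole_file_ok = is_single_json_document(sample)

# Each non-blank line, taken on its own, is a complete JSON object.
lines_ok = all(
    is_single_json_document(line)
    for line in sample.splitlines()
    if line.strip()
)

print(whole_file_ok, lines_ok)
```

Run this only on a small excerpt. Calling `json.loads` on the full 60 GB file would try to load all of it into memory at once.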

For example, to read the contents, you can parse each non-empty line as JSON:

import json

with open('largefile.txt', 'r') as f:
    for line in f:
        # Interpret the string as JSON and read it in as a dictionary.
        if line.strip():  # skip the blank lines between records
            data = json.loads(line)
            # In your case the value of "message" is itself a JSON string,
            # so decode it a second time to get a nested dictionary.
            if 'message' in data:
                data['message'] = json.loads(data['message'])
            # extract the information you need here
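
In the sample from the question, each record is preceded by an `{"index": ...}` line carrying its `_id`. A minimal sketch of pairing the two while still reading one line at a time is shown here. It assumes the file is UTF-8 and that every record line directly follows its `index` line, which the question does not confirm. `iter_records` is a hypothetical helper name.

```python
import json

def iter_records(path):
    """Yield (doc_id, record) pairs, reading the file one line at a time."""
    pending_id = None
    with open(path, "r", encoding="utf-8") as f:  # encoding is an assumption
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip the blank lines between records
            data = json.loads(line)
            if "index" in data and "message" not in data:
                # Metadata line: remember the _id for the record that follows.
                pending_id = data["index"].get("_id")
                continue
            if isinstance(data.get("message"), str):
                # "message" holds JSON encoded as a string; decode it again.
                data["message"] = json.loads(data["message"])
            yield pending_id, data
            pending_id = None
```

Because this is a generator, a loop such as `for doc_id, record in iter_records('largefile.txt'):` keeps only one record in memory at a time.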

I expect it will take more work to extract the information you need, but I hope this gets you started. Good luck!
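
As one possible next step, the same line-by-line loop can write selected fields to a CSV file without holding the data in memory. This is only a sketch: the function name `export_fields`, the chosen field names, and the UTF-8 encoding are assumptions for illustration. Note that Excel caps a worksheet at 1,048,576 rows, so a file of this size is unlikely to fit in a single sheet.

```python
import csv
import json

def export_fields(src_path, dst_path, fields=("id", "phone", "username")):
    """Stream src_path and write the chosen fields of each record to a CSV file.

    Returns the number of records written.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(fields)  # header row
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)
            if "message" not in data:
                continue  # skip the {"index": ...} metadata lines
            # Missing fields (e.g. no "username") become empty cells.
            writer.writerow([data.get(name, "") for name in fields])
            written += 1
    return written
```

A call such as `export_fields('largefile.txt', 'contacts.csv')` would then produce one CSV row per record.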

