
Processing a big file in Python (>60 GB)

I have a text file (>= 60 GB) and the records in it look like this:

{"index": {"_type": "_doc", "_id": "bLcy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2135,\"id\":816704468,\"access_hash\":\"788468819702098896\",\"first_name\":\"a\",\"last_name\":\"b\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusOffline\",\"was_online\":132}}","phone":"12","@version":"1","typ":"telegram_contacts","access_hash":"123","id":816704468,"@timestamp":"2020-01-26T13:53:29.467Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","type":"redis","flags":2135,"host":"ubuntu","imported_from":"telegram_contacts"}

{"index": {"_type": "_doc", "_id": "Z7cy4m8BAObvGO9GALME"}}
{"message":"{\"_\":\"user\",\"pFlags\":{\"contact\":true},\"flags\":2143,\"id\":323586643,\"access_hash\":\"8315858910992970114\",\"first_name\":\"bv\",\"last_name\":\"nj\",\"username\":\"kj\",\"phone\":\"123\",\"status\":{\"_\":\"userStatusRecently\"}}","phone":"123","@version":"1","typ":"telegram_contacts","access_hash":"8315858910992970114","id":323586643,"@timestamp":"2020-01-26T13:53:29.469Z","path":"/home/user/mirror_01/users_5d6ca02e7e736a7fc700df8c.log","username":"mbnab","type":"redis","flags":2143,"host":"ubuntu","imported_from":"telegram_contacts"}

I have a few questions regarding this:

  1. Is this a valid JSON file?
  2. Can Python process a file of this size? Or should I convert it somehow to an Access or Excel file?

I found some useful SO posts, but I still need help.

You can work through the file line by line and extract the information you need.

with open('largefile.txt','r') as f:
    for line in f:
        # Extract what you need from that line of text here
        print(line)
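
On the first question: judging by the sample, the file as a whole is not a single JSON document, but each non-blank line looks like a complete JSON object (a layout often called JSON Lines or newline-delimited JSON). Here is a minimal sketch for checking that on your own file. `count_json_lines` is a hypothetical helper name, not part of any library.

```python
import json

def count_json_lines(lines):
    """Count lines that parse as JSON and lines that do not.

    `lines` can be any iterable of strings, including an open file,
    so only one line is held in memory at a time.
    """
    valid = invalid = 0
    for line in lines:
        line = line.strip()
        if not line:  # skip blank separator lines
            continue
        try:
            json.loads(line)
            valid += 1
        except json.JSONDecodeError:
            invalid += 1
    return valid, invalid

# Usage sketch ('largefile.txt' is the placeholder name used above):
# with open('largefile.txt', 'r') as f:
#     print(count_json_lines(f))
```

If the second number comes back as zero, every non-blank line is valid JSON on its own, and the line-by-line approach should work.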

For example, to interpret each line as JSON and read it in as a dictionary:

import json

with open('largefile.txt', 'r') as f:
    for line in f:
        if line.strip():  # check there is something on the line
            # Interpret the string as JSON and read it in as a dictionary
            data = json.loads(line)
            # In your case, "message" is itself a JSON string, so decode it too
            if 'message' in data:
                data['message'] = json.loads(data['message'])
            # Extract the information you need here
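
The sample in the question resembles Elasticsearch bulk format, where an action line (`{"index": ...}`) is followed by the document it refers to. Assuming that alternation holds for the whole file, which I have not verified, a generator along these lines could pair each `_id` with its document. `iter_documents` is a hypothetical name chosen for illustration.

```python
import json

def iter_documents(lines):
    """Yield (doc_id, document) pairs from bulk-style lines.

    Assumes every document line is preceded by an action line of the
    form {"index": {"_id": ...}}; blank lines are ignored.
    """
    doc_id = None
    for line in lines:
        if not line.strip():
            continue
        data = json.loads(line)
        if 'index' in data and isinstance(data['index'], dict):
            # Action line: remember the id for the next document.
            doc_id = data['index'].get('_id')
            continue
        # Document line: decode the nested JSON string in "message".
        if isinstance(data.get('message'), str):
            data['message'] = json.loads(data['message'])
        yield doc_id, data
        doc_id = None
```

Because it is a generator, nothing accumulates in memory. You would call it as `for doc_id, doc in iter_documents(f):` inside the `with open(...)` block.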

I expect there's a lot more work to extract the information you need, but I hope this gets you started. Good luck!
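
As for converting to Access or Excel: an Excel worksheet holds at most 1,048,576 rows, so a file of this size would likely not fit in one sheet. It may be easier to stream only the fields you need into a CSV file, which both programs can import. The sketch below assumes you want the `id`, `phone` and `username` fields seen in the sample. The function name `write_fields_csv` and the choice of fields are illustrative only.

```python
import csv
import io
import json

def write_fields_csv(lines, out, fields=('id', 'phone', 'username')):
    """Write the chosen fields of each document line to `out` as CSV.

    Lines containing none of the fields (for example the {"index": ...}
    action lines) are skipped; missing fields are left empty.
    Returns the number of data rows written.
    """
    writer = csv.DictWriter(out, fieldnames=list(fields))
    writer.writeheader()
    rows = 0
    for line in lines:
        if not line.strip():
            continue
        data = json.loads(line)
        if not any(field in data for field in fields):
            continue
        writer.writerow({field: data.get(field, '') for field in fields})
        rows += 1
    return rows

# Small in-memory demonstration. For the real file, pass open file objects
# instead, opening the output with newline='' as the csv module recommends.
demo_out = io.StringIO()
demo_rows = write_fields_csv(['{"id": 1, "phone": "12"}\n'], demo_out)
```

Like the loops above, this reads one line at a time, so memory use does not grow with the size of the input.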
