
Building large array of objects causes Python script to exit without logging an error

I'm trying to read from a file that is currently being written to, at high speed, by a different Python script. There are ~70,000 lines in the file. When I try to read in the lines, I generally get to ~7,750 before my application exits.

I think this is due to the file being written to (append only). I have processed larger files (20k lines), but only while they were not being written to.

What steps can I take to troubleshoot further? How can I read from this file, despite it currently being written to?

I'm new-ish to Python. Any/all help is appreciated.

import json

tweets_data = []
tweets_file = open(tweets_data_path, "r")
i = 0
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
        i += 1
        if i % 250 == 0:
            print i
    except:
        continue

## Total # of tweets captured
print len(tweets_data)

Environment:

  • Python 2.7
  • Ubuntu 14.04

Traceback: I get this for every read

    Traceback (most recent call last):
       File "data-parser.py", line 33, in <module>
         tweet = json.loads(line)
       File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
         return _default_decoder.decode(s)
       File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
       File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
         raise ValueError("No JSON object could be decoded")
    ValueError: No JSON object could be decoded
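
For what it's worth, a blank line (like the ones between entries in my file) is enough to reproduce this exact error with Python 2.7's json module:

    import json

    try:
        json.loads("")          # a blank line
    except ValueError as e:
        print e                 # prints: No JSON object could be decoded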

UPDATE:

I've modified my code to follow the suggestions put forth by @JanVlcinsky. I've identified that the issue is not that the file is being written to. In the code below, if I comment out tweets_data.append(tweet), or if I add a condition so that tweets are only added to the array half as often, my program works as expected. However, if I try to read in all ~90,000 lines, my application exits prematurely.

    import json

    tweets_data = []
    with open(tweets_data_path) as f:
        for i, line in enumerate(f):
            if i % 1000 == 0:
                print "line check: ", str(i)
            try:
                ## Skip "newline" entries
                if i % 2 == 1:
                    continue
                ## Load tweets into array
                tweet = json.loads(line)
                tweets_data.append(tweet)
            except Exception as e:
                print e
                continue

    ## Total # of tweets captured
    print "decoded tweets: ", len(tweets_data)
    print str(tweets_data[0]['text'])

Premature Exit Output:

When loading every valid line into the array...

...
line check:  41000
line check:  42000
line check:  43000
line check:  44000
line check:  45000
dannyb@twitter-data-mining:/var/www/cmd$

When loading every other valid line into the array...

...
line check:  86000
line check:  87000
line check:  88000
dannyb@twitter-data-mining:/var/www/cmd$

When loading every third valid line into the array...

...
line check:  98000
line check:  99000
line check:  100000
line check:  101000
decoded tweets:  16986

Ultimately this leads me to believe the issue is related to the size of the array and my available resources (I'm on a VPS with 1GB of RAM).
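
To confirm that, a rough check (assuming Linux, where resource.getrusage reports ru_maxrss in kilobytes) would be to print the process's peak memory use alongside the line counter; report_memory below is just a hypothetical helper that could replace the plain "line check" print:

    import resource

    def report_memory(i):
        # ru_maxrss is the peak resident set size; on Linux it is reported in kB
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print "line check: ", str(i), " peak RSS (kB): ", str(peak_kb)

If the kernel's OOM killer is terminating the process, dmesg should also show a matching "Out of memory" entry.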

FINAL: Doubling the RAM fixed this issue. It appears that my Python script was exceeding the amount of RAM made available to it. As a follow-up, I've started looking at ways to improve the script's memory efficiency, and ways to increase the total amount of RAM available to it.
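
One sketch of the former, assuming only a couple of fields (here 'text' and 'id', picked purely for illustration) are needed downstream, is to keep just those fields instead of the full decoded tweet objects:

    import json

    tweets_data = []
    with open(tweets_data_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except ValueError:
                continue
            # keep only the fields that are actually used, not the whole parsed dict
            tweets_data.append({"id": tweet.get("id"), "text": tweet.get("text")})

    print "decoded tweets: ", len(tweets_data)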

I think your plan to read tweets from a continually appended file will work.

There will probably be some surprises in your code, as you will see.

Modify your code as follows:

import json
tweets_data = []
with open("tweets.txt") as f:
    for i, line in enumerate(f):
        if i % 250 == 0:
            print i
        line = line.strip()
        # skipping empty lines
        if not len(line):
            continue
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except MemoryError as e:
            print "Run out of memory, bye."
            raise e
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets", len(tweets_data)

The modifications:

  • with open(...) as f: this is simply a good habit that guarantees the file gets closed regardless of what happens after opening it.
  • for i, line in enumerate(f): enumerate generates a growing index for each line read from f.
  • moving the print of every 250th line to the front. This may reveal that you really do read many lines, but that too many of them are not valid JSON objects. When the print was placed after json.loads, lines that failed decoding were never counted.
  • except Exception as e: it is a bad habit to catch every exception with a bare except, as you did before, because valuable information about the problem stays hidden from you. In a real run you will see that the printed exceptions help you understand the problem.

EDIT: added skipping of empty lines (not relying on empty lines appearing at regular positions).

Also added a direct catch for MemoryError, so the script complains if we run out of RAM.

EDIT2: rewritten to use a list comprehension (not sure whether this reduces RAM usage). It assumes all non-empty lines are valid JSON strings, and it does not print progress reports:

import json
with open("tweets.txt") as f:
    tweets_data = [json.loads(line)
                   for line in f
                   if len(line.strip())]

## Total # of tweets captured
print "decoded tweets", len(tweets_data)

It will probably run faster than the previous version, as there are no append operations.
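
If the tweets only need to be processed once (counted, filtered, or written somewhere else) rather than kept around, a generator expression would avoid building the list in memory at all; a minimal sketch under that assumption, here just counting the decoded tweets:

import json

with open("tweets.txt") as f:
    # the generator decodes lazily, one line at a time, while the file is open
    tweets = (json.loads(line) for line in f if len(line.strip()))
    decoded = sum(1 for tweet in tweets)

## Total # of tweets captured
print "decoded tweets", decoded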
