I'm trying to read from a file that is currently being written to at high speeds by a different python script. There are ~ 70,000 lines in the file. When I try to read in the lines, I generally get to ~7,750 before my application exits.
I think this is due to the file being written to (append only). I have processed larger files (20k lines), but only while not being written to.
What steps can I take to troubleshoot further? How can I read from this file, despite it currently being written to?
I'm new-ish to Python. Any/all help is appreciated.
tweets_data = []
tweets_file = open(tweets_data_path, "r")
i = 0
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
        i += 1
        if i % 250 == 0:
            print i
    except:
        continue

## Total # of tweets captured
print len(tweets_data)
Traceback: I get this for every failed read

Traceback (most recent call last):
  File "data-parser.py", line 33, in <module>
    tweet = json.loads(line)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
UPDATE:
I've modified my code to follow the suggestions put forth by @JanVlcinsky, and I've identified that the issue is not that the file is being written to. In the code below, if I comment out tweets_data.append(tweet), or if I add a condition so that tweets are only added to the array half as often, my program works as expected. However, if I try to read in all ~90,000 lines, my application exits prematurely.
tweets_data = []
with open(tweets_data_path) as f:
    for i, line in enumerate(f):
        if i % 1000 == 0:
            print "line check: ", str(i)
        try:
            ## Skip "newline" entries
            if i % 2 == 1:
                continue
            ## Load tweets into array
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets: ", len(tweets_data)
print str(tweets_data[0]['text'])
Premature Exit Output:
When loading every valid line into the array...
...
line check: 41000
line check: 42000
line check: 43000
line check: 44000
line check: 45000
dannyb@twitter-data-mining:/var/www/cmd$
When loading every other valid line into the array...
...
line check: 86000
line check: 87000
line check: 88000
dannyb@twitter-data-mining:/var/www/cmd$
When loading every third valid line into the array...
...
line check: 98000
line check: 99000
line check: 100000
line check: 101000
decoded tweets: 16986
Ultimately, this leads me to believe the issue is related to the size of the array and my available resources (I'm on a VPS with 1 GB RAM).
FINAL: Doubling the RAM fixed this issue. It appears that my Python script was exceeding the amount of RAM made available to it. As a follow-up, I've started looking at ways to improve in-memory RAM efficiency, and ways to increase the total amount of RAM available to my script.
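One way to cut in-memory size (a sketch of my own, not the poster's code): each fully parsed tweet dict carries dozens of fields, so keeping only the fields you actually use lets the same amount of RAM hold far more tweets. The 'text' field below matches the tweets_data[0]['text'] access in the update; the helper name and sample lines are illustrative.

```python
import json

def extract_texts(lines):
    """Yield only the 'text' field of each decodable tweet,
    instead of keeping every full parsed dict in memory."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank/"newline" entries
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip lines that fail to decode
        yield tweet.get('text', '')

# In-memory lines standing in for the tweets file:
sample = ['{"text": "hello"}', '', '{"text": "world"}', 'not json']
texts = list(extract_texts(sample))
print(len(texts))  # -> 2
```

Because extract_texts is a generator, nothing is accumulated unless you choose to build a list, so it also composes with processing tweets one at a time.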
I think your plan to read tweets from a continually appended file shall work. There will probably be some surprises in your code, as you will see.

Modify your code as follows:
import json

tweets_data = []
with open("tweets.txt") as f:
    for i, line in enumerate(f):
        if i % 250 == 0:
            print i
        line = line.strip()
        # skipping empty lines
        if not len(line):
            continue
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except MemoryError as e:
            print "Run out of memory, bye."
            raise e
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets", len(tweets_data)
The modifications:

- with open(...) as f: it is simply good habit, ensuring the file is closed regardless of what happens after opening it.
- for i, line in enumerate(f): enumerate generates a growing number for each line iterated from f. If you incremented your counter only after a successful json.loads (as your i += 1 did), you could miss counting the lines which failed decoding.
- except Exception as e: print e: it is a bad habit to catch whatever exception as you did before, as the valuable information about the problem is hidden from your eyes. You will see in your real run that the printed exceptions will help you to understand the problem.

EDIT: added skipping of empty lines (not relying on empty lines being regularly present). Also added a direct catch for MemoryError to complain in case we run out of RAM.
EDIT2: rewritten to use a list comprehension (not sure if this would optimize RAM usage). It assumes all non-empty lines are valid JSON strings, and it does not print progress reports:
import json

with open("tweets.txt") as f:
    tweets_data = [json.loads(line)
                   for line in f
                   if len(line.strip())]

## Total # of tweets captured
print "decoded tweets", len(tweets_data)
It will probably run faster than the previous version, as there are no append operations.
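Both versions above stop at end-of-file, while the original question asked about reading a file that another script is still appending to. A polling "tail -f"-style reader (my sketch, not part of the answer; follow, poll_interval, and max_idle are made-up names) could decode tweets as they arrive instead of loading everything at once:

```python
import time

def follow(f, poll_interval=1.0, max_idle=None):
    """Yield complete lines from file object f as they are appended.

    A hand-rolled tail -f: partial lines (no trailing newline yet,
    i.e. the writer is mid-write) are buffered until completed.
    Stops after max_idle consecutive empty polls; None means run forever.
    """
    idle = 0
    buf = ''
    while True:
        chunk = f.readline()
        if chunk:
            buf += chunk
            if buf.endswith('\n'):
                yield buf
                buf = ''
            idle = 0
        else:
            idle += 1
            if max_idle is not None and idle >= max_idle:
                return
            time.sleep(poll_interval)

# Usage sketch: process one tweet at a time, no big in-memory list.
# with open("tweets.txt") as f:
#     for line in follow(f):
#         if line.strip():
#             tweet = json.loads(line)
#             ...  # handle the tweet
```

Since each tweet is handled as it is read, this also sidesteps the RAM problem from the question's final update.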