
Preprocessing 400 million tweets in Python — faster

I have 400 million tweets (actually I think it's closer to 450 million, but never mind), in the form:

T    "timestamp"
U    "username"
W    "actual tweet"

I want to write them to a file first, in the form "username \t tweet", and then load that into a DB. The problem is that before loading into the DB, there are a few things I need to do:

1. Preprocess the tweet to remove RT@[names] and URLs.
2. Take the username out of "http://twitter.com/username".

I am using Python, and this is the code. Please let me know how this can be made faster :)

'''The aim is to take all the tweets of a user and store them in a table.
   Do this for all the users, and then let's see what we can do with it.
   The idea is to get enough information about a user so that you can profile them better. So, let's get started.
'''
def regexSub(line):
    line = re.sub(regRT,'',line)
    line = re.sub(regAt,'',line)
    line = line.lstrip(' ')
    line = re.sub(regHttp,'',line)
    return line
def userName(line):
    return line.split('http://twitter.com/')[1]


import sys,os,itertools,re
data = open(sys.argv[1],'r')
processed = open(sys.argv[2],'w')
global regRT 
regRT = 'RT'
global regHttp 
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
global regAt 
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')

for line1,line2,line3 in itertools.izip_longest(*[data]*3):
    line1 = line1.split('\t')[1]
    line2 = line2.split('\t')[1]
    line3 = line3.split('\t')[1]

    #print 'line1',line1
    #print 'line2=',line2
    #print 'line3=',line3
    #print 'line3 before preprocessing',line3
    try:
        tweet=regexSub(line3)
        user = userName(line2)
    except:
        print 'Line2 is ',line2
        print 'Line3 is',line3

    #print 'line3 after processig',line3
    processed.write(user.strip("\n")+"\t"+tweet)

I ran the code in the following manner:

python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt

This is the output I get (warning: it's pretty big, and I did not filter out the small values, thinking they may also hold some significance):

Sat Jan  7 03:28:51 2012    profile_dump

         3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
528840744  166.402    0.000  166.402    0.000 {method 'split' of 'str' objects}
396630560   81.300    0.000   81.300    0.000 {method 'get' of 'dict' objects}
396630560  326.349    0.000  439.737    0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558  255.662    0.000 1297.705    0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558  602.307    0.000  602.307    0.000 {built-in method sub}
264420442   32.087    0.000   32.087    0.000 {isinstance}
132210186   34.700    0.000   34.700    0.000 {method 'lstrip' of 'str' objects}
132210186   27.296    0.000   27.296    0.000 {method 'strip' of 'str' objects}
132210186  181.287    0.000 1513.691    0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186   79.950    0.000   79.950    0.000 {method 'write' of 'file' objects}
132210186   55.900    0.000  113.960    0.000 TwitterScripts/Preprocessing.py:10(userName)
  313/304    0.000    0.000    0.000    0.000 {len}

I removed the entries that were really low (like 1, 3 and so on).

Please tell me what other changes can be made. Thanks !

This is what multiprocessing is for.

You have a pipeline that can be broken into a large number of small steps. Each step is a Process which gets an item from its pipe, does a small transformation, and puts an intermediate result into the next pipe.

You'll have a Process which reads the raw file three lines at a time, and then puts the three lines into a Pipe. That's all.

You'll have a Process which gets a (T,U,W) triple from the pipe, cleans up the user line, and puts it into the next pipe.

Etc., etc.

Don't build too many steps to start with. Read - transform - Write is a good beginning to be sure you understand the multiprocessing module. After that, it's an empirical study to find out what the optimum mix of processing steps is.
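
As a rough illustration, here is a minimal read → transform → write sketch (it uses multiprocessing.Queue rather than raw Pipes to keep the wiring short; the file names, queue sizes, and function names are placeholders, and the actual cleanup is left as a comment):

import multiprocessing as mp

def reader(in_path, out_q):
    # Read the raw file three lines at a time (T, U, W) and push each triple.
    with open(in_path) as src:
        while True:
            triple = [src.readline() for _ in range(3)]
            if not triple[0]:
                break
            out_q.put(triple)
    out_q.put(None)                      # sentinel: no more data

def transformer(in_q, out_q):
    # Clean each triple (this is where regexSub/userName would go).
    while True:
        triple = in_q.get()
        if triple is None:
            out_q.put(None)
            break
        t_line, u_line, w_line = triple
        # ... strip RT/@names/URLs from w_line, pull the user out of u_line ...
        out_q.put(u_line.strip() + '\t' + w_line)

def writer(out_path, in_q):
    with open(out_path, 'w') as dst:
        while True:
            item = in_q.get()
            if item is None:
                break
            dst.write(item)

if __name__ == '__main__':
    raw_q, clean_q = mp.Queue(maxsize=1000), mp.Queue(maxsize=1000)
    procs = [mp.Process(target=reader, args=('tweets.txt', raw_q)),
             mp.Process(target=transformer, args=(raw_q, clean_q)),
             mp.Process(target=writer, args=('processed.txt', clean_q))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Once this three-stage version works, you can experiment with several transformer processes feeding the same output queue.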

When you fire this thing up, it will spawn a number of communicating sequential processes that will consume all of your CPU resources but process the file relatively quickly.

Generally, more processes working concurrently is faster. You eventually reach a limit because of OS overheads and memory limitations.

Until you run it through a profiler, it is difficult to know what needs to be changed. However, I would suggest that the most likely slowdowns occur where you are creating and running the regular expressions.

Since your file follows a specific format, you may see significant speed increases by using a lex+yacc combo. If you use a Python lex+yacc, you won't see as much of a speed increase, but you won't need to muck about with C code.

If this seems like overkill, try compiling the regular expressions before you start the loop. You can also have chunks of the file run by independent worker threads/processes.
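
For instance, in the question's code regHttp and regAt are already compiled, but regRT is a plain string, so every re.sub(regRT, ...) call goes through re's pattern-cache lookup (the re.py _compile entries in the profile). A minimal sketch with everything compiled once, using the compiled objects' sub method (patterns copied from the question):

import re

# Compile every pattern once, before the loop.
regRT   = re.compile(r'RT')
regAt   = re.compile(r'@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')
regHttp = re.compile(r'(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')

def regexSub(line):
    # Calling sub on the compiled objects skips re.py's per-call lookup.
    line = regRT.sub('', line)
    line = regAt.sub('', line)
    line = line.lstrip(' ')
    return regHttp.sub('', line)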

Again though, profiling will reveal what actually is causing the bottleneck. Find that out first, then see if these options will solve the problem.

str.lstrip is probably not doing what you were expecting:

>>> 'http://twitter.com/twitty'.lstrip('http://twitter.com/')
'y'

from the docs:

S.lstrip([chars]) -> string or unicode

Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
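
If the goal is simply to drop that fixed prefix, slicing off its length (or splitting on it, as the question's userName already does) avoids the character-set surprise:

>>> url = 'http://twitter.com/twitty'
>>> prefix = 'http://twitter.com/'
>>> url[len(prefix):]            # slice off the fixed prefix
'twitty'
>>> url.split(prefix)[1]         # what userName() in the question does
'twitty'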

Looking at the profiling information, you're spending a lot of time in regexSub. You may find that you can combine your regexps into a single one, and do a single substitution.

Something like:

regAll = re.compile(r'RT|(^[ \t]+)|((http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?)|...')

(The intention is to replace not only all the things you are doing with re.sub, but also the lstrip.) I've ended the pattern with ...; you'll have to fill in the details yourself.

Then replace regexSub with just:

line = regAll.sub('', line)

Of course, only profiling will show if this is faster, but I expect that it will as there will be fewer intermediate strings being generated.
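
Purely as an illustration of the single-pass idea, here is one hypothetical way of joining the question's three patterns (not necessarily the pattern the answer has in mind; the quirks of the originals, such as the unescaped dots, are kept):

import re

regAll = re.compile(r'RT'
                    r'|^[ \t]+'
                    r'|@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*'
                    r'|(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')

def regexSub(line):
    # one pass instead of three sub() calls plus lstrip()
    return regAll.sub('', line)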
