
Line-by-line check for a large number of keywords with Python

I am iterating through many CSV files of 1000 to 3000 lines each, checking for every line whether one of 70,000 keywords is contained in a text of 140 characters. My problem at the moment is that my code runs extremely slowly, I guess because of the many iterations. I am a relatively new programmer and not sure what the best way to speed it up is. It took two hours to check one entire file, and there are still many more I need to go through. My logic at the moment is: import the CSV as a list of lists -> for each list in the list, take the first element and search for each of the 70,000 keywords to see whether it is mentioned.

Currently my code looks like the following:

import re
import csv


def findname(lst_names, text):
  for name in lst_names:
    name_match = re.search(r'@' + str(name), text)
    if name_match:
      return name

lst_users = importusr_lst('users.csv')  # defined function to import 70000 keywords
lst_successes = []
with open(file, 'rb') as csvfile:
  filereader = csv.reader(csvfile, delimiter=',')
  content = []

  for row in filereader:
    content.append(row)
  if len(content) > 1:
    for row in content:
      mentioned = findname(lst_users, row[0])  # row[0] is the text of 140 characters

      if mentioned:
        hit = row[1:7]
        hit.append(mentioned)
        lst_successes.append(hit)

The input is a list of tweets with data about this tweet. One row contains the following information:

Tweet_text,Tweet_id,Tweet_date,Tweet_fav_count,Tweet_retweet_count,Replied_to_user_id,Replied_to_stats_id,author_name,user_name

One example could be:

"This is an awesome tweet @username.",576819939086041089,2015-03-14,18:59:24,0,2,4,jjwniemzok,jjwniemzok

Keywords are usernames on Twitter. Thanks for any help!
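For reference, importusr_lst is not shown above; a minimal sketch of such a loader, assuming users.csv holds one username per row in its first column, could be:

import csv

def importusr_lst(filename):
    # hypothetical loader: collect the first column of every row
    with open(filename, 'rb') as f:
        return [row[0] for row in csv.reader(f)]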

First, make lst_names into a set, if it isn't one already, so that name in lst_names checks run in expected constant time. Then, for each tweet, instead of iterating through all the names and looking for each one specifically, look for any name:

names_set = set(lst_names)
# ...
name_match = re.search(r'@(\w+)\b', text)
if name_match:
    name = name_match.group(1)
    if name in names_set:
        return name

(I'm assuming Twitter names match \w+ here.)

You might also want to compile the regex in advance; see Tomalak's answer.
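Putting both ideas together, a minimal sketch of how this could replace the original loop (assuming Python 2 as in the question, with lst_users and file defined as in your code):

import csv
import re

MENTION = re.compile(r'@(\w+)\b')  # compiled once, reused for every tweet

def findname(names_set, text):
    # check every @mention in the tweet against the known names
    for match in MENTION.finditer(text):
        if match.group(1) in names_set:
            return match.group(1)
    return None

names_set = set(lst_users)  # built once, before the file loop
lst_successes = []
with open(file, 'rb') as csvfile:
    for row in csv.reader(csvfile, delimiter=','):
        mentioned = findname(names_set, row[0])
        if mentioned:
            lst_successes.append(row[1:7] + [mentioned])

This scans each tweet once for all of its mentions instead of scanning it 70,000 times, once per keyword.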

I will make some test tweet data, assuming that a user name in the tweet text is immediately preceded by an '@' symbol, e.g. a tweet might read 'something cool @someone1 @someone2 something else cool @someone3':

import numpy as np
import random
import string

tweet_templates = [['askdjaklsd {0} akdjsakd {1}', 2], ['alskalkals {0}', 1], ['{0} kadksjdss {1} {2}', 3]]
some_names      = np.array(['@' + ''.join(random.sample(string.letters, 5)) for i in xrange(70000)])  # large number of possible user names
template_i      = np.random.randint(0, 3, 30000)  # 30000 tweets
tweets          = [tweet_templates[t][0].format(*some_names[np.random.randint(0, len(some_names), tweet_templates[t][1])]) for t in template_i]

In your case, when loading the text from the csv, I would use numpy.loadtxt (personal choice):

#tweet_data = np.loadtxt( 'tweet_file.csv', delimiter=',', dtype=str) 
# there are options to ignore headers etc.
#tweets = tweet_data[:,0] # first column

Now that we have the data, isolate the names in each row:

tweets_split = map( lambda x : x.split(), tweets )
tweet_names = map( lambda y: filter( lambda x : x[0] == '@', y ), tweets_split )
print tweet_names
#[['@msUnu', '@KvUqA'], ['@GknKr'], ['@Hxbfe'],  ...
tweet_names = map( lambda y: map( lambda x : x.split('@')[-1], y ), tweet_names )
print tweet_names
#[['msUnu', 'KvUqA'], ['GknKr'], ['Hxbfe'], 

Then make a list where each element is a sublist [name, tweet_row], where name is the Twitter user's name and tweet_row is the row of the tweets data in which the name was found.

tweet_names_info = [ map( lambda n : [ n,ind ] , tweet_names[ind] ) for ind in xrange( len(tweets) ) ]
tweet_names_info = [ sl for sublist in tweet_names_info for sl in sublist]

Group this list according to the names:

from itertools import groupby
tweet_names_info.sort()  # groupby only groups consecutive items, so sort by name first
tweet_names_grouped = [[k, list(np.array(list(g))[:, 1].astype(int))] for k, g in groupby(tweet_names_info, lambda x: x[0])]
tweet_names_rows = dict(tweet_names_grouped)
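As a design note, the same name-to-rows mapping can be built without sorting or groupby by using a plain dictionary of lists; a minimal sketch with collections.defaultdict:

from collections import defaultdict

tweet_names_rows = defaultdict(list)
for ind, names in enumerate(tweet_names):
    for name in names:
        tweet_names_rows[name].append(ind)  # row numbers per user name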

Now you have a dictionary whose keys are the Twitter user names and whose values are the row numbers of the corresponding tweets. It should be easy to compare this dictionary to your list of users:

tweeters = tweet_names_rows.keys()
#lst_users = importusr_lst('users.csv')
#^ your function; I assume it loads a 1D array, so I will make up some user names
lst_users = np.array([''.join(random.sample(string.letters, 5)) for i in xrange(130000)])
users_who_tweeted = list(set(tweeters).intersection(set(lst_users)))

if users_who_tweeted:
    for u in users_who_tweeted:
        u_text = [tweets[i] for i in tweet_names_rows[u]]
        print 'USER %s WAS ON TWITTER:' % u
        print '\n'.join(u_text), '\n'

#USER ZeHLA WAS ON TWITTER:
#alskalkals @ZeHLA 

#USER jaZuG WAS ON TWITTER:
#@mjLTG kadksjdss @jaZuG @DJNjv 

#USER UVzSs WAS ON TWITTER:
#@tnGrH kadksjdss @DOBij @UVzSs 
#...
#...
