I created a list of stopwords like this:
stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords
def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords

    Args:
        string (str): input string

    Returns:
        list: a list of tokens without stopwords
    """
    res = list()
    for word in simpleTokenize(string):
        if word not in stopwords:
            res.append(word)
    return res
simpleTokenize is just a basic function that splits the string and returns a list of strings.
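For reference, a minimal sketch of what such a function might look like (the regex and lowercasing here are assumptions for illustration, not the asker's actual code):

import re

def simpleTokenize(string):
    # Hypothetical implementation: lowercase, split on runs of
    # non-word characters, and drop any empty tokens at the edges.
    return [token for token in re.split(r'\W+', string.lower()) if token]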
This is fine. If you want to do it in a more "Pythonic" way (one line of code instead of four), you could use a list comprehension:
res = [word for word in simpleTokenize(string) if word not in stopwords]
You are already using a set, which is the biggest potential speedup (based on the question title I was expecting your code to have a list.__contains__ test). The only remaining thing I can suggest is making your function a generator, so you don't need to create the res list:
def tokenize(string):
    for word in simpleTokenize(string):
        if word not in stopwords:
            yield word
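Keep in mind that a generator produces tokens lazily, so code that needs an actual list has to materialize it. A quick usage sketch, assuming the stopwords set and the simpleTokenize sketch above (the sample stopwords are made up for illustration):

stopwords = set(['the', 'a', 'an'])       # assumed sample stopwords

tokens = tokenize('the quick brown fox')  # generator object; nothing computed yet
print(list(tokens))                       # ['quick', 'brown', 'fox']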
You can use the built-in filter function:
def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords

    Args:
        string (str): input string

    Returns:
        list: a list of tokens without stopwords
    """
    return filter(lambda x: x not in stopwords, simpleTokenize(string))
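One caveat: the print statement in the question suggests Python 2, where filter returns a list as the docstring promises. In Python 3, filter returns a lazy iterator, so to keep the documented return type you would wrap the result (an assumption about your target version, not a change the original needs):

    return list(filter(lambda x: x not in stopwords, simpleTokenize(string)))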