Most Efficient Way to iteratively filter a Pandas dataframe given a list of values
What's the most efficient way to filter values out of a list based on the values in another list
I currently create the list like this:
stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print('These are the stopwords: %s' % stopwords)
def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords

    Args:
        string (str): input string

    Returns:
        list: a list of tokens without stopwords
    """
    res = list()
    for word in simpleTokenize(string):
        if word not in stopwords:
            res.append(word)
    return res
simpleTokenize is just a basic string-splitting function that returns a list of strings.
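simpleTokenize is not shown in the question; a minimal sketch, assuming it simply lowercases the input and splits on non-word characters, might look like this (the real function may differ):

```python
import re

def simpleTokenize(string):
    # Hypothetical stand-in for the question's simpleTokenize:
    # lowercase, split on runs of non-word characters, drop empty strings.
    return [token for token in re.split(r'\W+', string.lower()) if token]
```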
This works. If you want it in a more "Pythonic" way (one line of code instead of four), you can use a list comprehension:
res = [word for word in simpleTokenize(string) if word not in stopwords]
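For example, with a toy stopword set and a plain str.split stand-in for simpleTokenize (both illustrative, not from the question), the comprehension behaves like this:

```python
stopwords = {'the', 'a', 'of'}  # toy stopword set for illustration

def simpleTokenize(string):
    return string.lower().split()  # stand-in tokenizer, not the real one

def tokenize(string):
    # One-line equivalent of the loop-and-append version above
    return [word for word in simpleTokenize(string) if word not in stopwords]

print(tokenize('the quick brown fox'))  # → ['quick', 'brown', 'fox']
```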
You are already using a set, which gives the biggest speedup potential (based on the question title, I assume your original code tested membership via list.__contains__, which is O(n) per lookup, whereas set membership is O(1) on average). The only other thing I can suggest is to make the function a generator, so you do not need to build the res list at all:
def tokenize(string):
    for word in simpleTokenize(string):
        if word not in stopwords:
            yield word
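The generator version yields tokens lazily; the caller materializes a list only when one is actually needed. A small self-contained example, using an illustrative stopword set and tokenizer (not the question's actual ones):

```python
stopwords = {'is', 'an'}  # toy stopword set for illustration

def simpleTokenize(string):
    return string.lower().split()  # stand-in tokenizer, not the real one

def tokenize(string):
    # Generator: yields one filtered token at a time instead of
    # building an intermediate list.
    for word in simpleTokenize(string):
        if word not in stopwords:
            yield word

print(list(tokenize('this is an example')))  # → ['this', 'example']
```

Note that a generator can only be iterated once; call list(tokenize(...)) if you need to reuse the result.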
You can use the filter function:
stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print('These are the stopwords: %s' % stopwords)

def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords

    Args:
        string (str): input string

    Returns:
        list: a list of tokens without stopwords
    """
    return list(filter(lambda x: x not in stopwords, simpleTokenize(string)))
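One caveat: in Python 3, filter returns a lazy iterator rather than a list, so wrapping it in list() is what keeps the docstring's promise of returning a list. A quick sanity check with a toy stopword set and tokenizer (both illustrative, not from the question):

```python
stopwords = {'and', 'or'}  # toy stopword set for illustration

def simpleTokenize(string):
    return string.lower().split()  # stand-in tokenizer, not the real one

def tokenize(string):
    # list() forces the Python 3 filter iterator into the list the
    # docstring promises; in Python 2, filter returned a list directly.
    return list(filter(lambda x: x not in stopwords, simpleTokenize(string)))

print(tokenize('cats and dogs'))  # → ['cats', 'dogs']
```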