Same code gives different output depending on whether it uses list comprehensions or generators
I am trying to scrape this website and get every word. But using generators gives me more words than using lists, and the words are inconsistent: sometimes I get 1 extra word, sometimes none, sometimes more than 30. I have read about generators in the Python documentation and looked up some questions about them. What I understand is that there should be no difference. I don't understand what is going on under the hood. I am using Python 3.6. I have also read "Generator Comprehension different output from list comprehension?", but I can't make sense of this case.
Here is the first function, which uses generators.
import re
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def text_cleaner1(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this

    lines = (line.strip() for line in text.splitlines())  # break into lines
    print(type(lines))

    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))  # break multi-headlines into a line each
    print(type(chunks))

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally
        return                                                          # throw an exception

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++
    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term
                      # existed or not on the website)
    return text
Here is the second function, which uses list comprehensions.
def text_cleaner2(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this

    lines = [line.strip() for line in text.splitlines()]  # break into lines

    chunks = [phrase.strip() for line in lines for phrase in line.split(" ")]  # break multi-headlines into a line each

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally
        return                                                          # throw an exception

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++
    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term
                      # existed or not on the website)
    return text
This code randomly gives me different results:
text_cleaner1("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae") - text_cleaner2("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae")
A generator is "lazy" - it doesn't execute its code at once, but executes it later, when the result is needed. This means it doesn't take the values from variables or functions immediately; it keeps references to those variables and functions instead.
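A minimal sketch of that late binding (the variable x here is purely illustrative, not from the question's code):

x = 1
gen = (x + n for n in range(3))  # nothing runs yet; gen only keeps a reference to x
x = 100                          # x changes before the generator is consumed
print(list(gen))                 # [100, 101, 102] - x is looked up only now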
Example from the linked question:
all_configs = [
    {'a': 1, 'b': 3},
    {'a': 2, 'b': 2}
]
unique_keys = ['a', 'b']

for x in zip(*([c[k] for k in unique_keys] for c in all_configs)):
    print(x)

print('---')

for x in zip(*((c[k] for k in unique_keys) for c in all_configs)):
    print(list(x))
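Running this (observed with CPython 3.x), the list-comprehension version pairs the values across configs, while the generator version reads everything from the final value of c:

(1, 2)
(3, 2)
---
[2, 2]
[2, 2]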
In the generator version you have a for loop inside another for loop. The inner generator gets a reference to c, not the value of c; it will fetch the value later.

Later (when zip has to get results from the generators), it starts executing the outer generator with for c in all_configs. As the outer generator runs, it loops and yields two inner generators that hold the reference to c, not the value of c, but the looping also keeps changing the value of c - so in the end you have a "list" of two inner generators, and {'a': 2, 'b': 2} in c.

After that, the inner generators are executed, and they finally fetch the value from c - but by then c already holds {'a': 2, 'b': 2}.
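One way to avoid this (my own addition, not part of the original answer) is to force each inner generator to run while c still holds the config it was created for, for example by wrapping it in tuple():

# tuple() evaluates the inner generator immediately, inside the outer loop,
# so each tuple is built from the current c; the output matches the
# list-comprehension version.
for x in zip(*(tuple(c[k] for k in unique_keys) for c in all_configs)):
    print(x)
# (1, 2)
# (3, 2)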
BTW: there is a similar problem with lambda in a for loop, for example when you create Buttons in tkinter.
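A minimal sketch of that tkinter pitfall (the names here are illustrative): every lambda holds a reference to the same loop variable i, so without the default argument all buttons would print the last value of i:

import tkinter as tk

root = tk.Tk()
for i in range(3):
    # `i=i` captures the current value of i when the lambda is created;
    # plain `lambda: print(i)` would print 2 for every button.
    tk.Button(root, text=str(i), command=lambda i=i: print(i)).pack()
root.mainloop()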