Same code gives different output depending on whether it uses list comprehensions or generators
I am trying to clean this website and get every word. But using generators gives me more words than using lists. Also, these words are inconsistent: sometimes there is one extra word, sometimes none, sometimes more than 30 extra words. I have read about generators in the Python documentation and looked up some questions about generators. From what I understand, it shouldn't make a difference. I don't understand what's going on under the hood. I am using Python 3.6. I have also read "Generator Comprehension different output from list comprehension?" but I can't understand the situation.
This is the first function, with generators.
import re
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def text_cleaner1(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this

    lines = (line.strip() for line in text.splitlines())  # break into lines
    print(type(lines))

    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))  # break multi-headlines into a line each
    print(type(chunks))

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally throw an exception
        return

    text = str(text)

    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++

    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)

    return text
This is the second function, with list comprehensions.
def text_cleaner2(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this

    lines = [line.strip() for line in text.splitlines()]  # break into lines

    chunks = [phrase.strip() for line in lines for phrase in line.split(" ")]  # break multi-headlines into a line each

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally throw an exception
        return

    text = str(text)

    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++

    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)

    return text
And this code gives me different results randomly:
text_cleaner1("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae") - text_cleaner2("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae")
A generator is "lazy": it doesn't execute its code immediately, but later, when the results are needed. This means it doesn't fetch values from variables or functions right away; it only keeps references to those variables and functions.
Example from the linked question:
all_configs = [
    {'a': 1, 'b': 3},
    {'a': 2, 'b': 2}
]
unique_keys = ['a', 'b']

for x in zip(*([c[k] for k in unique_keys] for c in all_configs)):
    print(x)

print('---')

for x in zip(*((c[k] for k in unique_keys) for c in all_configs)):
    print(list(x))
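Running this, the first loop (list comprehension inside) prints (1, 2) and (3, 2), while the second loop (generator inside) prints [2, 2] twice.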
In the generator version there is a for loop inside another for loop.

The internal generator gets a reference to c instead of the value of c, and it reads that value later.

Later (when it has to get results from the generators) it starts executing the external generator, for c in all_configs. As the external generator loops, it creates two internal generators which keep a reference to c, not the value of c; but the loop also keeps changing the value of c, so in the end you have a "list" of two internal generators while c holds {'a': 2, 'b': 2}.

After that the internal generators are executed, and only then do they read the value of c, but by that moment c already holds {'a': 2, 'b': 2}.
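The same late-binding effect can be shown without the dict example. Here is a minimal sketch (not from the original post; the variable names are made up) illustrating that a generator expression only looks up the loop variable when it is consumed, while a list comprehension evaluates it immediately:

values = []
for n in [1, 2, 3]:
    values.append((n * 10 for _ in range(1)))   # each generator keeps a reference to n

# The loop has finished, so n == 3; every generator now sees the final value.
print([next(g) for g in values])    # [30, 30, 30]

eager = []
for n in [1, 2, 3]:
    eager.append([n * 10 for _ in range(1)])    # list comprehension evaluates right away

print([lst[0] for lst in eager])    # [10, 20, 30]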
BTW: there is a similar problem with lambda in a for loop when you use it with Buttons in tkinter.
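For reference, a hedged sketch of that tkinter situation (the labels and callbacks here are illustrative, not from the original post): buttons created in a loop with command=lambda: ... all end up using the last loop value, and the usual workaround is to bind the current value as a default argument.

import tkinter as tk

root = tk.Tk()

for text in ["one", "two", "three"]:
    # BUG: the lambda keeps a reference to `text`, so every one of these buttons prints "three".
    tk.Button(root, text=text, command=lambda: print(text)).pack()

for text in ["one", "two", "three"]:
    # FIX: bind the current value as a default argument so each lambda gets its own copy.
    tk.Button(root, text=text, command=lambda text=text: print(text)).pack()

root.mainloop()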