
How can I iterate through the Python list, and stop to load the next URL before continuing the process?

I have learned (using Python) to build a few different web scrapers. Their purpose is to scrape image URLs from the website of one of our parts manufacturers so that we can mass-upload a load sheet of products, one column of which holds the image URLs.

The URLs aren't simple: I can't just iterate through a list of product numbers and append each one to a base URL, nor use any of the other simpler methods (I'm here because I have to be). The site also has no "search by product number" function, so I turned to the lists feature on their site, which has some really handy tools. You can add products by product number, and when you're done you can export the list as a .csv, with the option to include links to all of the corresponding product pages. That was great until I built my script and found out the hard way that each list has a 250-item limit. For perspective, I have a little under 5,000 products to scrape, meaning I will need about 20 lists, with 19 full and the last one nearly full.

I mention all of this because the context is relevant to the code and the issue at hand.

Now that I really have no other options, my goal is to modify my code a bit so that it scrapes through 20 separate lists. Right now, at the relevant stage, it gets the URL of a list on their website that I have named testlist, then refreshes the page just to make sure all of the elements are in order.

That was the right page when I needed only one list, but here is issue one: we can't use a single link anymore. We will have to set something up that adds 250 items and then creates a new list, about 20 times (or I can manually create the lists and have specific URLs to point to).

The second issue is the item limit itself. My for loop is one large loop designed to iterate through the entire list of about 4,800 product numbers that I have, adding them one by one to the list on the same page. We need to break this up into chunks of at most 250 items per page and have it load another list URL after each chunk. I could create those lists manually so that I would have specific URLs to point to, but if it is easier to add a function that clicks to create a list and names it, that would be awesome. I can probably figure that part out myself.

I don't know where to go from here. I have code that is built to handle one website list, on one URL: it iterates through the product numbers in my Python list and then exports the website list at the end.

I need my script to iterate through the same Python list, stopping after every 250 product numbers to load the next URL before continuing the process.

The part of my code that gets our list URL and then continues into the scraper portion is as follows.


# get the url for our list
listurl = 'https://www.thewebsiteimscraping.com/products/list-manager?listid=3925' # <- this is the URL for one particular list; other lists will have different list IDs
alert_accept()
driver.get(listurl)
alert_accept()

############################################################################################

driver.refresh()
# import our list, the Select function, the By function for selections, expected conditions, and our time function so we can sleep 
from kiberlist import mfrnumbers
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
# from testnumber import testnumbers as tlnum 


for number in mfrnumbers:

    # we find the listactions menu, and utilize the "add item" option
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#listActions")))
    alert_accept()
    print('Finding listactions...')
    select_am = Select(driver.find_element_by_css_selector('#listActions'))
    alert_accept()
    print("Found it. Selecting...")
    select_am.select_by_value('addItems')
    print('Selected. Next...')

    # paste our item number into the box
    print('Locating model number search....')
    inputidbox = driver.find_element_by_id('model-number-search')
    print('Located? Pasting model number...')
    inputidbox.send_keys(number)

    # finally add our item
    additembutton = driver.find_element_by_css_selector('.gtmAddItemToList')
    print('Located add item button...')
    additembutton.click()
    print('Item number added. Next...')
    print('Locating blank space...')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#addItemsToListModal > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > button:nth-child(1) > svg:nth-child(1) > path:nth-child(1)")))
    time.sleep(1)
    xbutton = driver.find_element_by_css_selector('#addItemsToListModal > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > button:nth-child(1) > svg:nth-child(1) > path:nth-child(1)')
    xbutton.click()
    time.sleep(1)

# now we find the "export excel" option to get our csv for that list  
listactions = Select(driver.find_element_by_css_selector('#listActions'))
listactions.select_by_value('exportExcel')

# clicky clicky. user dialog will show up on screen asking if you want to save the file. user must manually click on save 
exportbutton = driver.find_element_by_css_selector('#btnExportToExcel')
exportbutton.click()

My question is: how can I rearrange and/or modify this code to accomplish what I need? Is this the most efficient way to do it? What would you do, how would you handle this, and, if there are no better options, what code can I implement to achieve my goal?

Sharing the actual website links would be pretty useless, since you need an account with them to access lists and such.

It seems like you have code that works for a single list, and now you just want it to work over smaller portions of that list.

Usually you see "convert a list of lists into one flat list". This is the opposite.

I'm assuming mfrnumbers is your flat list. We'll create a generator function that, given one flattened_list, yields a list_id together with the elements that belong in that list. As stated in your question, you'll figure out how to actually get that list. So for now, I'm assuming the list_id is a simple integer.

This function, get_list(mfrnumbers), returns those numbers in groups of at most max_items_per_list. Technically, it returns an iterator (a generator) that you then iterate over.

def get_list(flattened_list, max_items_per_list=250):
    # maybe you have some pattern for list names?
    list_id = 1

    while len(flattened_list) > 0:
        current_list = flattened_list[:max_items_per_list]
        yield list_id, current_list

        flattened_list = flattened_list[len(current_list):]
        list_id += 1

And we can call this function as follows:

for (myid, mylist) in get_list([1,2,3,4,5], max_items_per_list=2):
    print (myid, mylist)

Output:

1 [1, 2]
2 [3, 4]
3 [5]
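
A variation on the same idea, as a minimal sketch: stepping through the list by index with range() avoids re-slicing the remainder on every pass. The name chunked and the 4,800 placeholder numbers are made up for illustration; only the 250-item limit comes from the question.

```python
def chunked(items, size=250):
    """Yield (list_id, chunk) pairs, with list_id starting at 1."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # range() produces the start index of each chunk: 0, size, 2*size, ...
    for list_id, start in enumerate(range(0, len(items), size), start=1):
        yield list_id, items[start:start + size]


# Placeholder data standing in for mfrnumbers (assumed to hold 4,800 entries).
fake_numbers = [f"PN-{i:04d}" for i in range(4800)]
chunks = list(chunked(fake_numbers))

print(len(chunks))         # number of website lists needed -> 20
print(len(chunks[0][1]))   # size of a full list -> 250
print(len(chunks[-1][1]))  # size of the last, partial list -> 50
```

With exactly 4,800 numbers this gives 19 full lists and a final list of 50 items, so the number of list URLs to prepare can be checked before the browser is ever started.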

So in your case, you would keep your entire big loop, for number in mfrnumbers, but run it as an inner loop over the output of get_list.

for (myid, mylist) in get_list(mfrnumbers):
    # stop and do any loading for this new list...
    for number in mylist:
        ...  # the body of your original for loop goes here
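
Putting the pieces together might look like the sketch below. It assumes the lists were created by hand beforehand, so their URLs are known up front. The list IDs other than 3925 are invented, and load_list, add_item and export_list are hypothetical callbacks that would wrap the Selenium steps from the question (driver.get plus refresh, the add-item sequence, and the export click). get_list is repeated so the block runs on its own.

```python
def get_list(flattened_list, max_items_per_list=250):
    list_id = 1
    while len(flattened_list) > 0:
        current_list = flattened_list[:max_items_per_list]
        yield list_id, current_list
        flattened_list = flattened_list[len(current_list):]
        list_id += 1


def process_in_batches(numbers, list_urls, load_list, add_item, export_list,
                       max_items_per_list=250):
    """Fill one website list per chunk of numbers, exporting each before moving on."""
    needed = -(-len(numbers) // max_items_per_list)  # ceiling division
    if needed > len(list_urls):
        raise ValueError(f"need {needed} list URLs, got {len(list_urls)}")

    # Pair each chunk with the URL of the website list that will hold it.
    for (_list_id, chunk), url in zip(get_list(numbers, max_items_per_list), list_urls):
        load_list(url)        # e.g. driver.get(url) followed by driver.refresh()
        for number in chunk:
            add_item(number)  # the body of the original for loop
        export_list()         # the exportExcel steps, once per list


# Dry run with recording stand-ins instead of a real browser.
BASE = "https://www.thewebsiteimscraping.com/products/list-manager?listid="
urls = [BASE + str(i) for i in (3925, 3926, 3927)]  # 3926 and 3927 are made-up IDs
log = []
process_in_batches(
    numbers=["A1", "A2", "A3", "A4", "A5"],
    list_urls=urls,
    load_list=lambda url: log.append(("load", url)),
    add_item=lambda n: log.append(("add", n)),
    export_list=lambda: log.append(("export",)),
    max_items_per_list=2,
)
for entry in log:
    print(entry)
```

Passing the browser actions in as callbacks keeps the batching logic testable without Selenium; in the real script they would simply be small functions that close over driver.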


 