
Looping through multiple URLs, overwriting data

I have successfully managed to pull all the information I need from a single URL, but I'm struggling to get it to loop through the various pages and pull the information from each one. Currently my code runs through all the different page iterations, but when I print it just rewrites the first page.

The URL shows 20 results per page, so page 1 ends in 0, page 2 ends in 20, page 3 ends in 40, and so on. This is why I add 20 to x each time.

When I print the URL on line 8 it returns each URL ending in 0, 20, 40, 60, 80 twice, so the list shows

https://xxxxx.com/0
https://xxxxx.com/0
https://xxxxx.com/20
https://xxxxx.com/20
https://xxxxx.com/40
https://xxxxx.com/40

I could even accept that, but when I print(info) or write info to a CSV, it just overwrites itself multiple times and only ever re-prints the results from https://xxxxx.com/0
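Side note on the doubled URLs: `for url in str(x)` iterates over the *characters* of the string, so for a two-digit offset the loop body runs twice and builds the same URL both times. A minimal sketch (using the same placeholder URL as above) shows the effect:

```python
x = 20
urls = []
# str(20) is "20", so this loop runs once per character: '2', then '0'
for ch in str(x):
    urls.append("https://xxxxx.com/0" + str(x))

print(urls)
# the same URL is built twice per page
```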

x = 0
while x < 100:
    x += 20


    for url in str(x):
        url = "https://xxxxx.com/0"+str(x)

        page = requests.get(url)

        soup = BeautifulSoup(page.content, "html.parser")
        lists = soup.find_all('li', class_="SearchPage__Result-gg133s-2 djuMQD")


        with open('C:\\Users\hay\Houses.csv', 'w', encoding='UTF8', newline="") as f:


            thewriter = writer(f)
            header = ['URL', 'address', 'price', 'beds', 'baths', 'ber']
            thewriter.writerow(header)


            for list in lists:

                url = list.find('a').attrs['href']
                address = list.find('p', class_="TitleBlock__Address-sc-1avkvav-8 dzihyY")
                price = list.find('div', class_="TitleBlock__Price-sc-1avkvav-4 hiFkJc")
                beds = list.find_all('p', class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
                baths = list.find('p data-testid="baths"', class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
                energyrating = list.find('div', class_="TitleBlock__BerContainer-sc-1avkvav-11 iXTpuT")

                info = [url, address, price, beds, baths, energyrating]
                thewriter.writerow(info)

The problem is that you are opening the file in w mode every time you iterate through the loop:

with open('C:\\Users\hay\Houses.csv', 'w', encoding='UTF8', newline="") as f:

The mode w creates a new file or truncates it if it already exists. If you want the information obtained in every iteration to be added to the end of the file you should use the a mode, which creates a new file if it doesn't exist or appends the new data to it if it already does.

with open('C:\\Users\hay\Houses.csv', 'a', encoding='UTF8', newline="") as f:
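To see the difference between the two modes in isolation (using a throwaway temp file rather than the real CSV path):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# 'w' truncates the file every time it is opened
with open(path, 'w') as f:
    f.write('row 1\n')
with open(path, 'w') as f:
    f.write('row 2\n')
with open(path) as f:
    print(f.read())  # only 'row 2' survives the second open

# 'a' appends instead of truncating
with open(path, 'a') as f:
    f.write('row 3\n')
with open(path) as f:
    print(f.read())  # 'row 2' followed by 'row 3'
```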

A possible issue with this is that the information will be appended to the same file every time you run the code, and it will never be cleared unless you delete the file manually. A solution is to open the file in w mode, but outside the loop:

x = 0
# Open the file once, outside the loop, so 'w' only truncates on startup
with open('C:\\Users\\hay\\Houses.csv', 'w', encoding='UTF8', newline="") as f:
    thewriter = writer(f)

    # Write the header a single time
    header = ['URL', 'address', 'price', 'beds', 'baths', 'ber']
    thewriter.writerow(header)

    while x < 100:
        url = "https://xxxxx.com/" + str(x)

        page = requests.get(url)

        soup = BeautifulSoup(page.content, "html.parser")
        lists = soup.find_all('li', class_="SearchPage__Result-gg133s-2 djuMQD")

        # 'result' instead of 'list', to avoid shadowing the built-in
        for result in lists:
            url = result.find('a').attrs['href']
            address = result.find('p', class_="TitleBlock__Address-sc-1avkvav-8 dzihyY")
            price = result.find('div', class_="TitleBlock__Price-sc-1avkvav-4 hiFkJc")
            beds = result.find_all('p', class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
            baths = result.find('p', attrs={"data-testid": "baths"},
                                class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
            energyrating = result.find('div', class_="TitleBlock__BerContainer-sc-1avkvav-11 iXTpuT")

            info = [url, address, price, beds, baths, energyrating]
            thewriter.writerow(info)

        x += 20  # move to the next page offset: 0, 20, 40, 60, 80

This way, the file is still truncated every time you run the script, but not on every iteration of the loop.
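As a further cleanup, the manual `x += 20` bookkeeping (and the accidental character loop) can be replaced with `range`, which yields each page offset exactly once. A sketch, keeping the placeholder base URL from the question:

```python
# Hypothetical base URL standing in for the placeholder in the question
BASE = "https://xxxxx.com/"

# range(0, 100, 20) yields the page offsets 0, 20, 40, 60, 80, one per page
urls = [BASE + str(offset) for offset in range(0, 100, 20)]
print(urls)
```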
