
Python: crawl multiple URLs from a CSV and export to another CSV

I need to loop over URLs stored in a CSV file, extract phone numbers and ZIP codes from each page, and export them to another CSV.

Any help would be appreciated!

    # read csv with just one url per line
    with open('urls.csv') as file:
        start_urls = [line.strip() for line in file]

    def start_request(self):
        request = Request(url=self.start_urls, callback=self.parse)
        yield request

    def parse(self, response):
        html = response.body
        soup = BeautifulSoup(html, 'lxml')
        text = soup.get_text()

        phone = re.findall(r'\d{3}-\d{3}-\d{4}', html, re.MULTILINE)
        zipcode = re.findall(r'(?<=, [A-Z]{2} )\d{5}', html, re.MULTILINE)
        phn_1 = []
        zipcode_1 = []

You described your goal but didn't mention what part is currently not working.

You wrote this:

    def start_request(self):
        request = Request(url=self.start_urls, callback=self.parse)
        yield request

It isn't obvious that this does what you want. In particular, Request() expects a single URL, not a list. Loop over the URLs and yield one request each; note that the Scrapy hook is named start_requests, with an s:

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

I'm sure this expression works fine for you: [line.strip() for line in file]. But to emphasize that it is all about dealing with trailing newlines, it would be clearer to use

    line.rstrip()

instead of

    line.strip()
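For a URL-per-line file the two behave the same, but the difference shows up when a line has leading whitespace. A quick illustration, with a made-up line:

    line = "  https://example.com/page \n"

    # strip() removes whitespace at both ends
    print(repr(line.strip()))   # 'https://example.com/page'

    # rstrip() removes only the trailing whitespace/newline
    print(repr(line.rstrip())_)  # '  https://example.com/page'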

Thanks for the answer! I can loop now, but while looping I'm not able to get the phones and ZIPs so that I can produce a CSV with the data afterwards. Any help would be appreciated!
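One likely reason the extraction fails: in Python 3, response.body is bytes, and calling re.findall with a str pattern on bytes raises a TypeError. Run the regexes on the decoded text (the soup.get_text() result) instead, then write the pairs with the csv module. A minimal sketch — the sample text and the results.csv filename are made up for illustration:

    import csv
    import re

    # stand-in for soup.get_text(); in the spider, use the decoded text,
    # not response.body (bytes), or re.findall raises TypeError
    text = "Call us at 555-123-4567. Visit us in Austin, TX 78701."

    phones = re.findall(r'\d{3}-\d{3}-\d{4}', text)
    zips = re.findall(r'(?<=, [A-Z]{2} )\d{5}', text)

    # append one row per phone/ZIP pair (results.csv is an assumed output name)
    with open('results.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        for phone, zipcode in zip(phones, zips):
            writer.writerow([phone, zipcode])

    print(phones, zips)  # ['555-123-4567'] ['78701']

zip() pairs results positionally, which assumes each page yields phones and ZIPs in matching order; if a page can contain them unpaired, write the two lists separately instead.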
