
Scraping phone numbers and ZIP codes from URLs in a CSV to another CSV

I need to scrape a list of URLs stored in a CSV and export the results to another CSV. I must be making some mistake, because I can't get it to run. If anyone can help me, I'd appreciate it.

I'm very new to Python, and I also combined code from different sources, so I have trouble identifying where the problem is. I mixed one piece of code that imports a CSV with another that performs a string search.

    import scrapy
    from scrapy import Spider
    from scrapy import Request
    from scrapy.crawler import CrawlerProcess
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse
    import requests
    import pandas as pd
    from urllib.request import urlopen, Request  # note: this re-import shadows scrapy's Request
    from urllib.error import HTTPError
    import re
    import numpy as np
    import csv
    import json
    from http.client import BadStatusLine
    import ssl

The following is the code that I have so far.

    
    phn_1 = []
    zipcode_1 = []
    err_msg_zipcode = []
    err = []
    
    class Spider:
      name = 'spider'
    
        # read csv with just url per line
      with open('urls.csv') as file:
          start_urls = [line.strip() for line in file]
     
      def start_request(self):
          request = Request(url = self.start_urls, callback=self.parse)
          yield request
        
      def parse(self, response):
        
          s = response.body
          soup = BeautifulSoup(s, 'lxml')
          text = soup.get_text()
    
          df2=pd.DataFrame()
    
          phn_1 = []    #store all the extracted Phn numbers in a List
          mail_1 = []    #store all the extracted Zipcode in a List
          for row in df2.iterrows():  # Parse through each url in the list (note: df2 is created empty above, so it has no rows to iterate)
              try:
                  try:
                      req1 = Request(row[1]['URL'], headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'})
                      gcontext = ssl.SSLContext(ssl.PROTOCOL_SSLv23) # Bypass SSL certification verification
                      f = urlopen(req1, context=gcontext)
                      url_name = f.geturl() #extract URL name 
                      s = f.read().decode('utf-8', errors='ignore')  # decode bytes to str for the regexes below
                      phone = re.findall(r'\d{3}-\d{3}-\d{4}', s, re.MULTILINE)
                      zipcode = re.findall(r'(?<=, [A-Z]{2} )\d{5}', s, re.MULTILINE)
    
                      if len(phone) == 0:
                          print("No phone number found.")
                          err_msg_phn = "No phone number found."
                          phn_1.append((url_name, err_msg_phn))
                  
                      else:
                          count = 1
                          for item in phone:
                              phn_1.append((url_name,item))
                              count += 1
                          print(phn_1)
            
                      if len(zipcode) == 0:
                          print("No zipcode found.")
                          err_msg_zipcode = "No zipcode address found."
                          zipcode_1.append((url_name,err_msg_zipcode))
    
                      else:
                          count = 1
                          for item in zipcode:
                              mail_1.append((url_name,item))
                              count += 1
                          print(mail_1)
                   
                  except BadStatusLine: # Catch if invalid url names exist
                      print("could not fetch %s" % url_name)
    
              except HTTPError as err: # catch HTTP 404 not found error
                  if err.code == 404:
                      print("Received HTTPError on %s" % url_name)
                
    
    df_p = pd.DataFrame()
    df_m = pd.DataFrame()
    df_final = pd.DataFrame()
    
    df_p = pd.DataFrame(phn_1,columns=['URL','Phone_No']) # Dataframe for url and Phn number
    df_phn = df_p.drop_duplicates(subset=['URL', 'Phone_No'], keep='first') #remove duplicates
    
    df_m = pd.DataFrame(zipcode_1,columns=['URL','Zipcode']) # Dataframe for url and Zipcode
    df_mail = df_m.drop_duplicates(subset=['URL','Zipcode'], keep='first') #remove duplicates
    
    df_final = pd.merge(df_phn,df_mail, on = 'URL', how = 'inner') #Merge two dataframes on the common column
    #df_final.groupby(['URL'], as_index=False)
    df_final.to_csv('result_contact.csv', index=False, encoding='utf-8')
    
    #convert the csv output to json
    with open('result_contact.csv') as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    with open('result_contact.json', 'w') as jf:  # output filename assumed for illustration
        json.dump(rows, jf, indent=2)

Thank you!!!

One obvious mistake I see is here:

    request = Request(url = self.start_urls, callback=self.parse)

url should be a string, but you are sending a list. If you want to send multiple requests, you need to use a loop. As you are already setting start_urls and using the parse callback, you do not need to override start_requests; the default implementation should take care of it. (Note, too, that your override is named start_request, without the trailing s, so Scrapy would never call it anyway.)
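
To illustrate, here is a minimal sketch of how the spider could look with the override removed. The class name ContactSpider and the use of Scrapy's built-in CSV feed export are my own choices, not from your script; it reuses your regexes:

    import re
    import scrapy

    class ContactSpider(scrapy.Spider):
        name = 'contact'

        # One URL per line; Scrapy's default start_requests() issues a
        # Request for each entry in start_urls, calling parse on each result.
        with open('urls.csv') as f:
            start_urls = [line.strip() for line in f if line.strip()]

        # If you did want to override it, the loop version would be:
        # def start_requests(self):
        #     for url in self.start_urls:
        #         yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            text = response.text  # decoded str, so the regexes apply cleanly
            phones = re.findall(r'\d{3}-\d{3}-\d{4}', text)
            zipcodes = re.findall(r'(?<=, [A-Z]{2} )\d{5}', text)
            yield {
                'URL': response.url,
                'Phone_No': '; '.join(sorted(set(phones))) or 'No phone number found.',
                'Zipcode': '; '.join(sorted(set(zipcodes))) or 'No zipcode found.',
            }

Running it with scrapy runspider contact_spider.py -o result_contact.csv writes the yielded items straight to CSV, which also replaces the pandas merge and drop_duplicates steps at the end of your script.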

You may want to consider setting the start_urls in the __init__ method.
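
For instance (the urls_file parameter name here is my invention; Scrapy passes -a command-line arguments through to __init__):

    import scrapy

    class ContactSpider(scrapy.Spider):
        name = 'contact'

        def __init__(self, urls_file='urls.csv', *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Build start_urls at instantiation time, so the CSV path
            # can be supplied per run instead of being hard-coded.
            with open(urls_file) as f:
                self.start_urls = [line.strip() for line in f if line.strip()]

That lets you pick the input file at run time, e.g. scrapy runspider contact_spider.py -a urls_file=other_urls.csv -o result_contact.csv.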
