Local HTML File Scraping with Urllib and BeautifulSoup

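I'm very new to Python and have spent two weeks writing the following code from scratch to scrape local files. That is probably close to a hundred hours of learning as much as I can about Python, versioning, and importing packages such as lxml, bs4, requests, urllib, os, glob and more.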

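I'm hopelessly stuck on the first part: getting 12,000 strangely named HTML files, all in one directory, to load and parse with BeautifulSoup. I want to get all of this data into a csv file, or just printed as output so I can copy it into a file via the clipboard.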

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

#THIS LOCAL FILE WORKS PERFECTLY. I HAVE 12,000 HTML FILES IN THIS DIRECTORY TO PROCESS.  HOW?
#my_url = 'file://127.0.0.1/C:\\My Web Sites\\BioFachURLS\\www.organic-bio.com\\en\\company\\1-SUNRISE-FARMS.html'
my_url = 'http://www.organic-bio.com/en/company/23694-MARTHOMI-ALLERGY-FREE-FOODS-GMBH'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# grabs each field
contactname = page_soup.findAll("td", {"itemprop": "name"})
contactstreetaddress = page_soup.findAll("td", {"itemprop": "streetAddress"})
contactpostalcode = page_soup.findAll("td", {"itemprop": "postalCode"})
contactaddressregion = page_soup.findAll("td", {"itemprop": "addressRegion"})
contactaddresscountry = page_soup.findAll("td", {"itemprop": "addressCountry"})
contactfax = page_soup.findAll("td", {"itemprop": "faxNumber"})
contactemail = page_soup.findAll("td", {"itemprop": "email"})
contactphone = page_soup.findAll("td", {"itemprop": "telephone"})
contacturl = page_soup.findAll("a", {"itemprop": "url"})

#Outputs as text without tags
Company = contactname[0].text
Address = contactstreetaddress[0].text
Zip = contactpostalcode[0].text
Region = contactaddressregion[0].text
Country = contactaddresscountry[0].text
Fax = contactfax[0].text
Email = contactemail[0].text
Phone = contactphone[0].text
URL = contacturl[0].text

#Prints with comma delimiters

print(Company + ', ' + Address + ', ' + Zip + ', ' + Region + ', ' + Country + ', ' + Fax + ', ' + Email + ', ' + Phone + ', ' + URL)

I've dealt with folders containing piles of files before, so I can offer some help.

We'll start with a for loop over the files in the folder.

import os
from bs4 import BeautifulSoup

phone = []  # A list to store all the phone numbers
path = 'yourpath'  # This is your folder name which stores all your html
# Be careful: you might need to put a full path such as C:\Users\Niche\Desktop\htmlfolder
# os.walk visits path itself and every subfolder below it (for example A-H)
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        # Skip anything that is not an html file
        if not filename.endswith('.html'):
            continue
        # Getting the full path of a particular html file
        fullpath = os.path.join(dirpath, filename)
        # Then we will run beautifulsoup to extract the contents
        # Opening in binary mode lets BeautifulSoup detect the encoding itself
        with open(fullpath, 'rb') as f:
            page_soup = BeautifulSoup(f, 'html.parser')
        # Then run your code
        # grabs each field
        contactname = page_soup.findAll("td", {"itemprop": "name"})
        contactstreetaddress = page_soup.findAll("td", {"itemprop": "streetAddress"})
        contactpostalcode = page_soup.findAll("td", {"itemprop": "postalCode"})
        contactaddressregion = page_soup.findAll("td", {"itemprop": "addressRegion"})
        contactaddresscountry = page_soup.findAll("td", {"itemprop": "addressCountry"})
        contactfax = page_soup.findAll("td", {"itemprop": "faxNumber"})
        contactemail = page_soup.findAll("td", {"itemprop": "email"})
        contactphone = page_soup.findAll("td", {"itemprop": "telephone"})
        contacturl = page_soup.findAll("a", {"itemprop": "url"})

        # Outputs as text without tags
        # Note: [0] raises IndexError if a page is missing that field
        Company = contactname[0].text
        Address = contactstreetaddress[0].text
        Zip = contactpostalcode[0].text
        Region = contactaddressregion[0].text
        Country = contactaddresscountry[0].text
        Fax = contactfax[0].text
        Email = contactemail[0].text
        Phone = contactphone[0].text
        URL = contacturl[0].text
        # Here you might want to consider using a dictionary or a list
        # For example, append Phone to the list called phone
        phone.append(Phone)

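The code is a bit messy, but it walks through every possible folder (even if your main folder contains other folders), looks for files with the .html extension, and opens them.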
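The folder walk described above can also be sketched with `pathlib`. This is only a minimal sketch, not the code from the answer: the function name `find_html_files` is hypothetical, and it assumes every file of interest ends in `.html`.

```python
from pathlib import Path


def find_html_files(root):
    """Return every .html file under root, including subfolders."""
    root = Path(root)
    # rglob descends into subfolders; is_file() filters out any
    # directory that happens to have a name ending in ".html".
    return sorted(p for p in root.rglob("*.html") if p.is_file())
```

Calling `find_html_files('yourpath')` should give a list of `Path` objects that can each be opened and passed to BeautifulSoup in turn.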

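I'd suggest using a dictionary with the company as the key, since I assume the company names are all different. A set of lists works well too, because your values will stay in matching order. I'm not good with dictionaries, so I can't give you more advice than that. I hope I answered your question.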
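The dictionary idea, combined with the csv output asked for in the question, could look roughly like the sketch below. It assumes company names are unique and that each scraped page has already been turned into a plain dict of field values; the helper names `add_record` and `write_csv` and the sample values are hypothetical placeholders, not data from the site.

```python
import csv
import io

# Column order follows the variable names used in the question.
FIELDS = ["Company", "Address", "Zip", "Region", "Country",
          "Fax", "Email", "Phone", "URL"]


def add_record(companies, record):
    """Store one scraped record in the dict, keyed by company name."""
    # Assumption: names are unique; a repeated name overwrites the earlier entry.
    companies[record["Company"]] = record


def write_csv(companies, out_file):
    """Write all records to an open text file as CSV with a header row."""
    # restval="" leaves a blank cell for any field a record is missing.
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, restval="")
    writer.writeheader()
    for record in companies.values():
        writer.writerow(record)


# Usage with made-up placeholder values:
companies = {}
add_record(companies, {"Company": "Example Farm, Ltd.", "Phone": "555-0100"})
buffer = io.StringIO()
write_csv(companies, buffer)
print(buffer.getvalue())
```

Using the `csv` module rather than joining strings with `', '` means values that contain commas, such as street addresses, get quoted instead of splitting into extra columns. To write a real file, something like `open('companies.csv', 'w', newline='', encoding='utf-8')` can be passed in place of the `StringIO` buffer.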

P.S. Sorry for the messy code.

Edit: fixed by replacing lxml with html.parser
