Local HTML File Scraping with Urllib and BeautifulSoup
I am very new to Python and have spent the last two weeks writing the code below from scratch to scrape local files. In roughly a hundred hours I have learned as much as I could about Python, versioning, and importing packages such as lxml, bs4, requests, urllib, os, glob, and so on.
I am hopelessly stuck on the first part: getting 12,000 oddly named HTML files in one directory to load and parse with BeautifulSoup. I want to get all of that data into a CSV file, or just printed so that I can copy it into a file with the clipboard.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#THIS LOCAL FILE WORKS PERFECTLY. I HAVE 12,000 HTML FILES IN THIS DIRECTORY TO PROCESS. HOW?
#my_url = 'file://127.0.0.1/C:\\My Web Sites\\BioFachURLS\\www.organic-bio.com\\en\\company\\1-SUNRISE-FARMS.html'
my_url = 'http://www.organic-bio.com/en/company/23694-MARTHOMI-ALLERGY-FREE-FOODS-GMBH'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each field
contactname = page_soup.findAll("td", {"itemprop": "name"})
contactstreetaddress = page_soup.findAll("td", {"itemprop": "streetAddress"})
contactpostalcode = page_soup.findAll("td", {"itemprop": "postalCode"})
contactaddressregion = page_soup.findAll("td", {"itemprop": "addressRegion"})
contactaddresscountry = page_soup.findAll("td", {"itemprop": "addressCountry"})
contactfax = page_soup.findAll("td", {"itemprop": "faxNumber"})
contactemail = page_soup.findAll("td", {"itemprop": "email"})
contactphone = page_soup.findAll("td", {"itemprop": "telephone"})
contacturl = page_soup.findAll("a", {"itemprop": "url"})
#Outputs as text without tags
Company = contactname[0].text
Address = contactstreetaddress[0].text
Zip = contactpostalcode[0].text
Region = contactaddressregion[0].text
Country = contactaddresscountry[0].text
Fax = contactfax[0].text
Email = contactemail[0].text
Phone = contactphone[0].text
URL = contacturl[0].text
#Prints with comma delimiters
print(Company + ', ' + Address + ', ' + Zip + ', ' + Region + ', ' + Country + ', ' + Fax + ', ' + Email + ', ' + Phone + ', ' + URL)
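For local files the commented-out `file://` URL is not needed at all: `open()` can feed BeautifulSoup directly. A minimal sketch; the sample file created here is a stand-in for one of the real 12,000 pages:

```python
from bs4 import BeautifulSoup

# Stand-in for one of the real files, e.g.
# r'C:\My Web Sites\BioFachURLS\www.organic-bio.com\en\company\1-SUNRISE-FARMS.html'
local_path = 'sample.html'
with open(local_path, 'w', encoding='utf-8') as f:
    f.write('<table><tr><td itemprop="name">SUNRISE FARMS</td></tr></table>')

# Open the local file directly -- no urlopen() or file:// URL needed
with open(local_path, encoding='utf-8') as f:
    page_soup = BeautifulSoup(f, 'html.parser')

print(page_soup.find_all('td', {'itemprop': 'name'})[0].text)  # -> SUNRISE FARMS
```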
I have worked with folders full of files before, so I can offer some help.
We will start with a for loop over the files in the folder:
import os
from bs4 import BeautifulSoup

phone = []  # A list to store all the phone numbers
path = 'yourpath'  # This is your folder name which stores all your html
# Be careful: you might need to put a full path such as C:\Users\Niche\Desktop\htmlfolder
for filename in os.listdir(path):  # Read entries from your path
    # Here we are building the full pathname of each entry
    subpath = os.path.join(path, filename)
    # If the entry is itself a folder (e.g. A-H subfolders), descend into it
    if os.path.isdir(subpath):
        fullpaths = [os.path.join(subpath, name) for name in os.listdir(subpath)]
    else:
        fullpaths = [subpath]
    for fullpath in fullpaths:
        # If it is not an HTML file, skip it
        if not fullpath.endswith('.html'):
            continue
        # Then we will run BeautifulSoup to extract the contents
        with open(fullpath, encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
        # Then run your code
        # grabs each field
        contactname = soup.findAll("td", {"itemprop": "name"})
        contactstreetaddress = soup.findAll("td", {"itemprop": "streetAddress"})
        contactpostalcode = soup.findAll("td", {"itemprop": "postalCode"})
        contactaddressregion = soup.findAll("td", {"itemprop": "addressRegion"})
        contactaddresscountry = soup.findAll("td", {"itemprop": "addressCountry"})
        contactfax = soup.findAll("td", {"itemprop": "faxNumber"})
        contactemail = soup.findAll("td", {"itemprop": "email"})
        contactphone = soup.findAll("td", {"itemprop": "telephone"})
        contacturl = soup.findAll("a", {"itemprop": "url"})
        # Outputs as text without tags
        Company = contactname[0].text
        Address = contactstreetaddress[0].text
        Zip = contactpostalcode[0].text
        Region = contactaddressregion[0].text
        Country = contactaddresscountry[0].text
        Fax = contactfax[0].text
        Email = contactemail[0].text
        Phone = contactphone[0].text
        URL = contacturl[0].text
        # Here you might want to consider using a dictionary or a list
        # For example, append Phone to the list called phone
        phone.append(Phone)
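Since the end goal is a CSV file, the values extracted per file can be collected as row dictionaries inside the loop and written out once at the end with the standard csv module. A minimal sketch; the sample rows, field names, and output filename here are hypothetical:

```python
import csv

# In the real loop you would do rows.append({'Company': Company, 'Phone': Phone, ...})
# once per file; these two rows are hypothetical sample data.
rows = [
    {'Company': 'SUNRISE FARMS', 'Phone': '+1 555 0100'},
    {'Company': 'MARTHOMI ALLERGY FREE FOODS GMBH', 'Phone': '+49 555 0200'},
]

with open('companies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['Company', 'Phone'])
    writer.writeheader()   # header row: Company,Phone
    writer.writerows(rows)  # one line per company
```

DictWriter also takes care of quoting, so addresses containing commas will not break the column layout the way manual `print(... + ', ' + ...)` does.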
The code is a bit messy, but it walks through every possible folder (even if there are other folders inside your main folder), then checks for the .html extension before opening the file.
I would suggest using a dictionary with the company as the key, since I assume the company names are distinct. A bunch of lists also works, because your values will stay in matching order. I am not good with dictionaries, so I cannot give you more advice there. I hope this answers your question.
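That dictionary idea can be sketched like this, with the company name as the key so each record is easy to look up later; the sample values stand in for what the loop would extract:

```python
# Hypothetical values as produced by one pass through the loop above
Company, Phone, Email = 'SUNRISE FARMS', '+1 555 0100', 'info@example.com'

companies = {}
# Keyed by company name: one nested dict of fields per company
companies[Company] = {'phone': Phone, 'email': Email}

print(companies['SUNRISE FARMS']['phone'])  # -> +1 555 0100
```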
PS: sorry for the messy code.
Edit: fixed by replacing lxml with html.parser.