Webpage values are missing while scraping data using BeautifulSoup python 3.6
Missing values in certain cells when scraping data from website using Beautifulsoup in Python and placing it in Pandas DataFrame
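I have scraped data from a website using Beautifulsoup, and I want to put it into a Pandas DataFrame and then write it to a file. Most of the data is written to the file as expected, but some cells are missing values. For example, the first row of the Phone Number column is missing a value, and rows 39, 45 and 75 of the Postal Code column are missing values. I am not sure why.
Here is my code: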
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
page = urlopen(schools)
soup = BeautifulSoup(page,features="html.parser")
table_ = soup.find('table')
Name=[]
Address=[]
PostalCode=[]
Phone=[]
Grades=[]
Website=[]
City=[]
Province=[]
for row in table_.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==6:
        Name.append(cells[1].find(text=True))
        Address.append(cells[4].find(text=True))
        PostalCode.append(cells[4].find(text=True).next_element.getText())
        Phone.append(cells[5].find(text=True).replace('T: ',''))
        Grades.append(cells[2].find(text=True))
        Website.append('https://www.winnipegsd.ca'+cells[1].findAll('a')[0]['href'])
df = pd.DataFrame(Name,columns=['Name'])
df['Street Address']=Address
df['Postal Code']=PostalCode
df['Phone Number']=Phone
df['Grades']=Grades
df['Website']=Website
df.to_csv("file.tsv", sep = "\t",index=False)
Try pd.read_html() to extract the data from the table. Then you can do basic .str operations:
import requests
import pandas as pd
from bs4 import BeautifulSoup
schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
soup = BeautifulSoup(requests.get(schools).content, "html.parser")
df = pd.read_html(str(soup))[0]
df = df.dropna(how="all", axis=0).drop(columns=["Unnamed: 0", "Unnamed: 3"])
df["Contact"] = (
    df["Contact"]
    .str.replace(r"T:\s*", "", regex=True)
    .str.replace("School Contact Information", "")
    .str.strip()
)
df["Postal Code"] = df["Address"].str.extract(r"(.{3} .{3})$")
df["Website"] = [
    f'https://www.winnipegsd.ca{a["href"]}'
    if "http" not in a["href"]
    else a["href"]
    for a in soup.select("tbody td:nth-child(2) a")
]
print(df.head(10))
df.to_csv("data.csv", index=False)
Prints:
    School Name                           Grades  Address                     Contact       Postal Code  Website
0   Adolescent Parent Centre              9-12    136 Cecil St. R3E 2Y9       204-775-5440  R3E 2Y9      https://www.winnipegsd.ca/AdolescentParentCentre/
1   Andrew Mynarski V.C. School           7-9     1111 Machray Ave. R2X 1H6   204-586-8497  R2X 1H6      https://www.winnipegsd.ca/AndrewMynarski/
2   Argyle Alternative High School        10-12   30 Argyle St. R3B 0H4       204-942-4326  R3B 0H4      https://www.winnipegsd.ca/Argyle/
3   Brock Corydon School                  N-6     1510 Corydon Ave. R3N 0J6   204-488-4422  R3N 0J6      https://www.winnipegsd.ca/BrockCorydon/
4   Carpathia School                      N-6     300 Carpathia Rd. R3N 1T3   204-488-4514  R3N 1T3      https://www.winnipegsd.ca/Carpathia/
5   Champlain School                      N-6     275 Church Ave. R2W 1B9     204-586-5139  R2W 1B9      https://www.winnipegsd.ca/Champlain/
6   Children of the Earth High School     9-12    100 Salter St. R2W 5M1      204-589-6383  R2W 5M1      https://www.winnipegsd.ca/ChildrenOfTheEarth/
7   Collège Churchill High School         7-12    510 Hay St. R3L 2L6         204-474-1301  R3L 2L6      https://www.winnipegsd.ca/Churchill/
8   Clifton School                        N-6     1070 Clifton St. R3E 2T7    204-783-7792  R3E 2T7      https://www.winnipegsd.ca/Clifton/
10  Daniel McIntyre Collegiate Institute  9-12    720 Alverstone St. R3E 2H1  204-783-7131  R3E 2H1      https://www.winnipegsd.ca/DanielMcintyre/
and saves data.csv (screenshot from LibreOffice):
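To see what the .str steps do without requesting the page, here is a minimal sketch on two hand-built rows. The raw Contact strings are an assumption, pieced together from the text fragments printed for the first two phone cells in this thread; expand=False is added here only so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Two hand-built rows standing in for what pd.read_html() might return.
# The Contact strings are assumed, including the non-breaking spaces (\xa0).
df = pd.DataFrame(
    {
        "Address": ["136 Cecil St. R3E 2Y9", "1111 Machray Ave. R2X 1H6"],
        "Contact": [
            "T:\xa0204-775-5440\xa0\xa0School Contact Information",
            "T: 204-586-8497\xa0\xa0School Contact Information\xa0",
        ],
    }
)

df["Contact"] = (
    df["Contact"]
    .str.replace(r"T:\s*", "", regex=True)  # \s also matches \xa0
    .str.replace("School Contact Information", "", regex=False)
    .str.strip()  # strips ordinary and non-breaking spaces
)

# The last "3 chars, space, 3 chars" of the address is taken as the postal code.
df["Postal Code"] = df["Address"].str.extract(r"(.{3} .{3})$", expand=False)

print(df)
```

Note that the `$` anchor means an address with trailing whitespace would not match, so stripping the Address column first may be worthwhile on messier data.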
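You are getting some missing data values because they do not exist in the original/source HTML DOM/table. So if you don't check for them, you will get a NoneType error and the program will break, but you can easily fix this by guarding each lookup with an if else None statement. The following code should work.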
import requests
from bs4 import BeautifulSoup
import pandas as pd
schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
page = requests.get(schools).text
soup = BeautifulSoup(page,"html.parser")
data =[]
for row in soup.table.find_all('tr'):
    Name = row.select_one('td.ms-rteTableOddCol-6:nth-child(2)')
    Name = Name.a.text if Name else None
    #print(Name)
    Address= row.select_one('td.ms-rteTableEvenCol-6:nth-child(5)')
    Address = Address.get_text() if Address else None
    #print(Address)
    PostalCode=row.select_one('td.ms-rteTableEvenCol-6:nth-child(5)')
    PostalCode = PostalCode.get_text().split('.')[-1] if PostalCode else None
    #print(PostalCode)
    Phone = row.select_one('td.ms-rteTableOddCol-6:nth-child(6)')
    Phone = Phone.get_text().split('School')[-2].replace('T:','') if Phone else None
    #print(Phone)
    Grades= row.select_one('td.ms-rteTableEvenCol-6:nth-child(3)')
    Grades = Grades.get_text() if Grades else None
    #print(Grades)
    Website= row.select_one('td.ms-rteTableOddCol-6:nth-child(2)')
    Website= 'https://www.winnipegsd.ca'+ Website.a.get('href') if Website else None
    #print(Website)
    data.append({
        'Name':Name,
        'Address':Address,
        'PostalCode':PostalCode,
        'Phone':Phone,
        'Grades':Grades,
        'Website':Website
    })
df=pd.DataFrame(data).dropna(how='all')
print(df)
#df.to_csv("file.tsv", sep = "\t",index=False)
Output:
Name ... Website
1 Adolescent Parent Centre ... https://www.winnipegsd.ca/AdolescentParentCentre/
2 Andrew Mynarski V.C. School ... https://www.winnipegsd.ca/AndrewMynarski/
3 Argyle Alternative High School ... https://www.winnipegsd.ca/Argyle/
4 Brock Corydon School ... https://www.winnipegsd.ca/BrockCorydon/
5 Carpathia School ... https://www.winnipegsd.ca/Carpathia/
.. ... ... ...
84 Weston School ... https://www.winnipegsd.ca/Weston/
85 William Whyte School ... https://www.winnipegsd.ca/WilliamWhyte/
86 Winnipeg Adult Education Centre ... https://www.winnipegsd.ca/WinnipegAdultEdCentre/
87 Wolseley School ... https://www.winnipegsd.ca/Wolseley/
88 WSD Virtual School ... https://www.winnipegsd.ca/Virtual/
[79 rows x 6 columns]
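The if else None guard can be tried in isolation on a tiny table. The HTML below is a simplified stand-in, not the real page markup (the class names used above are left out), and the column positions are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the directory table: a header row (th only),
# one complete row, and one row whose cells are empty.
html = """
<table>
  <tr><th>School Name</th><th>Contact</th></tr>
  <tr><td><a href="/Argyle/">Argyle Alternative High School</a></td><td>T: 204-942-4326</td></tr>
  <tr><td></td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for row in soup.table.find_all("tr"):
    # select_one returns None when nothing matches, so every lookup is guarded.
    link = row.select_one("td:nth-child(1) a")
    phone = row.select_one("td:nth-child(2)")
    rows.append(
        {
            "Name": link.get_text(strip=True) if link else None,
            "Phone": phone.get_text(strip=True).replace("T:", "").strip() if phone else None,
        }
    )

for r in rows:
    print(r)
```

The header row yields None for both fields instead of raising AttributeError, which is what lets a later dropna(how='all') discard it.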
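@AndrejKesely's answer is definitely a more Pythonic way to handle this situation, but you mentioned in the comments that you are still interested in why your original approach produced missing values. Rightly so: this is where learning to code should start, by trying to understand why the code fails before moving on to a refactored solution.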
1. Phone numbers
Let's do some printing:
for row in table_.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==6:
        # ...
        # Phone.append(cells[5].find(text=True).replace('T: ',''))
        # ...
        print(cells[5].findAll(text=True))
['T:\xa0', '204-775-5440', '\xa0\xa0', 'School Contact Information']
['T: 204-586-8497', '\xa0\xa0', 'School Contact Information', '\xa0']
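The problem here is an inconsistency in the page source. Open Chrome DevTools with Ctrl + Shift + J, right-click on any of the phone numbers, and select inspect. You will land in the "Elements" tab and see how the html is set up. For example, the first two numbers: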
ph_no1 = """
<div>
    <span>T: </span>
    <span lang="EN">204-775-5440
        <span> </span>
    </span>
</div>
<div> ... School Contact Information </div>
"""
ph_no2 = """
<div>
    <span lang="FR-CA">T: 204-586-8497
        <span> </span>
    </span>
</div>
<div> ... School Contact Information </div>
"""
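The findAll print shown earlier gives you the text of each span in sequence. I have only shown the first two here, but that is enough to show why you end up with different data. So, the problem with the first number entry is that cells[5].find(text=True).replace('T: ','') only gives us the first text fragment, which in the case of ph_no1 is 'T:\xa0'. That fragment ends in a non-breaking space (\xa0) rather than the ordinary space in 'T: ', so replace finds nothing to remove; for more on why replace cannot handle this, see e.g. this SO post.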
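A quick way to confirm this without the website is to run the same replace call on the two first fragments printed above. This is only an illustration using plain strings; the variable names are made up for the sketch.

```python
# First text fragment of each of the first two phone cells (from the print above).
first_fragment_1 = "T:\xa0"             # "T:" followed by a non-breaking space
first_fragment_2 = "T: 204-586-8497"    # "T:" followed by an ordinary space

# The original cleanup looks for "T:" plus an ordinary space.
result_1 = first_fragment_1.replace("T: ", "")
result_2 = first_fragment_2.replace("T: ", "")

print(repr(result_1))  # unchanged, because \xa0 is not the same character as " "
print(repr(result_2))  # the phone number

# Normalizing \xa0 to a regular space first makes the replace match,
# but the result is empty: the number lives in the next text fragment.
normalized = first_fragment_1.replace("\xa0", " ").replace("T: ", "")
print(repr(normalized))
```

So even with the space normalized, find(text=True) alone cannot recover the first number, since it never sees the second fragment.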
As it happens, several phone numbers are problematic:
df['Phone Number'][df['Phone Number']\
.str.extract(r'(\d{3}-\d{3}-\d{4})')[0]\
.to_numpy()!=df['Phone Number'].to_numpy()]
0 T:
32 204-783-9012 # 2 extra spaces
33 204-474-1492 # 2 extra spaces
38 204-452-5015 # 2 extra spaces
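Suggested solution for the phone numbers: instead of your code, get all of the cell's text and use re.search to extract the part that matches a regex pattern for the number: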
import re
Phone.append(re.search(r'(\d{3}-\d{3}-\d{4})',cells[5].get_text()).group())
# e.g. \d{3}- means 3 digits followed by "-" etc.
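One caveat: re.search returns None when nothing matches, so calling .group() directly would raise AttributeError on a cell without a phone number. A guarded variant might look like the sketch below; extract_phone is a hypothetical helper name, and the sample strings are assumed approximations of what cells[5].get_text() returns.

```python
import re

PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


def extract_phone(cell_text):
    """Return the first NNN-NNN-NNNN match in cell_text, or None if there is none."""
    match = PHONE_RE.search(cell_text)
    return match.group() if match else None


# Assumed cell texts; the last one has no phone number at all.
samples = [
    "T:\xa0204-775-5440\xa0\xa0School Contact Information",
    "T: 204-586-8497\xa0\xa0School Contact Information\xa0",
    "School Contact Information",
]
phones = [extract_phone(s) for s in samples]
print(phones)
```

In the loop, this would be used as Phone.append(extract_phone(cells[5].get_text())).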
2. Postal codes
The problem here is basically the same. Here is an irregular postal code (the 39th entry), followed by a "regular" one:
pc_error = """
<div>
    <span>290 Lilac St. </span>
    <br>R3M 2T5
</div>
"""
regular_pc = """
<div>
    <span>960 Wolseley Ave. </span>
</div>
<div>
    <span>R3G 1E7
    </span>
</div>
"""
You wrote:
Address.append(cells[4].find(text=True))
PostalCode.append(cells[4].find(text=True).next_element.getText())
But as you can see above, it turns out that the first example does not really have a next_element that contains the postal code. Now, if you try:
print(len(cells[4].findAll(text=True)))
you will find that, regardless of the elements involved, the entire text of each cell is in fact captured as a list of two strings (['address','postal code']). For example:
['511 Clifton St.\xa0', 'R3G 2X3']
['136 Cecil St.\xa0', 'R3E 2Y9']
So, in this particular case, we can simply write:
Address.append(cells[4].findAll(text=True)[0].strip()) # 1st elem and strip
PostalCode.append(cells[4].findAll(text=True)[1].strip()) # 2nd elem and strip
(or again use .get_text() with a regex pattern, as @AndrejKesely did).
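As a rough sketch of the regex route, the whole cell text can be split into street address and postal code in one step. The pattern below assumes Canadian-style codes (letter, digit, letter, optional space, digit, letter, digit); split_address is a hypothetical helper, and the sample strings are assumed approximations of cells[4].get_text().

```python
import re

# Assumed shape of the postal codes: e.g. "R3M 2T5".
POSTAL_RE = re.compile(r"[A-Z]\d[A-Z]\s?\d[A-Z]\d")


def split_address(cell_text):
    """Split a cell's text into (street address, postal code or None)."""
    text = cell_text.replace("\xa0", " ").strip()
    match = POSTAL_RE.search(text)
    if match is None:
        return text, None
    return text[: match.start()].strip(), match.group()


# Assumed cell texts for the irregular entry and two regular ones.
cells_text = [
    "290 Lilac St.\xa0R3M 2T5",
    "960 Wolseley Ave.\xa0R3G 1E7",
    "136 Cecil St.\xa0R3E 2Y9",
]
parsed = [split_address(t) for t in cells_text]
for street, postal_code in parsed:
    print(street, "|", postal_code)
```

Because it works on the flattened text, this does not depend on whether the postal code sits after a br tag or inside its own div.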
Hope this helps clear things up, and suggests a few ways to spot unexpected behaviour (print is always a good friend).