Missing values in certain cells when scraping data from website using Beautifulsoup in Python and placing it in Pandas DataFrame

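I have scraped data from a website using Beautifulsoup, and I want to place it into a Pandas DataFrame and then write it to a file. Most of the data is written to the file as expected, but some cells are missing values. For example, the first row of the Phone Number column is missing a value, and rows 39, 45 and 75 of the Postal Code column are missing values. I'm not sure why.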

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
page = urlopen(schools)

soup = BeautifulSoup(page,features="html.parser")

table_ = soup.find('table')

Name=[]
Address=[]
PostalCode=[]
Phone=[]
Grades=[]
Website=[]
City=[]
Province=[]

for row in table_.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==6:
        Name.append(cells[1].find(text=True))
        Address.append(cells[4].find(text=True))
        PostalCode.append(cells[4].find(text=True).next_element.getText())
        Phone.append(cells[5].find(text=True).replace('T: ',''))
        Grades.append(cells[2].find(text=True))
        Website.append('https://www.winnipegsd.ca'+cells[1].findAll('a')[0]['href'])


df = pd.DataFrame(Name,columns=['Name'])
df['Street Address']=Address
df['Postal Code']=PostalCode
df['Phone Number']=Phone
df['Grades']=Grades
df['Website']=Website

df.to_csv("file.tsv", sep = "\t",index=False)

Try pd.read_html() to extract the data from the table. Then you can do basic .str operations:

import requests
import pandas as pd
from bs4 import BeautifulSoup


schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
soup = BeautifulSoup(requests.get(schools).content, "html.parser")

df = pd.read_html(str(soup))[0]
df = df.dropna(how="all", axis=0).drop(columns=["Unnamed: 0", "Unnamed: 3"])
df["Contact"] = (
    df["Contact"]
    .str.replace(r"T:\s*", "", regex=True)
    .str.replace("School Contact Information", "")
    .str.strip()
)
df["Postal Code"] = df["Address"].str.extract(r"(.{3} .{3})$")
df["Website"] = [
    f'https://www.winnipegsd.ca{a["href"]}'
    if "http" not in a["href"]
    else a["href"]
    for a in soup.select("tbody td:nth-child(2) a")
]

print(df.head(10))
df.to_csv("data.csv", index=False)

Prints:

                             School Name Grades                     Address       Contact Postal Code                                            Website
0               Adolescent Parent Centre   9-12       136 Cecil St. R3E 2Y9  204-775-5440     R3E 2Y9  https://www.winnipegsd.ca/AdolescentParentCentre/
1            Andrew Mynarski V.C. School    7-9   1111 Machray Ave. R2X 1H6  204-586-8497     R2X 1H6          https://www.winnipegsd.ca/AndrewMynarski/
2         Argyle Alternative High School  10-12       30 Argyle St. R3B 0H4  204-942-4326     R3B 0H4                  https://www.winnipegsd.ca/Argyle/
3                   Brock Corydon School    N-6   1510 Corydon Ave. R3N 0J6  204-488-4422     R3N 0J6            https://www.winnipegsd.ca/BrockCorydon/
4                       Carpathia School    N-6   300 Carpathia Rd. R3N 1T3  204-488-4514     R3N 1T3               https://www.winnipegsd.ca/Carpathia/
5                       Champlain School    N-6     275 Church Ave. R2W 1B9  204-586-5139     R2W 1B9               https://www.winnipegsd.ca/Champlain/
6      Children of the Earth High School   9-12      100 Salter St. R2W 5M1  204-589-6383     R2W 5M1      https://www.winnipegsd.ca/ChildrenOfTheEarth/
7          Collège Churchill High School   7-12         510 Hay St. R3L 2L6  204-474-1301     R3L 2L6               https://www.winnipegsd.ca/Churchill/
8                         Clifton School    N-6    1070 Clifton St. R3E 2T7  204-783-7792     R3E 2T7                 https://www.winnipegsd.ca/Clifton/
10  Daniel McIntyre Collegiate Institute   9-12  720 Alverstone St. R3E 2H1  204-783-7131     R3E 2H1          https://www.winnipegsd.ca/DanielMcintyre/

and saves data.csv (screenshot from LibreOffice):

[image: data.csv opened in LibreOffice]
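The postal code column in this answer comes from a pattern anchored at the end of the address string. The sketch below reproduces that pattern with the standard re module so its behaviour can be checked without fetching the page. The helper names and the stricter pattern are assumptions added for illustration and are not part of the original answer; the stricter pattern assumes codes shaped like "R3E 2Y9" (letter, digit, letter, space, digit, letter, digit).

```python
import re

# Pattern from the answer: any 3 characters, a space, any 3 characters, at the end.
LOOSE = re.compile(r"(.{3} .{3})$")

# Hypothetical stricter variant (an assumption, not from the answer).
STRICT = re.compile(r"([A-Z]\d[A-Z] \d[A-Z]\d)$")


def trailing_code(address, pattern):
    """Return the text captured at the end of `address`, or None if nothing matches."""
    match = pattern.search(address)
    return match.group(1) if match else None


print(trailing_code("136 Cecil St. R3E 2Y9", LOOSE))   # R3E 2Y9
print(trailing_code("136 Cecil St. R3E 2Y9", STRICT))  # R3E 2Y9

# An address without a postal code still satisfies the loose pattern.
print(trailing_code("136 Cecil St.", LOOSE))   # cil St.
print(trailing_code("136 Cecil St.", STRICT))  # None
```

The loose pattern is enough when every address ends with a postal code, as in the rows printed above. The stricter one may be worth considering if some rows could lack a code, since it yields None instead of the tail of the street name.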

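Some data values are missing because they do not exist in the original source HTML table. If you don't check for that, you will get a NoneType error and the program will break, but you can easily fix this with an if ... else None expression. The following code should work.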

import requests
from bs4 import BeautifulSoup
import pandas as pd

schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
page = requests.get(schools).text

soup = BeautifulSoup(page,"html.parser")
data =[]
for row in soup.table.find_all('tr'):
    Name = row.select_one('td.ms-rteTableOddCol-6:nth-child(2)')
    Name = Name.a.text if Name else None
    #print(Name)
    Address= row.select_one('td.ms-rteTableEvenCol-6:nth-child(5)')
    Address = Address.get_text() if Address else None 
    #print(Address)
    PostalCode=row.select_one('td.ms-rteTableEvenCol-6:nth-child(5)')
    PostalCode = PostalCode.get_text().split('.')[-1] if PostalCode else None
    #print(PostalCode)
    Phone = row.select_one('td.ms-rteTableOddCol-6:nth-child(6)')
    Phone = Phone.get_text().split('School')[-2].replace('T:','') if Phone else None
    #print(Phone)
    Grades= row.select_one('td.ms-rteTableEvenCol-6:nth-child(3)')
    Grades = Grades.get_text() if Grades else None
    #print(Grades)
    Website= row.select_one('td.ms-rteTableOddCol-6:nth-child(2)')
    Website= 'https://www.winnipegsd.ca'+ Website.a.get('href') if Website else None
    #print(Website)
    data.append({
        'Name':Name,
        'Address':Address,
        'PostalCode':PostalCode,
        'Phone':Phone,
        'Grades':Grades,
        'Website':Website
        })

df=pd.DataFrame(data).dropna(how='all')
print(df)

#df.to_csv("file.tsv", sep = "\t",index=False)

Output:

          Name  ...                                            Website
1          Adolescent Parent Centre  ...  https://www.winnipegsd.ca/AdolescentParentCentre/
2       Andrew Mynarski V.C. School  ...          https://www.winnipegsd.ca/AndrewMynarski/
3    Argyle Alternative High School  ...                  https://www.winnipegsd.ca/Argyle/
4              Brock Corydon School  ...            https://www.winnipegsd.ca/BrockCorydon/
5                  Carpathia School  ...               https://www.winnipegsd.ca/Carpathia/
..                              ...  ...                                                ...
84                    Weston School  ...                  https://www.winnipegsd.ca/Weston/
85             William Whyte School  ...            https://www.winnipegsd.ca/WilliamWhyte/
86  Winnipeg Adult Education Centre  ...   https://www.winnipegsd.ca/WinnipegAdultEdCentre/
87                  Wolseley School  ...                https://www.winnipegsd.ca/Wolseley/
88               WSD Virtual School  ...                 https://www.winnipegsd.ca/Virtual/

[79 rows x 6 columns]

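@AndrejKesely's answer is definitely a more Pythonic way to handle this situation, but you mention in the comments that you are still interested in why your original approach produced missing values. Justifiably so: this is where learning to code should start, by trying to understand why code fails before moving on to a refactored solution.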

1. The phone numbers

Let's do some prints:

for row in table_.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==6:
        # ...
        # Phone.append(cells[5].find(text=True).replace('T: ',''))
        # ...
        print(cells[5].findAll(text=True))

['T:\xa0', '204-775-5440', '\xa0\xa0', 'School Contact Information']
['T: 204-586-8497', '\xa0\xa0', 'School Contact Information', '\xa0']

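The problem here is that the source HTML is inconsistent. Open Chrome DevTools with Ctrl + Shift + J, right-click any of the phone numbers, and select Inspect. You will land in the "Elements" tab and see how the HTML is set up. For example, the first two numbers: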

ph_no1 = """
<div>
 <span>T:&nbsp;</span>
 <span lang="EN">204-775-5440
  <span>&nbsp;&nbsp;</span>
 </span>
</div>
<div> ... School Contact Information </div>
"""

ph_no2 = """
<div>
 <span lang="FR-CA">T: 204-586-8497
  <span>&nbsp;&nbsp;</span>
 </span>
</div>
<div> ... School Contact Information </div>
"""

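The findAll print shown earlier gets you the text of each span in turn. I've only shown the first two here, but that is enough to see why you end up with different data. The problem with the first number entry is that cells[5].find(text=True).replace('T: ','') only gives us the first text fragment, which in the case of ph_no1 is 'T:\xa0'. The replace call leaves it untouched because '\xa0' is a non-breaking space rather than the ordinary space in 'T: ' (the original answer links to an SO post on this).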
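To see the effect without fetching the page, here is a minimal sketch that replays the two fragment lists printed above as plain Python lists. The helper names are assumptions for illustration; only the fragment values come from the printed output.

```python
# Text fragments printed above for the first two rows.
row1 = ['T:\xa0', '204-775-5440', '\xa0\xa0', 'School Contact Information']
row2 = ['T: 204-586-8497', '\xa0\xa0', 'School Contact Information', '\xa0']


def original_approach(fragments):
    """Mimic cells[5].find(text=True).replace('T: ', ''): first fragment only."""
    return fragments[0].replace('T: ', '')


def with_plain_spaces(text):
    """Replace non-breaking spaces with ordinary spaces."""
    return text.replace('\xa0', ' ')


# Row 1: 'T:\xa0' does not contain 'T: ', so nothing is replaced.
print(repr(original_approach(row1)))  # 'T:\xa0'

# Row 2: the number happens to sit in the first fragment, so it works.
print(repr(original_approach(row2)))  # '204-586-8497'

# Normalizing the space makes replace match, but the first fragment of
# row 1 never contained the number, so the result is an empty string.
print(repr(with_plain_spaces(row1[0]).replace('T: ', '')))  # ''
```

The last line shows that fixing the whitespace alone is not enough for the first row: the number lives in a later fragment, so all of the cell's text has to be considered.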

As it happens, a few of the phone numbers are problematic:

df['Phone Number'][df['Phone Number']\
                   .str.extract(r'(\d{3}-\d{3}-\d{4})')[0]\
                       .to_numpy()!=df['Phone Number'].to_numpy()]

0                T: 
32    204-783-9012  # 2 extra spaces
33    204-474-1492  # 2 extra spaces
38    204-452-5015  # 2 extra spaces

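Suggested solution for the phone numbers. Instead of your code, try getting all of the cell's text and using re.search to extract a regex pattern that matches the number: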

import re

Phone.append(re.search(r'(\d{3}-\d{3}-\d{4})',cells[5].get_text()).group())
# e.g. \d{3}- means 3 digits followed by "-" etc.
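One caveat with the line above: re.search returns None when nothing matches, so calling .group() directly would raise an AttributeError for a cell without a phone number. A minimal sketch of a guarded variant follows; the helper name extract_phone is an assumption, and the sample strings are assembled from the fragments printed earlier.

```python
import re

# Three digits, dash, three digits, dash, four digits.
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")


def extract_phone(cell_text):
    """Return the first phone-like substring of `cell_text`, or None if there is none."""
    match = PHONE.search(cell_text)
    return match.group() if match else None


# Joined text of the first two cells (what get_text() would roughly yield).
print(extract_phone("T:\xa0204-775-5440\xa0\xa0School Contact Information"))  # 204-775-5440
print(extract_phone("T: 204-586-8497\xa0\xa0School Contact Information\xa0"))  # 204-586-8497

# A cell without a number gives None instead of raising.
print(extract_phone("School Contact Information"))  # None
```

In the scraping loop this could be used as Phone.append(extract_phone(cells[5].get_text())), which keeps the lists the same length even if a row has no number.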

2. The postal code

The problem here is basically the same. Here is an irregular postal code (the 39th entry), followed by a "regular" one:

pc_error = """
<div>
 <span>290 Lilac St.&nbsp;</span>
 <br>R3M 2T5
</div>
"""

regular_pc = """
<div>
 <span>960 Wolseley Ave.&nbsp;</span>
</div>
<div>
 <span>R3G 1E7
 </span>
</div>
"""

You wrote:

Address.append(cells[4].find(text=True))
PostalCode.append(cells[4].find(text=True).next_element.getText())

But as you can see above, it turns out that the first example does not actually have a next_element that holds the postal code. Now, if you try:

print(len(cells[4].findAll(text=True)))

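you will find that, regardless of the elements involved, the entire text of each cell is in fact captured as a list of two strings (['address', 'postal code']). For example: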

['511 Clifton St.\xa0', 'R3G 2X3']
['136 Cecil St.\xa0', 'R3E 2Y9']

So, in this particular case, we can simply write:

Address.append(cells[4].findAll(text=True)[0].strip()) # 1st elem and strip
PostalCode.append(cells[4].findAll(text=True)[1].strip()) # 2nd elem and strip

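(or again use .get_text() and a regex pattern, as @AndrejKesely did).

I hope this helps clear up the issues and suggests some ways to spot unexpected behaviour (printing is always a good friend!).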
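Indexing [0] and [1] directly works as long as every cell yields exactly two text fragments, which is what was observed for this table. As a hedged sketch, the helper below (split_cell is a hypothetical name, not part of the original code) takes such a fragment list and falls back to None when a piece is missing, so an irregular row would not raise an IndexError.

```python
def split_cell(fragments):
    """Split a cell's text fragments into (address, postal_code).

    Whitespace, including non-breaking spaces, is stripped and empty
    fragments are dropped. Missing pieces are returned as None.
    """
    cleaned = [f.replace("\xa0", " ").strip() for f in fragments]
    cleaned = [f for f in cleaned if f]
    address = cleaned[0] if cleaned else None
    postal_code = cleaned[1] if len(cleaned) > 1 else None
    return address, postal_code


# Fragment lists as printed above.
print(split_cell(['511 Clifton St.\xa0', 'R3G 2X3']))  # ('511 Clifton St.', 'R3G 2X3')
print(split_cell(['136 Cecil St.\xa0', 'R3E 2Y9']))    # ('136 Cecil St.', 'R3E 2Y9')

# Hypothetical irregular cell with only an address.
print(split_cell(['290 Lilac St.\xa0']))               # ('290 Lilac St.', None)
```

In the loop, the fragment list would come from cells[4].findAll(text=True), and the two returned values would be appended to Address and PostalCode respectively.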
