

Missing values in certain cells when scraping data from website using Beautifulsoup in Python and placing it in Pandas DataFrame

I have scraped data from a website using Beautifulsoup, and I want to place it into a Pandas DataFrame and then write it to a file. Most of the data is being written to the file as expected, but some cells are missing values. For example, the first row of the Phone number column is missing a value. The 39th, 45th, and 75th rows of the Postal code column are missing values. Not sure why.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
page = urlopen(schools)

soup = BeautifulSoup(page,features="html.parser")

table_ = soup.find('table')

Name=[]
Address=[]
PostalCode=[]
Phone=[]
Grades=[]
Website=[]
City=[]
Province=[]

for row in table_.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==6:
        Name.append(cells[1].find(text=True))
        Address.append(cells[4].find(text=True))
        PostalCode.append(cells[4].find(text=True).next_element.getText())
        Phone.append(cells[5].find(text=True).replace('T: ',''))
        Grades.append(cells[2].find(text=True))
        Website.append('https://www.winnipegsd.ca'+cells[1].findAll('a')[0]['href'])


df = pd.DataFrame(Name,columns=['Name'])
df['Street Address']=Address
df['Postal Code']=PostalCode
df['Phone Number']=Phone
df['Grades']=Grades
df['Website']=Website

df.to_csv("file.tsv", sep = "\t",index=False)

Try pd.read_html() to extract the data from the table. Then you can do basic .str manipulation:

import requests
import pandas as pd
from bs4 import BeautifulSoup


schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
soup = BeautifulSoup(requests.get(schools).content, "html.parser")

df = pd.read_html(str(soup))[0]
df = df.dropna(how="all", axis=0).drop(columns=["Unnamed: 0", "Unnamed: 3"])
df["Contact"] = (
    df["Contact"]
    .str.replace(r"T:\s*", "", regex=True)
    .str.replace("School Contact Information", "")
    .str.strip()
)
df["Postal Code"] = df["Address"].str.extract(r"(.{3} .{3})$")
df["Website"] = [
    f'https://www.winnipegsd.ca{a["href"]}'
    if "http" not in a["href"]
    else a["href"]
    for a in soup.select("tbody td:nth-child(2) a")
]

print(df.head(10))
df.to_csv("data.csv", index=False)

Prints:

                             School Name Grades                     Address       Contact Postal Code                                            Website
0               Adolescent Parent Centre   9-12       136 Cecil St. R3E 2Y9  204-775-5440     R3E 2Y9  https://www.winnipegsd.ca/AdolescentParentCentre/
1            Andrew Mynarski V.C. School    7-9   1111 Machray Ave. R2X 1H6  204-586-8497     R2X 1H6          https://www.winnipegsd.ca/AndrewMynarski/
2         Argyle Alternative High School  10-12       30 Argyle St. R3B 0H4  204-942-4326     R3B 0H4                  https://www.winnipegsd.ca/Argyle/
3                   Brock Corydon School    N-6   1510 Corydon Ave. R3N 0J6  204-488-4422     R3N 0J6            https://www.winnipegsd.ca/BrockCorydon/
4                       Carpathia School    N-6   300 Carpathia Rd. R3N 1T3  204-488-4514     R3N 1T3               https://www.winnipegsd.ca/Carpathia/
5                       Champlain School    N-6     275 Church Ave. R2W 1B9  204-586-5139     R2W 1B9               https://www.winnipegsd.ca/Champlain/
6      Children of the Earth High School   9-12      100 Salter St. R2W 5M1  204-589-6383     R2W 5M1      https://www.winnipegsd.ca/ChildrenOfTheEarth/
7          Collège Churchill High School   7-12         510 Hay St. R3L 2L6  204-474-1301     R3L 2L6               https://www.winnipegsd.ca/Churchill/
8                         Clifton School    N-6    1070 Clifton St. R3E 2T7  204-783-7792     R3E 2T7                 https://www.winnipegsd.ca/Clifton/
10  Daniel McIntyre Collegiate Institute   9-12  720 Alverstone St. R3E 2H1  204-783-7131     R3E 2H1          https://www.winnipegsd.ca/DanielMcintyre/

and saves data.csv (screenshot from LibreOffice omitted).

You are getting some missing data values because they didn't exist in the original/source HTML DOM/table. So if you don't check for that, you will get a NoneType error and the program will break, but you can easily fix it using an if ... else None expression. The following code should work.

import requests
from bs4 import BeautifulSoup
import pandas as pd

schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
page = requests.get(schools).text

soup = BeautifulSoup(page,"html.parser")
data =[]
for row in soup.table.find_all('tr'):
    Name = row.select_one('td.ms-rteTableOddCol-6:nth-child(2)')
    Name = Name.a.text if Name else None
    #print(Name)
    Address= row.select_one('td.ms-rteTableEvenCol-6:nth-child(5)')
    Address = Address.get_text() if Address else None 
    #print(Address)
    PostalCode=row.select_one('td.ms-rteTableEvenCol-6:nth-child(5)')
    PostalCode = PostalCode.get_text().split('.')[-1] if PostalCode else None
    #print(PostalCode)
    Phone = row.select_one('td.ms-rteTableOddCol-6:nth-child(6)')
    Phone = Phone.get_text().split('School')[-2].replace('T:','') if Phone else None
    #print(Phone)
    Grades= row.select_one('td.ms-rteTableEvenCol-6:nth-child(3)')
    Grades = Grades.get_text() if Grades else None
    #print(Grades)
    Website= row.select_one('td.ms-rteTableOddCol-6:nth-child(2)')
    Website= 'https://www.winnipegsd.ca'+ Website.a.get('href') if Website else None
    #print(Website)
    data.append({
        'Name':Name,
        'Address':Address,
        'PostalCode':PostalCode,
        'Phone':Phone,
        'Grades':Grades,
        'Website':Website
        })

df=pd.DataFrame(data).dropna(how='all')
print(df)

#df.to_csv("file.tsv", sep = "\t",index=False)

Output:

          Name  ...                                            Website
1          Adolescent Parent Centre  ...  https://www.winnipegsd.ca/AdolescentParentCentre/
2       Andrew Mynarski V.C. School  ...          https://www.winnipegsd.ca/AndrewMynarski/
3    Argyle Alternative High School  ...                  https://www.winnipegsd.ca/Argyle/
4              Brock Corydon School  ...            https://www.winnipegsd.ca/BrockCorydon/
5                  Carpathia School  ...               https://www.winnipegsd.ca/Carpathia/
..                              ...  ...                                                ...
84                    Weston School  ...                  https://www.winnipegsd.ca/Weston/
85             William Whyte School  ...            https://www.winnipegsd.ca/WilliamWhyte/
86  Winnipeg Adult Education Centre  ...   https://www.winnipegsd.ca/WinnipegAdultEdCentre/
87                  Wolseley School  ...                https://www.winnipegsd.ca/Wolseley/
88               WSD Virtual School  ...                 https://www.winnipegsd.ca/Virtual/

[79 rows x 6 columns]

The answer by @AndrejKesely is definitely a more pythonic way to handle this case, but you mention in the comments that you are still interested in why your original method had missing values. Justifiably so: this is where learning how to code should start, by trying to understand why the code is failing, well before moving on to a refactored solution.

1. The phone numbers

Let's make some prints:

for row in table_.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==6:
        # ...
        # Phone.append(cells[5].find(text=True).replace('T: ',''))
        # ...
        print(cells[5].findAll(text=True))

['T:\xa0', '204-775-5440', '\xa0\xa0', 'School Contact Information']
['T: 204-586-8497', '\xa0\xa0', 'School Contact Information', '\xa0']

The problem here is inconsistency in the source code. Open up Chrome DevTools with Ctrl + Shift + J, right-click on any of the phone numbers, and select Inspect. You'll land in the "Elements" tab and see how the HTML is set up. E.g. the first two numbers:

ph_no1 = """
<div>
 <span>T:&nbsp;</span>
 <span lang="EN">204-775-5440
  <span>&nbsp;&nbsp;</span>
 </span>
</div>
<div> ... School Contact Information </div>
"""

ph_no2 = """
<div>
 <span lang="FR-CA">T: 204-586-8497
  <span>&nbsp;&nbsp;</span>
 </span>
</div>
<div> ... School Contact Information </div>
"""

The aforementioned prints with findAll get you the texts from each span consecutively. I've only shown the first two here, but that's enough to see why you get different data back. So, the problem with the first entry is that cells[5].find(text=True).replace('T: ','') only gets us the first text snippet, which in the case of ph_no1 is 'T:\xa0'. For the reason why the replace cannot handle this, see e.g. this SO post.
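
To see the failure in isolation, here is a minimal sketch using the snippet from the prints above ('\xa0' is a non-breaking space, not a regular space, so the plain replace finds nothing to substitute):

snippet = 'T:\xa0'  # first text node of ph_no1

# 'T: ' contains an ordinary space, the snippet a non-breaking one, so nothing is replaced
print(repr(snippet.replace('T: ', '')))  # 'T:\xa0'

# normalising the non-breaking space first does work
print(repr(snippet.replace('\xa0', ' ').replace('T: ', '')))  # ''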

As it happens, a couple of phone numbers were problematic:

df['Phone Number'][df['Phone Number']\
                   .str.extract(r'(\d{3}-\d{3}-\d{4})')[0]\
                       .to_numpy()!=df['Phone Number'].to_numpy()]

0                T: 
32    204-783-9012  # 2 extra spaces
33    204-474-1492  # 2 extra spaces
38    204-452-5015  # 2 extra spaces

Suggested solution for the phone numbers: instead of your code, try getting all the text and extracting a regex pattern that matches the number using re.search:

import re

Phone.append(re.search(r'(\d{3}-\d{3}-\d{4})',cells[5].get_text()).group())
# e.g. \d{3}- means 3 digits followed by "-" etc.
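
Note that re.search returns None for a row that has no phone number at all, so if that can happen you may want to guard the call. A small sketch of such a guard (the extract_phone helper is only illustrative, not part of the original code):

import re

def extract_phone(cell):
    """Return the first ddd-ddd-dddd pattern in the cell's text, or None."""
    match = re.search(r'\d{3}-\d{3}-\d{4}', cell.get_text())
    return match.group() if match else None

# inside the row loop:
# Phone.append(extract_phone(cells[5]))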

2. The postal code

The problem here is basically the same. Here's an irregular postal code (the 39th entry), followed by a "regular" one:

pc_error = """
<div>
 <span>290 Lilac St.&nbsp;</span>
 <br>R3M 2T5
</div>
"""

regular_pc = """
<div>
 <span>960 Wolseley Ave.&nbsp;</span>
</div>
<div>
 <span>R3G 1E7
 </span>
</div>
"""

You wrote:

Address.append(cells[4].find(text=True))
PostalCode.append(cells[4].find(text=True).next_element.getText())

But as you can see above, it turns out that the first example does not actually have a next_element. Now, if you try:

print(len(cells[4].findAll(text=True)))

You'll find that, regardless of the elements, the entire text of each cell will in fact be captured as a list of two strings (['address', 'postal code']). E.g.:

['511 Clifton St.\xa0', 'R3G 2X3']
['136 Cecil St.\xa0', 'R3E 2Y9']

So, in this particular case, we could simply write:

Address.append(cells[4].findAll(text=True)[0].strip()) # 1st elem and strip
PostalCode.append(cells[4].findAll(text=True)[1].strip()) # 2nd elem and strip

(Or again do .get_text() and use a regex pattern, as done by @AndrejKesely.)
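
For completeness, a minimal sketch of that regex-based variant; the sample string mirrors the irregular entry above, and the pattern assumes Canadian postal codes of the form 'A9A 9A9':

import re

# sample cell text, as cells[4].get_text() would return it for the irregular row
text = '290 Lilac St.\xa0R3M 2T5'

pc_match = re.search(r'[A-Z]\d[A-Z]\s*\d[A-Z]\d', text)
postal_code = pc_match.group() if pc_match else None                      # 'R3M 2T5'
address = text[:pc_match.start()].strip() if pc_match else text.strip()   # '290 Lilac St.'

print(address, '|', postal_code)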

Hope this helps a bit in clearing up the issues and in suggesting some methods for spotting unexpected behaviour (prints are always a good friend).
