Scraping of Web Page Tables using Beautiful Soup Python
I am trying to scrape the tables and their contents from the Apple Wikipedia page. I am using Beautiful Soup to extract the data. I have the following code:
import requests
from bs4 import BeautifulSoup

appleurl = "https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products"

# Download the page and parse it with the lxml parser
_content = requests.get(appleurl)
soup = BeautifulSoup(_content.content, "lxml")

# Take the first table on the page and walk through its rows
_table = soup.findChildren('table')
rows = _table[0].findChildren(['th', 'tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        # .string is None when a cell has more than one child node
        value = cell.string
        print("The value in this cell is %s" % value)
I get the following values:
The value in this cell is 1976
The value in this cell is April 11
The value in this cell is Apple I
The value in this cell is Apple I
The value in this cell is September 1, 1977
The value in this cell is 1977
The value in this cell is April 1
The value in this cell is Apple II
The value in this cell is Apple II
The value in this cell is June 1, 1979
The value in this cell is 1978
The value in this cell is June 1
The value in this cell is Disk II
The value in this cell is Drives
The value in this cell is May 1, 1984
The value in this cell is 1979
The value in this cell is June 1
The value in this cell is Apple II Plus
The value in this cell is Apple II series
The value in this cell is December 1, 1982
The value in this cell is None
The value in this cell is None
The value in this cell is None
The value in this cell is Bell & Howell Disk II
The value in this cell is None
The value in this cell is Apple SilenType
The value in this cell is Printers
The value in this cell is October 1, 1982
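The problem is that 1979 has multiple models, and in my case they are not extracted. I need all the models for 1979. My code extracts the data fine when there is only one row per year. What should I do when a year spans multiple rows, as in the first table at the link I provided? The values I need are the year, the release date, and the model. The other two columns can be dropped. I would greatly appreciate your help.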
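One way to stay with Beautiful Soup is to carry cells that use rowspan down into the rows they cover, and to read text with get_text() instead of .string. The sketch below is only an illustration: it assumes the Year and Release Date cells use rowspan, it ignores colspan, and SAMPLE_HTML and expand_rowspans are made-up names with a simplified stand-in for the real page markup.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia table (assumed structure, not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Release Date</th><th>Model</th><th>Family</th></tr>
  <tr><td rowspan="3">1979</td><td rowspan="3">June 1</td>
      <td><a href="#">Apple II Plus</a></td><td>Apple II series</td></tr>
  <tr><td>Bell &amp; Howell <i>Disk II</i></td><td>Drives</td></tr>
  <tr><td>Apple SilenType</td><td>Printers</td></tr>
</table>
"""


def expand_rowspans(table):
    """Return the table as a list of rows, repeating cells that span several rows."""
    grid = []
    pending = {}  # column index -> (rows still to fill, cell text)
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])
        row, col, i = [], 0, 0
        while i < len(cells) or col in pending:
            if col in pending:
                # This column is covered by a rowspan cell from an earlier row
                remaining, text = pending[col]
                row.append(text)
                if remaining > 1:
                    pending[col] = (remaining - 1, text)
                else:
                    del pending[col]
            else:
                cell = cells[i]
                i += 1
                # get_text joins nested text, where .string would return None
                text = cell.get_text(" ", strip=True)
                span = int(cell.get("rowspan", 1))
                row.append(text)
                if span > 1:
                    pending[col] = (span - 1, text)
            col += 1
        grid.append(row)
    return grid


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
for year, date, model in (r[:3] for r in expand_rowspans(table)[1:]):
    print(year, date, model)
```

Slicing each row to its first three entries keeps the year, release date, and model and drops the other two columns.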
You can do this simply with pandas, using pd.read_html():
import pandas as pd

# read_html returns a list of DataFrames, one per table; [0] is the first table
df = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')[0]
print(df[['Year', 'Release Date', 'Model']])
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
Update: for all tables.
import pandas as pd

dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# Print the three wanted columns of every table on the page
for df in dfs:
    print(df[['Year', 'Release Date', 'Model']])
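A loop like this raises a KeyError as soon as one table on the page lacks one of the three columns. A small guard can skip such tables. This is a sketch only: pick_columns and WANTED are hypothetical names, and the frames are hand-built stand-ins for what pd.read_html would return.

```python
import pandas as pd

WANTED = ["Year", "Release Date", "Model"]


def pick_columns(frames, wanted=WANTED):
    """Keep only the frames that have every wanted column, trimmed to those columns."""
    return [df[wanted] for df in frames if set(wanted).issubset(df.columns)]


# Stand-in frames instead of a live pd.read_html(...) call
frames = [
    pd.DataFrame({
        "Year": [1976],
        "Release Date": ["April 11"],
        "Model": ["Apple I"],
        "Family": ["Apple I"],
    }),
    pd.DataFrame({"Note": ["a table without the product columns"]}),
]

for part in pick_columns(frames):
    print(part)
```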
If you want everything in a single DataFrame, use this code.
import pandas as pd

dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# Collect the wanted columns from each table, then concatenate once
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = [df[['Year', 'Release Date', 'Model']] for df in dfs]
dffinal = pd.concat(frames, ignore_index=True)
print(dffinal)
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
9 1980 September 1 Apple III
10 1980 September 1 Modem IIB (Novation CAT)
11 1980 September 1 Printer IIA (Centronics 779)
12 1980 September 1 Monitor III
13 1980 September 1 Monitor II (various third party)
14 1980 September 1 Disk III
15 1981 September 1 Apple ProFile
16 1981 December 1 Apple III Revised[1]
17 1982 October 1 Apple Dot Matrix Printer
18 1982 October 1 Apple Daisy Wheel Printer
19 1983 January 1 Apple IIe
20 1983 January 1 Apple Lisa[2]
21 1983 December 1 Apple III Plus
22 1983 December 1 Apple ImageWriter
23 1984 January 1 Apple Lisa 2
24 1984 January 24 Macintosh (128K)
25 1984 January 24 Macintosh External Disk Drive (400K)
26 1984 January 24 Apple Modem 300
27 1984 January 24 Apple Modem 1200
28 1984 April 1 Apple IIc
29 1984 April 1 Apple Scribe Printer
.. ... ... ...
606 2019 March 18 iPad Mini (5th gen)
607 2019 March 19 iMac with Retina 4K display (21.5") (Early 2019)
608 2019 March 19 iMac with Retina 5K display (27") (Early 2019)
609 2019 March 20 AirPods (2nd gen)
610 2019 May 21 MacBook Pro with Touch Bar (4th gen) (13") (Mi...
611 2019 May 21 MacBook Pro with Touch Bar (4th gen) (15") (Mi...
612 2019 May 28 iPod Touch (7th gen)
613 2019 July 9 MacBook Air (13") (2019)
614 2019 July 9 Macbook Pro with Touch Bar (4th gen) (13") (Mi...
615 2019 September 20 Apple Watch Series 5
616 2019 September 20 Apple Watch Hermès Series 5
617 2019 September 20 Apple Watch Nike Series 5
618 2019 September 20 Apple Watch Edition Series 5
619 2019 September 20 iPhone 8 (128 GB)
620 2019 September 20 iPhone 8 Plus (128 GB)
621 2019 September 20 iPhone 11
622 2019 September 20 iPhone 11 Pro
623 2019 September 20 iPhone 11 Pro Max
624 2019 September 25 iPad (2019)
625 2019 October 30 AirPods Pro
626 2019 November 13 MacBook Pro with Touch Bar (16")
627 2019 December 10 Mac Pro (Late 2019)
628 2019 December 10 Pro Display XDR
629 2020 March 18 NaN
630 2020 March 18 iPad Pro (11") (2nd gen)
631 2020 March 18 iPad Pro (12.9") (4th gen)
632 2020 March 18 Magic Keyboard for iPad Pro
633 2020 March 18 MacBook Air (Early 2020)
634 2020 April 24 iPhone SE (2nd gen)
635 2020 May 4 MacBook Pro with Magic Keyboard (Mid 2020)
[636 rows x 3 columns]
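The combined output still carries footnote markers such as [1] and at least one row with no model (row 629). If that matters, a light cleanup pass could look like the sketch below. clean_models is a hypothetical helper, and the sample frame reuses three rows from the output above.

```python
import pandas as pd


def clean_models(df):
    """Drop rows without a model and strip trailing footnote markers like [1]."""
    out = df.dropna(subset=["Model"]).copy()
    out["Model"] = out["Model"].str.replace(r"\[\d+\]$", "", regex=True).str.strip()
    return out.reset_index(drop=True)


sample = pd.DataFrame({
    "Year": [1981, 2020, 2020],
    "Release Date": ["December 1", "March 18", "March 18"],
    "Model": ["Apple III Revised[1]", None, 'iPad Pro (11") (2nd gen)'],
})
print(clean_models(sample))
```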