Scraping of Web Page Tables using Beautiful Soup Python

I am trying to scrape a table and its contents from the Apple Wikipedia page. I am using Beautiful Soup to extract the data. I have the following code:

import requests
from bs4 import BeautifulSoup

appleurl = "https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products"

# Download the page and parse it (the lxml parser must be installed)
_content = requests.get(appleurl)
soup = BeautifulSoup(_content.content, "lxml")

# Take the first table on the page and walk its rows
_table = soup.findChildren('table')
rows = _table[0].findChildren(['th','tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        # .string is None when a cell contains more than one child node
        value = cell.string
        print("The value in this cell is %s" % value)

I get the following values:

The value in this cell is 1976
The value in this cell is April 11
The value in this cell is Apple I
The value in this cell is Apple I
The value in this cell is September 1, 1977

The value in this cell is 1977
The value in this cell is April 1
The value in this cell is Apple II
The value in this cell is Apple II
The value in this cell is June 1, 1979

The value in this cell is 1978
The value in this cell is June 1
The value in this cell is Disk II
The value in this cell is Drives
The value in this cell is May 1, 1984

The value in this cell is 1979
The value in this cell is June 1
The value in this cell is Apple II Plus
The value in this cell is Apple II series
The value in this cell is December 1, 1982

The value in this cell is None
The value in this cell is None
The value in this cell is None
The value in this cell is Bell & Howell Disk II
The value in this cell is None
The value in this cell is Apple SilenType
The value in this cell is Printers
The value in this cell is October 1, 1982

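The problem is that 1979 has several models, and my code does not extract them. I need all the models for 1979. My code works well when there is only one row per year. What should I do when a year spans multiple rows, as in the first table at the link above? The values I need are Year, Release Date, and Model; the other two columns can be dropped. I would greatly appreciate your help.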

You can do this simply with pandas, using pandas.read_html():

import pandas as pd

# read_html returns one DataFrame per <table>; take the first one
df = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')[0]
# Keep only the three columns of interest
print(df[['Year', 'Release Date', 'Model']])

Output

 Year Release Date                  Model
0  1976     April 11                Apple I
1  1977      April 1               Apple II
2  1978       June 1                Disk II
3  1979       June 1          Apple II Plus
4  1979       June 1      Apple II EuroPlus
5  1979       June 1        Apple II J-Plus
6  1979       June 1          Bell & Howell
7  1979       June 1  Bell & Howell Disk II
8  1979       June 1        Apple SilenType

Update: all tables

import pandas as pd

dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# Print the three columns from every table on the page
for df in dfs:
    print(df[['Year', 'Release Date', 'Model']])
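
The loop above assumes that every table on the page has Year, Release Date, and Model columns; a table without them would raise a KeyError. Below is a minimal sketch of a guard. It uses small hand-made DataFrames in place of the downloaded tables, and the helper name select_columns and the sample rows are illustrative, not part of the original answer.

```python
import pandas as pd

WANTED = ["Year", "Release Date", "Model"]


def select_columns(tables, wanted=WANTED):
    """Return the wanted columns from each table that has all of them."""
    selected = []
    for table in tables:
        # Skip tables that lack any of the wanted columns.
        if all(col in table.columns for col in wanted):
            selected.append(table[wanted])
    return selected


# Hand-made stand-ins for the list returned by pd.read_html(...).
tables = [
    pd.DataFrame({"Year": [1976], "Release Date": ["April 11"],
                  "Model": ["Apple I"], "Family": ["Apple I"]}),
    pd.DataFrame({"Note": ["not a product table"]}),
]

for df in select_columns(tables):
    print(df)
```

With the real page, the list returned by read_html could be passed in place of the tables list.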

If you want to collect everything in a single DataFrame, use this code.

import pandas as pd

dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
parts = [df[['Year', 'Release Date', 'Model']] for df in dfs]
dffinal = pd.concat(parts, ignore_index=True, sort=False)

print(dffinal)

Output

  Year  Release Date                                              Model
0    1976      April 11                                            Apple I
1    1977       April 1                                           Apple II
2    1978        June 1                                            Disk II
3    1979        June 1                                      Apple II Plus
4    1979        June 1                                  Apple II EuroPlus
5    1979        June 1                                    Apple II J-Plus
6    1979        June 1                                      Bell & Howell
7    1979        June 1                              Bell & Howell Disk II
8    1979        June 1                                    Apple SilenType
9    1980   September 1                                          Apple III
10   1980   September 1                           Modem IIB (Novation CAT)
11   1980   September 1                       Printer IIA (Centronics 779)
12   1980   September 1                                        Monitor III
13   1980   September 1                   Monitor II (various third party)
14   1980   September 1                                           Disk III
15   1981   September 1                                      Apple ProFile
16   1981    December 1                               Apple III Revised[1]
17   1982     October 1                           Apple Dot Matrix Printer
18   1982     October 1                          Apple Daisy Wheel Printer
19   1983     January 1                                          Apple IIe
20   1983     January 1                                      Apple Lisa[2]
21   1983    December 1                                     Apple III Plus
22   1983    December 1                                  Apple ImageWriter
23   1984     January 1                                       Apple Lisa 2
24   1984    January 24                                   Macintosh (128K)
25   1984    January 24               Macintosh External Disk Drive (400K)
26   1984    January 24                                    Apple Modem 300
27   1984    January 24                                   Apple Modem 1200
28   1984       April 1                                          Apple IIc
29   1984       April 1                               Apple Scribe Printer
..    ...           ...                                                ...
606  2019      March 18                                iPad Mini (5th gen)
607  2019      March 19   iMac with Retina 4K display (21.5") (Early 2019)
608  2019      March 19     iMac with Retina 5K display (27") (Early 2019)
609  2019      March 20                                  AirPods (2nd gen)
610  2019        May 21  MacBook Pro with Touch Bar (4th gen) (13") (Mi...
611  2019        May 21  MacBook Pro with Touch Bar (4th gen) (15") (Mi...
612  2019        May 28                               iPod Touch (7th gen)
613  2019        July 9                           MacBook Air (13") (2019)
614  2019        July 9  Macbook Pro with Touch Bar (4th gen) (13") (Mi...
615  2019  September 20                               Apple Watch Series 5
616  2019  September 20                        Apple Watch Hermès Series 5
617  2019  September 20                          Apple Watch Nike Series 5
618  2019  September 20                       Apple Watch Edition Series 5
619  2019  September 20                                  iPhone 8 (128 GB)
620  2019  September 20                             iPhone 8 Plus (128 GB)
621  2019  September 20                                          iPhone 11
622  2019  September 20                                      iPhone 11 Pro
623  2019  September 20                                  iPhone 11 Pro Max
624  2019  September 25                                        iPad (2019)
625  2019    October 30                                        AirPods Pro
626  2019   November 13                   MacBook Pro with Touch Bar (16")
627  2019   December 10                                Mac Pro (Late 2019)
628  2019   December 10                                    Pro Display XDR
629  2020      March 18                                                NaN
630  2020      March 18                           iPad Pro (11") (2nd gen)
631  2020      March 18                         iPad Pro (12.9") (4th gen)
632  2020      March 18                        Magic Keyboard for iPad Pro
633  2020      March 18                           MacBook Air (Early 2020)
634  2020      April 24                                iPhone SE (2nd gen)
635  2020         May 4         MacBook Pro with Magic Keyboard (Mid 2020)

[636 rows x 3 columns]
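
If you would rather stay with Beautiful Soup, the rows for 1979 most likely go missing because the Year and Release Date cells carry a rowspan attribute, so the later rows of that year contain fewer td cells. The sketch below shows one way to carry spanned values down into the following rows. It is parser-independent: each cell is a (text, rowspan) tuple, which with Beautiful Soup could be built from cell.get_text(strip=True) and int(cell.get("rowspan", 1)). Using get_text would also avoid the None values that .string returns for cells with nested tags. The function name expand_rowspans and the sample rowspan values are illustrative assumptions, and colspan is not handled.

```python
def expand_rowspans(rows):
    """Repeat rowspan cells in the rows they cover.

    rows: list of rows, each a list of (text, rowspan) tuples in HTML order.
    Returns a list of rows of plain text with spanned values filled in.
    """
    pending = {}  # column index -> (text, number of rows still to fill)
    grid = []
    for row in rows:
        cells = iter(row)
        out = []
        col = 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
            else:
                cell = next(cells, None)
                if cell is None:
                    break  # no more cells in this row
                text, span = cell
                out.append(text)
                if span > 1:
                    pending[col] = (text, span - 1)
            col += 1
        grid.append(out)
    return grid


# Illustrative rows: the first two cells span three rows.
rows = [
    [("1979", 3), ("June 1", 3), ("Apple II Plus", 1)],
    [("Apple II EuroPlus", 1)],
    [("Apple II J-Plus", 1)],
]
for year, released, model in expand_rowspans(rows):
    print(year, released, model)
```

Each printed line then carries its own year and release date, which is the shape that pandas.read_html produces for the same table.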

