Scraping for bottom row of table

I'm using Python 3.4. I know how to use BeautifulSoup to scrape a webpage, but I'm trying to come up with the most efficient way to do this. The Nexus factory image page (Android) lists all Nexus devices and is updated when a new build is available. The newest builds are always added to the bottom of the respective table. I have a list of the names of each device, both real name and codename, and I only pull these (the devices themselves are only updated once a year, if that, and only some of the devices still receive updates).

What would be the most efficient way to pull the bottom entry out of each table? I plan to save each string from the first <td> in the bottom rows as pickled objects so I can easily compare strings later to check whether the current bottom row is new, but I'm not sure what the best way to scrape for the entry itself would be.
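The save-and-compare step described above could be sketched roughly as follows. This is only a minimal illustration: the helper names `load_seen` and `find_new_builds`, the state file name, and the idea of keeping one dictionary of codename to build string are assumptions, not something the page or BeautifulSoup prescribes.

```python
import pickle
import tempfile
from pathlib import Path


def load_seen(path):
    """Return the previously saved {device: build} mapping, or {} on the first run."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open("rb") as fh:
        return pickle.load(fh)


def find_new_builds(latest, path):
    """Return the entries of `latest` that differ from the saved mapping, then save `latest`."""
    seen = load_seen(path)
    changed = {dev: build for dev, build in latest.items() if seen.get(dev) != build}
    with Path(path).open("wb") as fh:
        pickle.dump(latest, fh)
    return changed


# Hypothetical state file in a temporary directory, for demonstration only.
state_file = Path(tempfile.mkdtemp()) / "latest_builds.pickle"

first = find_new_builds({"angler": "6.0.1 (MMB29M)"}, state_file)   # nothing saved yet
second = find_new_builds({"angler": "6.0.1 (MMB29P)"}, state_file)  # build changed
third = find_new_builds({"angler": "6.0.1 (MMB29P)"}, state_file)   # no change
print(first, second, third)
```

Storing a single dictionary rather than one pickle per device keeps the comparison to one file read per run; a JSON file would probably work just as well for plain strings.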

Each <tr> has an id of the format devnamebuildnumber. Since I have the name of each device and will have the latest string, I should be able to search by that using soup.find("tr", id=dev + buildstring). That returns every sibling and child of the found row, however, so I'm not sure how best to use that.
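For what it's worth, here is a small sketch of how a row found by id might be used. It runs on a made-up HTML fragment, so the lowercase id values and the table layout are assumptions about the real page. As far as I know, find() returns a single Tag (the row and its children), and find_next_siblings() then gives any rows that follow it:

```python
from bs4 import BeautifulSoup

# Made-up fragment; the real page's ids and columns may differ.
html = """
<table>
  <tr><th>Version</th><th>Download</th></tr>
  <tr id="anglermda89d"><td>6.0.0 (MDA89D)</td><td>Link</td></tr>
  <tr id="anglermmb29p"><td>6.0.1 (MMB29P)</td><td>Link</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

dev, buildstring = "angler", "mda89d"  # last build seen on a previous run
row = soup.find("tr", id=dev + buildstring)

# Text of the first cell of the saved row.
saved_version = row.find("td").get_text(strip=True)

# Any <tr> after the saved row is a build that was added since then.
newer = [r.find("td").get_text(strip=True) for r in row.find_next_siblings("tr")]
print(saved_version, newer)
```

If `newer` is empty, the saved build is still the bottom row; otherwise its last element would be the new latest build.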

Here is something to get you started. The idea is to get the h2 elements that have an id attribute; except for the very first one, these are the device name elements. For every element found, get the next table element and parse the versions into a list. Implementation:

from pprint import pprint

import requests
from bs4 import BeautifulSoup


url = "https://developers.google.com/android/nexus/images"
response = requests.get(url)

soup = BeautifulSoup(response.content, "lxml")

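data = {}
# Skip the first h2 with an id; the remaining ones are device headings.
for device in soup.find_all("h2", id=True)[1:]:
    device_name = device.get_text(strip=True)

    # First cell of every id-bearing row in the table that follows the heading.
    data[device_name] = [version.find("td").get_text(strip=True)
                         for version in device.find_next("table").find_all("tr", id=True)]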

pprint(data)

This prints a dictionary with device names as keys and lists of versions as values:

{'"angler" for Nexus 6P': ['6.0.0 (MDA89D)',
                           '6.0.0 (MDB08K)',
                           '6.0.0 (MDB08L)',
                           '6.0.0 (MDB08M)',
                           '6.0.0 (MMB29N)',
                           '6.0.1 (MMB29M)',
                           '6.0.1 (MMB29P)'],
 '"bullhead" for Nexus 5X': ['6.0.0 (MDA89E)',
                             '6.0.0 (MDB08I)',
                             '6.0.0 (MDB08L)',
                             '6.0.0 (MDB08M)',
                             '6.0.1 (MMB29K)',
                             '6.0.1 (MMB29P)'],
 '"fugu" for Nexus Player': ['5.0 (LRX21M)',
                             '5.0 (LRX21V)',
                             '5.1.0 (LMY47D)',
                             '5.1.1 (LMY47V)',
                             '5.1.1 (LMY48J)',
                             '5.1.1 (LMY48N)',
                             '6.0.0 (MRA58K)',
                             '6.0.0 (MRA58N)',
                             '6.0.1 (MMB29M)',
                             '6.0.1 (MMB29T)'],
 '"hammerhead" for Nexus 5 (GSM/LTE)': ['4.4 (KRT16M)',
                                        '4.4.2 (KOT49H)',
                                        '4.4.3 (KTU84M)',
                                        '4.4.4 (KTU84P)',
                                        '4.4.4 Release 2 (For 2Degrees/NZ, '
                                        'Telstra/AUS and India ONLY) (KTU84Q)',
                                        '5.0 (LRX21O)',
                                        '5.0.1 (LRX22C)',
                                        '5.1.0 (LMY47D)',
                                        '5.1.0 (LMY47I)',
                                        '5.1.1 (LMY48B)',
                                        '5.1.1 (LMY48I)',
                                        '5.1.1 (LMY48M)',
                                        '6.0.0 (MRA58K)',
                                        '6.0.0 (MRA58N)',
                                        '6.0.1 (MMB29K)',
                                        '6.0.1 (MMB29S)'],
 '"mantaray" for Nexus 10': ['4.2.2 (JDQ39)',
                             '4.3 (JWR66Y)',
                             '4.4 (KRT16S)',
                             '4.4.2 (KOT49H)',
                             '4.4.3 (KTU84L)',
                             '4.4.4 (KTU84P)',
                             '5.0 (LRX21P)',
                             '5.0.1 (LRX22C)',
                             '5.0.2 (LRX22G)',
                             '5.1.0 (LMY47D)',
                             '5.1.1 (LMY47V)',
                             '5.1.1 (LMY48I)',
                             '5.1.1 (LMY48M)',
                             '5.1.1 (LMY48T)',
                             '5.1.1 (LMY48X)',
                             '5.1.1 (LMY48Z)',
                             '5.1.1 (LMY49F)'],
 '"mysid" for Galaxy Nexus "toro" (Verizon CDMA/LTE)': ['4.0.4 (IMM76K)',
                                                        '4.1.1 (JRO03O)',
                                                        '4.2.2 (JDQ39)'],
 '"mysidspr" for Galaxy Nexus "toroplus" (Sprint CDMA/LTE)': ['4.1.1 (FH05)',
                                                              '4.2.1 (GA02)'],
 '"nakasi" for Nexus 7 (Wi-Fi)': ['4.1.2 (JZO54K)',
                                  '4.2.2 (JDQ39)',
                                  '4.3 (JWR66Y)',
                                  '4.4 (KRT16S)',
                                  '4.4.2 (KOT49H)',
                                  '4.4.3 (KTU84L)',
                                  '4.4.4 (KTU84P)',
                                  '5.0 (LRX21P)',
                                  '5.0.2 (LRX22G)',
                                  '5.1.0 (LMY47D)',
                                  '5.1.1 (LMY47V)'],
 '"nakasig" for Nexus 7 (Mobile)': ['4.2.2 (JDQ39)',
                                    '4.3 (JWR66Y)',
                                    '4.4 (KRT16S)',
                                    '4.4.2 (KOT49H)',
                                    '4.4.3 (KTU84L)',
                                    '4.4.4 (KTU84P)',
                                    '5.0.2 (LRX22G)',
                                    '5.1.0 (LMY47D)',
                                    '5.1.1 (LMY47V)'],
 '"occam" for Nexus 4': ['4.2.2 (JDQ39)',
                         '4.3 (JWR66Y)',
                         '4.4 (KRT16S)',
                         '4.4.2 (KOT49H)',
                         '4.4.3 (KTU84L)',
                         '4.4.4 (KTU84P)',
                         '5.0 (LRX21T)',
                         '5.0.1 (LRX22C)',
                         '5.1.0 (LMY47O)',
                         '5.1.1 (LMY47V)',
                         '5.1.1 (LMY48I)',
                         '5.1.1 (LMY48M)',
                         '5.1.1 (LMY48T)'],
 '"razor" for Nexus 7 [2013] (Wi-Fi)': ['4.3 (JSS15Q)',
                                        '4.3 (JSS15R)',
                                        '4.4 (KRT16S)',
                                        '4.4.2 (KOT49H)',
                                        '4.4.3 (KTU84L)',
                                        '4.4.4 (KTU84P)',
                                        '5.0 (LRX21P)',
                                        '5.0.1 (LRX22C)',
                                        '5.0.2 (LRX22G)',
                                        '5.1.0 (LMY47O)',
                                        '5.1.1 (LMY47V)',
                                        '5.1.1 (LMY48G)',
                                        '5.1.1 (LMY48I)',
                                        '5.1.1 (LMY48M)',
                                        '5.1.1 (LMY48T)',
                                        '6.0.0 (MRA58K)',
                                        '6.0.0 (MRA58U)',
                                        '6.0.0 (MRA58V)',
                                        '6.0.1 (MMB29K)',
                                        '6.0.1 (MMB29O)'],
 '"razorg" for Nexus 7 [2013] (Mobile)': ['4.3 (JLS36C)',
                                          '4.3.1 (JLS36I)',
                                          '4.4 (KRT16S)',
                                          '4.4.2 (KOT49H)',
                                          '4.4.2_r2 (Verizon) (KVT49L)',
                                          '4.4.3 (KTU84L)',
                                          '4.4.4 (KTU84P)',
                                          '5.0.2 (LRX22G)',
                                          '5.1.0 (LMY47O)',
                                          '5.1.1 (LMY47V)',
                                          '5.1.1 (LMY48P)',
                                          '5.1.1 (LMY48U)',
                                          '5.1.1 (LMY48X)',
                                          '5.1.1 (LMY48Z)',
                                          '6.0.0 (MRA58K)',
                                          '6.0.0 (MRA58N)',
                                          '6.0.0 (MRA58V)',
                                          '6.0.0 (MRA59B)',
                                          '6.0.1 (MMB29K)',
                                          '6.0.1 (MMB29O)'],
 '"ryu" for Pixel C': ['6.0.1 (MXB48J)', '6.0.1 (MXB48K)'],
 '"shamu" for Nexus 6': ['5.0 (LRX21O)',
                         '5.0.1 (LRX22C)',
                         '5.1.0 (LMY47D)',
                         '5.1.0 (LMY47E)',
                         '5.1.0 (LMY47I)',
                         '5.1.0 (For T-Mobile ONLY) (LMY47M)',
                         '5.1.1 (All carriers except T-Mobile US) (LMY47Z)',
                         '5.1.1 (For T-Mobile ONLY) (LYZ28E)',
                         '5.1.1 (For Project Fi ONLY) (LVY48C)',
                         '5.1.1 (LMY48I)',
                         '5.1.1 (For T-Mobile ONLY) (LYZ28J)',
                         '5.1.1 (For Project Fi ONLY) (LVY48E)',
                         '5.1.1 (LMY48M)',
                         '5.1.1 (For T-Mobile ONLY) (LYZ28K)',
                         '5.1.1 (For Project Fi ONLY) (LVY48F)',
                         '5.1.1 (LMY48T)',
                         '5.1.1 (For T-Mobile ONLY) (LYZ28M)',
                         '5.1.1 (For Project Fi ONLY) (LVY48H)',
                         '5.1.1 (LMY48W)',
                         '5.1.1 (LMY48X)',
                         '5.1.1 (LMY48Y)',
                         '5.1.1 (For T-Mobile ONLY) (LYZ28N)',
                         '5.1.1 (For Project Fi ONLY) (LVY48I)',
                         '6.0.0 (MRA58K)',
                         '6.0.0 (MRA58N)',
                         '6.0.0 (MRA58R)',
                         '6.0.0 (MRA58X)',
                         '6.0.1 (MMB29K)',
                         '6.0.1 (MMB29S)'],
 '"soju" for Nexus S (worldwide version, i9020t and i9023)': ['2.3.6 (GRK39F)',
                                                              '4.0.4 (IMM76D)',
                                                              '4.1.2 (JZO54K)'],
 '"sojua" for Nexus S (850MHz version, i9020a)': ['2.3.6 (GRK39F)',
                                                  '4.0.4 (IMM76D)',
                                                  '4.1.2 (JZO54K)'],
 '"sojuk" for Nexus S (Korea version, m200)': ['2.3.6 (GRK39F)',
                                               '4.0.4 (IMM76D)',
                                               '4.1.1 (JRO03E)'],
 '"sojus" for Nexus S 4G (d720)': ['2.3.7 (GWK74)',
                                   '4.0.4 (IMM76D)',
                                   '4.1.1 (JRO03R)'],
 '"takju" for Galaxy Nexus "maguro" (GSM/HSPA+) (with Google Wallet)': ['4.0.4 '
                                                                        '(IMM76I)',
                                                                        '4.1.2 '
                                                                        '(JZO54K)',
                                                                        '4.2.2 '
                                                                        '(JDQ39)',
                                                                        '4.3 '
                                                                        '(JWR66Y)'],
 '"tungsten" for Nexus Q': ['4.0.4 (IAN67K)'],
 '"volantis" for Nexus 9 (Wi-Fi)': ['5.0 (LRX21Q)',
                                    '5.0 (LRX21R)',
                                    '5.0.1 (LRX22C)',
                                    '5.0.2 (LRX22L)',
                                    '5.1.1 (LMY47X)',
                                    '5.1.1 (LMY48I)',
                                    '5.1.1 (LMY48M)',
                                    '5.1.1 (LMY48T)',
                                    '6.0.0 (MRA58K)',
                                    '6.0.0 (MRA58N)',
                                    '6.0.1 (MMB29K)',
                                    '6.0.1 (MMB29S)'],
 '"volantisg" for Nexus 9 (LTE)': ['5.0.1 (LRX22C)',
                                   '5.0.2 (LRX22L)',
                                   '5.1.1 (LMY47X)',
                                   '5.1.1 (LMY48I)',
                                   '5.1.1 (LMY48M)',
                                   '5.1.1 (LMY48T)',
                                   '5.1.1 (LMY48X)',
                                   '5.1.1 (LMY48Z)',
                                   '5.1.1 (LMY49F)',
                                   '6.0.0 (MRA58K)',
                                   '6.0.0 (MRA58N)',
                                   '6.0.1 (MMB29K)',
                                   '6.0.1 (MMB29S)'],
 '"yakju" for Galaxy Nexus "maguro" (GSM/HSPA+)': ['4.0.4 (IMM76I)',
                                                   '4.1.2 (JZO54K)',
                                                   '4.2.2 (JDQ39)',
                                                   '4.3 (JWR66Y)']}

The following produces a list containing the last entry for each device. You still need to iterate through all of the devices, but for each one you keep only the last row:

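from bs4 import BeautifulSoup
import requests


html = requests.get("https://developers.google.com/android/nexus/images")
soup = BeautifulSoup(html.text, "lxml")
models = []

# Skip the first h2 (not a device); each remaining h2 heads one device table.
for h2 in soup.find_all('h2', id=True)[1:]:
    # The newest build is the last row that carries an id.
    tr = h2.find_next('table').find_all('tr', id=True)[-1]
    td = [t.text.strip() for t in tr.find_all('td')]
    models.append([h2.text] + td)

# The unpacking assumes four cells per row (version, link and two checksum columns).
for device, version, link, cs1, cs2 in models:
    print('{}, {}'.format(device, version))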

This displays the following:

"ryu" for Pixel C, 6.0.1 (MXB48K)
"angler" for Nexus 6P, 6.0.1 (MMB29P)
"bullhead" for Nexus 5X, 6.0.1 (MMB29P)
"shamu" for Nexus 6, 6.0.1 (MMB29S)
"fugu" for Nexus Player, 6.0.1 (MMB29T)
"volantisg" for Nexus 9 (LTE), 6.0.1 (MMB29S)
"volantis" for Nexus 9 (Wi-Fi), 6.0.1 (MMB29S)
"hammerhead" for Nexus 5 (GSM/LTE), 6.0.1 (MMB29S)
"razor" for Nexus 7 [2013] (Wi-Fi), 6.0.1 (MMB29O)
"razorg" for Nexus 7 [2013] (Mobile), 6.0.1 (MMB29O)
"mantaray" for Nexus 10, 5.1.1 (LMY49F)
"occam" for Nexus 4, 5.1.1 (LMY48T)
"nakasi" for Nexus 7 (Wi-Fi), 5.1.1 (LMY47V)
"nakasig" for Nexus 7 (Mobile), 5.1.1 (LMY47V)
"tungsten" for Nexus Q, 4.0.4 (IAN67K)
"takju" for Galaxy Nexus "maguro" (GSM/HSPA+) (with Google Wallet), 4.3 (JWR66Y)
"yakju" for Galaxy Nexus "maguro" (GSM/HSPA+), 4.3 (JWR66Y)
"mysid" for Galaxy Nexus "toro" (Verizon CDMA/LTE), 4.2.2 (JDQ39)
"mysidspr" for Galaxy Nexus "toroplus" (Sprint CDMA/LTE), 4.2.1 (GA02)
"soju" for Nexus S (worldwide version, i9020t and i9023), 4.1.2 (JZO54K)
"sojua" for Nexus S (850MHz version, i9020a), 4.1.2 (JZO54K)
"sojuk" for Nexus S (Korea version, m200), 4.1.1 (JRO03E)
"sojus" for Nexus S 4G (d720), 4.1.1 (JRO03R)

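Since the question starts from a list of codenames, a variation that keys the result by each heading's id instead of its text might be handy. The sketch below wraps the same traversal in a function (the name `latest_builds` is hypothetical) and runs it on a made-up fragment instead of the live page; that the h2 ids are the bare codenames is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the assumed page layout.
SAMPLE = """
<h2 id="instructions">Flashing instructions</h2>
<h2 id="ryu">"ryu" for Pixel C</h2>
<table>
  <tr><th>Version</th><th>Download</th></tr>
  <tr id="ryumxb48j"><td>6.0.1 (MXB48J)</td><td>Link</td></tr>
  <tr id="ryumxb48k"><td>6.0.1 (MXB48K)</td><td>Link</td></tr>
</table>
<h2 id="tungsten">"tungsten" for Nexus Q</h2>
<table>
  <tr><th>Version</th><th>Download</th></tr>
  <tr id="tungstenian67k"><td>4.0.4 (IAN67K)</td><td>Link</td></tr>
</table>
"""


def latest_builds(html):
    """Map each device heading's id to the first cell of its table's last id-bearing row."""
    soup = BeautifulSoup(html, "html.parser")
    latest = {}
    # As in the answers above, the first h2 with an id is not a device.
    for h2 in soup.find_all("h2", id=True)[1:]:
        table = h2.find_next("table")
        rows = table.find_all("tr", id=True) if table else []
        if rows:  # skip headings without any build rows
            latest[h2["id"]] = rows[-1].find("td").get_text(strip=True)
    return latest


builds = latest_builds(SAMPLE)
print(builds)
```

For the live page, the same function could presumably be called with the text of the requests response; the resulting dictionary is small enough to pickle and compare between runs.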