
HTML Table Specific Row Scraping

I want to scrape data from specific rows of this table: only the orange/gold rows. Previously, I used the following code, provided by SIM, to scrape the whole table and then manipulated the result afterwards:

from selenium.webdriver import Chrome
from contextlib import closing
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

URL = "https://www.n2yo.com/passes/?s=39090&a=1"

chrome_options = Options()  
chrome_options.add_argument("--headless")

with closing(Chrome(options=chrome_options)) as driver:
    driver.get(URL)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("#passestable tr"):
        data = [item.text for item in items.select("th,td")]
        print(data)

I'm unsure how to alter this code to obtain only the orange/gold rows. I tried searching for the colour code as a tag when parsing, but it didn't work. Any and all suggestions are appreciated.

Thank you for your time.

You can use a regex to match the rows' bgcolor attribute:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import re

d = webdriver.Chrome()
d.get("https://www.n2yo.com/passes/?s=39090&a=1")
s = soup(d.page_source, 'lxml')
d.quit()
# Match rows by their bgcolor attribute; drop #FFFFFF from the pattern to skip the white rows
data = [i.text for i in s.find_all('tr', {'bgcolor':re.compile('#FFFFFF|#FFFF33|#FFCC00')})]

Output:

[u'16-Mar 20:34N12\xb020:42W265\xb079\xb020:48SSW199\xb0-Map and details', u'17-Mar 07:51S178\xb007:58W260\xb052\xb008:05NNW341\xb0-Map and details', u'17-Mar 20:00NNE19\xb020:08E102\xb050\xb020:14S180\xb0-Map and details', u'18-Mar 07:17SSE160\xb007:24E83\xb077\xb007:31N349\xb0-Map and details', u'18-Mar 08:58SW217\xb009:04W269\xb013\xb009:09NW323\xb0-Map and details', u'18-Mar 21:06N6\xb021:13WNW295\xb041\xb021:19SW217\xb0-Map and details', u'19-Mar 06:43SE142\xb006:50ENE67\xb038\xb006:57N356\xb0-Map and details', u'19-Mar 08:23SSW196\xb008:30W268\xb027\xb008:36NNW333\xb0-Map and details', u'19-Mar 20:32N12\xb020:39WNW286\xb084\xb020:46SSW198\xb0-Map and details', u'20-Mar 07:48S177\xb007:55WSW254\xb055\xb008:02NNW342\xb0-Map and details', u'20-Mar 19:58NNE20\xb020:05E98\xb047\xb020:12S178\xb0-Map and details', u'21-Mar 07:14SSE159\xb007:22NE58\xb072\xb007:28N349\xb0-Map and details', u'21-Mar 08:55SW216\xb009:01W272\xb014\xb009:07NW325\xb0-Map and details', u'21-Mar 21:03N6\xb021:10WNW288\xb043\xb021:17SW215\xb0-Map and details', u'22-Mar 06:41SE141\xb006:48ENE70\xb036\xb006:54N356\xb0-Map and details', u'22-Mar 08:20S194\xb008:27W265\xb029\xb008:34NNW335\xb0-Map and details', u'22-Mar 20:29N13\xb020:36N348\xb086\xb020:43SSW196\xb0-Map and details', u'23-Mar 07:46S176\xb007:53W265\xb059\xb008:00NNW343\xb0-Map and details', u'23-Mar 19:55NNE20\xb020:02E94\xb045\xb020:09S177\xb0-Map and details', u'24-Mar 07:12SSE157\xb007:19ENE71\xb069\xb007:26N350\xb0-Map and details', u'24-Mar 08:53SW214\xb008:59W270\xb015\xb009:04NW325\xb0-Map and details', u'24-Mar 21:01N7\xb021:08WNW292\xb046\xb021:14SW214\xb0-Map and details', u'25-Mar 06:38SE139\xb006:45ENE65\xb034\xb006:52N357\xb0-Map and details', u'25-Mar 08:18S193\xb008:24W263\xb030\xb008:31NNW335\xb0-Map and details', u'25-Mar 18:49NE39\xb018:54E87\xb010\xb018:59SE134\xb0-Map and details', u'25-Mar 20:27N13\xb020:34SSE161\xb086\xb020:41S195\xb0-Map and details']
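Since `.text` on a whole row runs all the cells together, it may be easier to collect each row as a list of cell values. A minimal sketch on a small inline table follows; the sample markup and the names `SAMPLE_HTML`, `highlighted`, and `cell_rows` are made up for illustration, and the real page has more columns per row.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<table id="passestable">
  <tr bgcolor="#FFFFFF"><td>16-Mar 20:34</td><td>N</td><td>12</td></tr>
  <tr bgcolor="#FFCC00"><td>17-Mar 07:51</td><td>S</td><td>178</td></tr>
  <tr bgcolor="#FFFF33"><td>17-Mar 20:00</td><td>NNE</td><td>19</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Anchored, case-insensitive pattern: only the two highlight colours match.
highlighted = re.compile(r"^#(FFCC00|FFFF33)$", re.IGNORECASE)

# One list of cell strings per matching row.
cell_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr", bgcolor=highlighted)
]

for row in cell_rows:
    print(row)
```

With the live page, `SAMPLE_HTML` would be replaced by the driver's `page_source`.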

Try replacing this line:

for items in soup.select("#passestable tr"):

with this one:

for items in soup.select("#passestable tr[bgcolor='#FFCC00'], #passestable tr[bgcolor='#FFFF33']"):

This iterates over only the tr nodes with the required colors.

Note that, depending on the BeautifulSoup version, this may return all the orange rows first and only then all the gold rows, rather than in page order.
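If the original page order of the rows matters, one option is to select every row and filter on the attribute in Python, which keeps the rows in the order they appear regardless of how the selector results are grouped. This is only a sketch on made-up markup; `SAMPLE_HTML`, `WANTED_COLORS`, and `ordered` are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (assumption): a gold row appears BEFORE an orange one.
SAMPLE_HTML = """
<table id="passestable">
  <tr bgcolor="#FFFF33"><td>17-Mar 20:00</td><td>NNE</td></tr>
  <tr bgcolor="#FFFFFF"><td>18-Mar 07:17</td><td>SSE</td></tr>
  <tr bgcolor="#FFCC00"><td>18-Mar 08:58</td><td>SW</td></tr>
</table>
"""

WANTED_COLORS = {"#FFCC00", "#FFFF33"}  # orange and gold

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Walk every row once, in document order, and keep only the wanted colours.
ordered = [
    [cell.get_text(strip=True) for cell in tr.select("th,td")]
    for tr in soup.select("#passestable tr")
    if (tr.get("bgcolor") or "").upper() in WANTED_COLORS
]

for row in ordered:
    print(row)
```

The trade-off is that the filtering happens in Python rather than in the selector, which is negligible for a table of this size.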

Another approach you can try, which does not use Selenium:

from lxml.html import fromstring
import requests

URL = "https://www.n2yo.com/passes/?s=39090&a=1"  # same URL as in the question

r = requests.get(URL)
html = fromstring(r.content.decode('utf-8'))
# only orange and yellow rows
rows = html.xpath('//tr[@bgcolor="#FFFF33" or @bgcolor="#FFCC00"]')
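The XPath above yields row elements, so the cell text still has to be pulled out of each one. A rough sketch of that step on inline sample markup is shown below; the markup and the names `SAMPLE_HTML`, `doc`, `matched`, and `table_data` are assumptions for illustration only.

```python
from lxml.html import fromstring

# Hypothetical sample markup standing in for the downloaded page (assumption).
SAMPLE_HTML = """
<html><body>
<table id="passestable">
  <tr bgcolor="#FFFFFF"><td>16-Mar 20:34</td><td>N</td></tr>
  <tr bgcolor="#FFFF33"><td>17-Mar 20:00</td><td>NNE</td></tr>
  <tr bgcolor="#FFCC00"><td>18-Mar 08:58</td><td>SW</td></tr>
</table>
</body></html>
"""

doc = fromstring(SAMPLE_HTML)

# Same XPath as above: rows whose bgcolor is yellow or orange.
matched = doc.xpath('//tr[@bgcolor="#FFFF33" or @bgcolor="#FFCC00"]')

# For each matching row, collect the stripped text of its direct th/td children.
table_data = [
    [cell.text_content().strip() for cell in row.xpath('./th|./td')]
    for row in matched
]

for row in table_data:
    print(row)
```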
