
Scrape only selected text from tables using Python/Beautiful Soup/pandas

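I am new to Python and am using Beautiful Soup to do web scraping for a project.

I would like to get only part of the text into a list/dictionary. I started with the following code: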

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')

This let me parse the data into tables. One item in the table looks like this:

<table border="0" cellpadding="0" cellspacing="0" width="235">
<tr>
<td align="center" height="238"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><img alt="LL IN ONE SNAIL REPAIR CREAM, SNAIL REPAIR BLEMISH BALM, WATERMAX MOISTURE B.B CREAM, WATERMAX AQUA GEL CREAM, CORRECT COMBO CREAM, GOLD STARFISH ALL IN ONE CREAM, S-VENOM WRINKLE TOX CREAM, BLACK SNAIL ALL IN ONE CREAM, APPLE SMOOTHIE PEELING GEL, REAL SOYBEAN DEEP CLEANSING OIL, COLLAGEN POWER LIFTING CREAM, SNAIL RECOVERY GEL CREAM" border="0" src="http://www.mizon.co.kr/images/upload/product/20150428113514_3.jpg" width="240"/></a></td>
</tr>
<tr>
<td align="center" height="43" valign="middle"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><span class="style3">ENJOY VITAL-UP TIME Lift Up Mask <br/>
                         Volume:25ml</span></a></td>
</tr>
</table>

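For each such item in the table, I would like to extract only the following from the last data cell (td) of the table above:

1) The four-digit number in href = javascript:fnMoveDetail(7499)

2) The name under class style3

3) The volume under class style3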

The next lines of my code are as follows:

df = pd.read_html(str(tables), skiprows={0}, flavor="bs4")[0]
a_links = soup.find_all('a', attrs={'class':'style3'})
stnid_dict = {}
for a_link in a_links:
    cid = ((a_link['href'].split("javascript:fnMoveDetail("))[1].split(")")[0])
    stnid_dict[a_link.text] = cid

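My goal is to use these numbers to go to the individual links and then match the information scraped from this page with each link. What would be the best way to approach this?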

Select every span that sits inside an anchor (a) tag whose href contains the javascript call, then get each span's parent a tag to read the href.

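import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Spans that are direct children of an <a> whose href calls fnMoveDetail
spans = soup.select('td > a[href*="javascript:fnMoveDetail"] > span')
for span in spans:
    # strip() removes the listed characters from both ends, leaving only the digits
    href = span.find_parent('a').get('href').strip('javascript:fnMoveDetail()')
    # get_text(strip=True) joins the name and "Volume:..." with no separator
    name, volume = span.get_text(strip=True).split('Volume:')
    print(name, volume, href)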

Output:

Dust Clean up Peeling Toner 150ml 8235
Collagen Power Lifting EX Toner 150ml 8067
Collagen Power Lifting EX Emulsion 150ml 8068
Barrier Oil Toner 150ml 8059
Barrier Oil Emulsion 150ml 8060
BLACK CLEAN UP PORE WATER FINISHER 150ml 7650
Vita Lemon Sparkling Toner 150ml 7356
INTENSIVE SKIN BARRIER TONER 150ml 7110
INTENSIVE SKIN BARRIER EMULSION 150ml 7111
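
Since the goal in the question is to match these rows against the individual product pages later, it may help to store each row under its id instead of printing it. The following is only a sketch under a few assumptions: `parse_product`, `ID_PATTERN`, and `products` are hypothetical names that do not appear in the original post, and the sample strings stand in for the `href` and `span.get_text(strip=True)` values from the loop above (the first pair comes from the table item in the question, the other two are illustrative). It uses a regular expression for the id, because `str.strip` removes a set of characters rather than a prefix, and `str.partition` so that a span without "Volume:" does not raise.

```python
import re

# Matches the numeric id inside e.g. "javascript:fnMoveDetail(7499)"
ID_PATTERN = re.compile(r"fnMoveDetail\((\d+)\)")


def parse_product(href, text):
    """Return (product_id, record) for one link, or None if the href has no id.

    `href` is the anchor's href attribute; `text` is the span text with the
    name and "Volume:..." joined together, as get_text(strip=True) returns it.
    """
    match = ID_PATTERN.search(href or "")
    if match is None:
        return None
    # partition() never raises: volume is "" when "Volume:" is absent
    name, _, volume = text.partition("Volume:")
    return match.group(1), {"name": name.strip(), "volume": volume.strip()}


# Stand-in values; in the real loop these come from each span and its parent <a>
pairs = [
    ("javascript:fnMoveDetail(7499)", "ENJOY VITAL-UP TIME Lift Up MaskVolume:25ml"),
    ("javascript:fnMoveDetail(8235)", "Dust Clean up Peeling TonerVolume:150ml"),
    ("#", "Not a product link"),  # skipped: no fnMoveDetail id
]

# Dictionary keyed by product id, ready to be matched with detail pages later
products = {}
for href, text in pairs:
    parsed = parse_product(href, text)
    if parsed is not None:
        product_id, record = parsed
        products[product_id] = record

print(products["7499"])
```

Keying the dictionary by id means that whatever is scraped from a detail page can be merged into the same record with a single lookup. How the detail page address is built from the id depends on what the site's fnMoveDetail function does, which the post does not show.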

