
Scrape only selected text from tables using Python/Beautiful Soup/pandas

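I am new to Python and am using Beautiful Soup to do web scraping for a project.

I want to get only part of the text into a list/dictionary. I started with the following code: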

import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')

This let me parse the data into tables, and one of the items in the tables looks like this:

<table border="0" cellpadding="0" cellspacing="0" width="235">
<tr>
<td align="center" height="238"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><img alt="LL IN ONE SNAIL REPAIR CREAM, SNAIL REPAIR BLEMISH BALM, WATERMAX MOISTURE B.B CREAM, WATERMAX AQUA GEL CREAM, CORRECT COMBO CREAM, GOLD STARFISH ALL IN ONE CREAM, S-VENOM WRINKLE TOX CREAM, BLACK SNAIL ALL IN ONE CREAM, APPLE SMOOTHIE PEELING GEL, REAL SOYBEAN DEEP CLEANSING OIL, COLLAGEN POWER LIFTING CREAM, SNAIL RECOVERY GEL CREAM" border="0" src="http://www.mizon.co.kr/images/upload/product/20150428113514_3.jpg" width="240"/></a></td>
</tr>
<tr>
<td align="center" height="43" valign="middle"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><span class="style3">ENJOY VITAL-UP TIME Lift Up Mask <br/>
                         Volume:25ml</span></a></td>
</tr>
</table>

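For each such item, I want to extract only the following from the last data cell of the table above:

1) The four-digit number in href = javascript:fnMoveDetail(7499)

2) The name under class: style3

3) The volume under class: style3

The next lines in my code are as follows: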

df = pd.read_html(str(tables), skiprows={0}, flavor="bs4")[0]
a_links = soup.find_all('a', attrs={'class':'style3'})
stnid_dict = {}
for a_link in a_links:
    cid = ((a_link['href'].split("javascript:fnMoveDetail("))[1].split(")")[0])
    stnid_dict[a_link.text] = cid

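My goal is to use these numbers to go to the individual links and then match the information scraped on this page with each link. What is the best way to approach this?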

Use the a tags whose href contains the javascript call as the anchor: find all span tags inside them, then get each span's parent a tag.

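import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Select every span that sits directly inside a product link within a table cell.
spans = soup.select('td > a[href*="javascript:fnMoveDetail"] > span')
for span in spans:
    # strip() removes a set of characters from both ends, which leaves only the digits here.
    href = span.find_parent('a').get('href').strip('javascript:fnMoveDetail()')
    name, volume = span.get_text(strip=True).split('Volume:')
    print(name, volume, href)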

Output:

Dust Clean up Peeling Toner 150ml 8235
Collagen Power Lifting EX Toner 150ml 8067
Collagen Power Lifting EX Emulsion 150ml 8068
Barrier Oil Toner 150ml 8059
Barrier Oil Emulsion 150ml 8060
BLACK CLEAN UP PORE WATER FINISHER 150ml 7650
Vita Lemon Sparkling Toner 150ml 7356
INTENSIVE SKIN BARRIER TONER 150ml 7110
INTENSIVE SKIN BARRIER EMULSION 150ml 7111
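
For the follow-up goal of matching these rows against data from the individual product pages, it may help to turn each link into a small record keyed by its id. The following is only a minimal sketch using the standard library: `parse_product` and `merge_by_id` are hypothetical helper names that do not appear in the question or the answer, the regular expression is an alternative to the `strip()` call, and the detail record is made-up placeholder data. How the detail pages are actually fetched depends on what the site's `fnMoveDetail` function does, which is not shown here.

```python
import re

# Matches the numeric argument of the site's fnMoveDetail(...) JavaScript call.
ID_PATTERN = re.compile(r"fnMoveDetail\((\d+)\)")


def parse_product(href, text):
    """Return {'id', 'name', 'volume'} from an anchor href and its span text.

    Hypothetical helper, not part of the original answer.
    """
    match = ID_PATTERN.search(href)
    product_id = match.group(1) if match else None
    # partition() never raises, even when "Volume:" is absent from the text.
    name, _, volume = text.partition("Volume:")
    return {
        "id": product_id,
        "name": " ".join(name.split()),  # collapse newlines and runs of spaces
        "volume": volume.strip() or None,
    }


def merge_by_id(listing_rows, detail_rows):
    """Attach detail-page fields to listing rows that share the same id."""
    details = {row["id"]: row for row in detail_rows}
    return [{**row, **details.get(row["id"], {})} for row in listing_rows]


# Values copied from the sample table in the question.
row = parse_product(
    "javascript:fnMoveDetail(7499)",
    "ENJOY VITAL-UP TIME Lift Up Mask \n                         Volume:25ml",
)
print(row)

# The detail record below is made-up placeholder data for illustration only.
merged = merge_by_id([row], [{"id": "7499", "description": "placeholder"}])
print(merged)
```

Inside the loop above, calling `parse_product(span.find_parent('a').get('href'), span.get_text(' ', strip=True))` would yield one such record per product. Unlike the two-way `split('Volume:')`, this variant does not fail on a product whose text lacks a volume.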
