
Scrape only selected text from tables using Python/Beautiful Soup/pandas

I am new to Python and am using Beautiful Soup for web scraping for a project.

I am hoping to collect only parts of the text into a list/dictionary. I started with the following code:

import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')

This helped me parse the data into tables, and ONE of the items from the table looked as below:

<table border="0" cellpadding="0" cellspacing="0" width="235">
<tr>
<td align="center" height="238"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><img alt="LL IN ONE SNAIL REPAIR CREAM, SNAIL REPAIR BLEMISH BALM, WATERMAX MOISTURE B.B CREAM, WATERMAX AQUA GEL CREAM, CORRECT COMBO CREAM, GOLD STARFISH ALL IN ONE CREAM, S-VENOM WRINKLE TOX CREAM, BLACK SNAIL ALL IN ONE CREAM, APPLE SMOOTHIE PEELING GEL, REAL SOYBEAN DEEP CLEANSING OIL, COLLAGEN POWER LIFTING CREAM, SNAIL RECOVERY GEL CREAM" border="0" src="http://www.mizon.co.kr/images/upload/product/20150428113514_3.jpg" width="240"/></a></td>
</tr>
<tr>
<td align="center" height="43" valign="middle"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><span class="style3">ENJOY VITAL-UP TIME Lift Up Mask <br/>
                         Volume:25ml</span></a></td>
</tr>
</table>

For each such item in the table, I would like to extract only the following from the last data cell in the table above:

1) The four-digit number in href="javascript:fnMoveDetail(7499)" (a minimal extraction sketch follows this list)

2) The name inside the span with class style3

3) The volume inside the span with class style3
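
For item 1, the id can be pulled out of such an href with a regular expression. Below is a minimal sketch, assuming the sample href string shown in the markup above:

import re

href = "javascript:fnMoveDetail(7499)"  # sample value from the markup above

# Capture the digits inside fnMoveDetail(...)
match = re.search(r"fnMoveDetail\((\d+)\)", href)
if match:
    product_id = match.group(1)  # '7499'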

The next lines in my code were as follows:

import pandas as pd

df = pd.read_html(str(tables), skiprows={0}, flavor="bs4")[0]

# Map each product name to the id embedded in its javascript href
a_links = soup.find_all('a', attrs={'class': 'style3'})
stnid_dict = {}
for a_link in a_links:
    cid = a_link['href'].split("javascript:fnMoveDetail(")[1].split(")")[0]
    stnid_dict[a_link.text] = cid

My objective is to use these numbers to go to the individual product links and then match the info scraped on this page to each link. What would be the best way to approach this?

Use the a tag which contains the javascript href as an anchor: select every span inside such a tag, then get each span's parent a tag to read the href.

import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Each product cell is an <a href="javascript:fnMoveDetail(id)"> wrapping a <span class="style3">
spans = soup.select('td > a[href*="javascript:fnMoveDetail"] > span')
for span in spans:
    # strip() removes the surrounding "javascript:fnMoveDetail(" and ")" characters,
    # leaving only the numeric id
    href = span.find_parent('a').get('href').strip('javascript:fnMoveDetail()')
    name, volume = span.get_text(strip=True).split('Volume:')
    print(name, volume, href)

Out:

Dust Clean up Peeling Toner 150ml 8235
Collagen Power Lifting EX Toner 150ml 8067
Collagen Power Lifting EX Emulsion 150ml 8068
Barrier Oil Toner 150ml 8059
Barrier Oil Emulsion 150ml 8060
BLACK CLEAN UP PORE WATER FINISHER 150ml 7650
Vita Lemon Sparkling Toner 150ml 7356
INTENSIVE SKIN BARRIER TONER 150ml 7110
INTENSIVE SKIN BARRIER EMULSION 150ml 7111
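
To cover the follow-up goal of visiting each product's own page, the ids collected above can be turned into detail-page requests. The exact URL pattern depends on what fnMoveDetail() does in the site's JavaScript; the sketch below assumes a hypothetical productview.asp?cno=<id> endpoint, which you would need to confirm by inspecting the site.

import requests
from bs4 import BeautifulSoup

# Hypothetical detail-page pattern -- inspect the site's fnMoveDetail()
# JavaScript to confirm the real path and parameter name.
DETAIL_URL = "http://eng.mizon.co.kr/productview.asp?cno={}"

url = "http://eng.mizon.co.kr/productlist.asp"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

products = {}  # id -> (name, volume)
for span in soup.select('td > a[href*="javascript:fnMoveDetail"] > span'):
    href = span.find_parent('a').get('href')
    product_id = href.split('(')[1].rstrip(')')
    name, volume = span.get_text(strip=True).split('Volume:')
    products[product_id] = (name.strip(), volume.strip())

for product_id, (name, volume) in products.items():
    detail_page = requests.get(DETAIL_URL.format(product_id))
    detail_soup = BeautifulSoup(detail_page.text, 'html.parser')
    # ...scrape whatever extra fields you need from detail_soup here,
    # then merge them with name/volume keyed by product_id.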
