Link extraction from website

I am trying to extract some data from WebMD, but when I run my code I keep getting "None" as the return value. Any idea what I am doing wrong? I get the same number of results as there are links, but I do not get the links themselves.

import bs4 as bs
import urllib.request
import pandas as pd


source = urllib.request.urlopen('https://messageboards.webmd.com/').read()

soup = bs.BeautifulSoup(source,'lxml')

for url in soup.find_all('div',class_="link"):
    print (url.get('href'))

Your url element is actually a div tag, not an a tag:

>>> x = soup.find_all('div', class_="link")
>>> x[0]
<div class="link"><a href="https://messageboards.webmd.com/family-pregnancy/f/relationships/">Relationships</a></div>

You need to select the child a tag before getting the href attribute:

>>> x[0].a.get('href')
'https://messageboards.webmd.com/family-pregnancy/f/relationships/'
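To see why the original loop printed None, here is a minimal sketch that parses a single div copied from the output above. It assumes the built-in "html.parser" backend instead of lxml so that it runs without extra dependencies; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Markup copied from the x[0] output above; parsed locally, no network access.
html = (
    '<div class="link">'
    '<a href="https://messageboards.webmd.com/family-pregnancy/f/relationships/">'
    'Relationships</a></div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="link")

# Tag.get() returns None when the attribute is missing, and the div has no href.
div_href = div.get("href")

# div.a is the first <a> tag inside the div, and that tag carries the href.
a_href = div.a.get("href")

print(div_href)
print(a_href)
```

The first print shows None because get() falls back to None for a missing attribute instead of raising an error, which is why the original code ran without complaint but produced no links.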

Just modify your for loop as follows:

for url in soup.find_all('div',class_="link"):
    print (url.a.get('href'))

soup.find_all('div',class_="link") returns all div elements with the class link. These div elements wrap the a elements that carry the href attributes, so you need to get the href from the correct element, like so:

for div in soup.find_all('div',class_="link"):
    print (div.a.get('href'))
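One caveat with the loop above: if some div with the class link does not contain an a tag, div.a is None and calling .get on it raises an AttributeError. A possible alternative, sketched below, is to select the anchors directly with a CSS selector. The sample markup is hypothetical (the second and third divs are made up for illustration), and "html.parser" is assumed as the backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one real-looking link, one div without an anchor,
# and one made-up link, to show how missing anchors are handled.
html = """
<div class="link"><a href="https://messageboards.webmd.com/family-pregnancy/f/relationships/">Relationships</a></div>
<div class="link">No anchor here</div>
<div class="link"><a href="https://example.com/hypothetical-board/">Hypothetical board</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.link > a[href]" matches only <a> tags that are direct children of a
# div with class "link" and that actually have an href attribute.
links = [a["href"] for a in soup.select("div.link > a[href]")]

for link in links:
    print(link)
```

Because the selector only matches anchors that exist and have an href, the div without an anchor is skipped silently rather than causing an exception.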
