I wrote some code to search HTML, but the result was not what I wanted. From the HTML below, I would like to pull the page addresses that contain the word "sayfa". Examples:
http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2
http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa3
I don't know how to do it. This is the HTML:
<table cellpadding="0" cellspacing="0" border="0" width="100%" style="margin-bottom:3px">
<tr valign="bottom">
<td class="smallfont"><a href="http://www.vbulletin.com.tr/newthread.php?do=newthread&f=16" rel="nofollow"><img src="http://www.vbulletin.com.tr/images/fsimg/butonlar/newthread.gif" alt="Yeni Konu Oluştur" border="0" /></a></td>
<td align="right"><div class="pagenav" align="right">
<table class="tborder" cellpadding="3" cellspacing="1" border="0">
<tr>
<td class="vbmenu_control" style="font-weight:normal">Sayfa 1 Toplam 5 Sayfadan</td>
<td class="alt2"><span class="smallfont" title="Toplam 100 sonuçtan 1 ile 20 arası sonuç gösteriliyor."><strong>1</strong></span></td>
<td class="alt1"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/" title="Toplam 100 sonuçtan 21 ile 40 arası sonuç gösteriliyor.">2</a></td><td class="alt1"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa3/" title="Toplam 100 sonuçtan 41 ile 60 arası sonuç gösteriliyor.">3</a></td>
<td class="alt1"><a rel="next" class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/" title="Sonraki Sayfa - Toplam 100 sonuçtan 21 ile 40 arası sonuç gösteriliyor.">></a></td>
<td class="alt1" nowrap="nowrap"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa5/" title="Sonuncu Sayfa - Toplam 100 sonuçtan 81 ile 100 arası sonuç gösteriliyor.">Son Sayfa <strong>»</strong></a></td>
<td class="vbmenu_control" title="forumdisplay.php?f=16&order=desc"><a name="PageNav"></a></td>
</tr>
</table>
</div></td>
</tr>
</table>
I want to take the 'href' values. This is my code so far:
import urllib2,re
from bs4 import BeautifulSoup
liste=[]
domain="http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
page = urllib2.urlopen(domain).read()
soup = BeautifulSoup(page)
soup.prettify()
for span in soup.findAll('span'):
    print span["href"]
for span in soup.findAll('span'):
    if span.a:
        print span.a["href"]
As a list comprehension:
urls = [span.a["href"] for span in soup.findAll('span') if span.a]
If you print span.a in the loop you will sometimes see None, so you need to check if span.a before using span.a["href"], or you will get a TypeError: 'NoneType' object has no attribute '__getitem__'.
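The same guard can be tried without fetching the page. The sketch below is only an illustration and swaps in a different tool: it uses the standard library's xml.etree.ElementTree on a small, well-formed fragment, where span.find(".//a") stands in for BeautifulSoup's span.a. The fragment is hypothetical, and real forum HTML is usually not valid XML, which is why BeautifulSoup is the better fit for the actual page. It is written for Python 3, where the error message reads 'NoneType' object is not subscriptable.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on a pagination row.
FRAGMENT = """
<tr>
  <td><span class="smallfont"><strong>1</strong></span></td>
  <td><span class="smallfont"><a href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/">2</a></span></td>
</tr>
"""

root = ET.fromstring(FRAGMENT)
hrefs = []
spans_without_link = 0

for span in root.iter("span"):
    link = span.find(".//a")  # None when the span contains no <a>
    if link is None:          # test against None explicitly
        spans_without_link += 1
        continue
    hrefs.append(link.get("href"))

# Subscripting None is what produces the TypeError described above.
missing = None
try:
    missing["href"]
except TypeError as exc:
    error_message = str(exc)

print(hrefs)
print(spans_without_link)
```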
You could use a set comprehension, as there are duplicate URLs:
urls = {span.a["href"] for span in soup.findAll('span') if span.a}
Then search for any url you need:
for url in sorted(urls):
    if "sayfa" in url:
        print url
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
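One caveat with sorted(urls): it compares strings, so a page such as sayfa10 would sort before sayfa2. A minimal Python 3 sketch of sorting by page number instead follows. The helper page_number, the regular expression, and the sayfa10 URL are assumptions added for illustration and do not come from the page itself.

```python
import re

BASE = "http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
# Sample set; sayfa10 is invented to show the ordering problem.
urls = {BASE, BASE + "sayfa2/", BASE + "sayfa3/", BASE + "sayfa10/"}

PAGE_RE = re.compile(r"/sayfa(\d+)/?$")

def page_number(url):
    """Return the N of a .../sayfaN/ URL as an int, or None if absent."""
    match = PAGE_RE.search(url)
    return int(match.group(1)) if match else None

# Keep only page URLs, then order them numerically rather than as text.
page_urls = sorted(
    (u for u in urls if page_number(u) is not None),
    key=page_number,
)
for url in page_urls:
    print(url)
```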
In [26]: import urllib2
In [27]: from bs4 import BeautifulSoup
In [28]: domain="http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
In [29]: page = urllib2.urlopen(domain).read()
In [30]: soup = BeautifulSoup(page)
In [31]: urls = {span.a["href"] for span in soup.findAll('span') if span.a}
In [32]: for url in sorted(urls):
   ....:     if "sayfa" in url:
   ....:         print url
   ....:
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
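The session above is Python 2 (urllib2 and the print statement). For readers on Python 3, here is a rough standard-library-only sketch of the same idea: collect every href, de-duplicate with a set, and keep the ones containing "sayfa". It parses an inline sample instead of fetching the page, and HrefCollector and SAMPLE_HTML are names made up for this example. Unlike the code above, it looks at every a tag rather than only those inside a span.

```python
from html.parser import HTMLParser

# Inline sample adapted from the pagination markup in the question.
SAMPLE_HTML = """
<td class="alt1"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/">2</a></td>
<td class="alt1"><a class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa3/">3</a></td>
<td class="alt1"><a rel="next" class="smallfont" href="http://www.vbulletin.com.tr/vbulletin-temel-bilgiler/sayfa2/">&gt;</a></td>
<td class="vbmenu_control"><a name="PageNav"></a></td>
"""

class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag into a set (duplicates collapse)."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # None for anchors such as <a name=...>
        if href:
            self.hrefs.add(href)

collector = HrefCollector()
collector.feed(SAMPLE_HTML)

page_urls = sorted(url for url in collector.hrefs if "sayfa" in url)
for url in page_urls:
    print(url)
```

To run this against a live page on Python 3, urllib.request.urlopen is the counterpart of urllib2.urlopen, and the decoded response text can be passed to feed().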
Try this:
from BeautifulSoup import BeautifulSoup
import requests
domain = "http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
page = requests.get(domain)
result = BeautifulSoup(page.text)
anc = result.findAll("span")
for values in range(len(anc)):
    anchor = anc[values].findAll('a')
    for i in anchor:
        if "javascript" not in i.get('href') and "sayfa" in i.get('href'):
            print i.get('href')
This will fetch you the href links.
Output:
http://www.forumsokagi.com/forum.php
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
etc...
This assumes that you want URLs that contain the word "sayfa".
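One thing to watch in the loop above: i.get('href') returns None for an anchor that has no href (for example a named anchor like the PageNav one in the question's HTML), and testing "javascript" not in None raises a TypeError. A small sketch of a filter that tolerates this follows. is_page_link is a hypothetical helper, not part of BeautifulSoup, and the sample values are invented.

```python
def is_page_link(href):
    """True for a non-empty href that contains 'sayfa' and is not a javascript link."""
    if not href:  # covers None and the empty string
        return False
    return "javascript" not in href and "sayfa" in href

# Invented sample values standing in for i.get('href') results.
candidates = [
    None,
    "javascript:void(0)",
    "http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/",
]

page_links = [href for href in candidates if is_page_link(href)]
print(page_links)
```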
You can also use lxml to do it.
import urllib2
import lxml.html
domain="http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"
data=urllib2.urlopen(domain).read()
tree = lxml.html.fromstring(data)
for i in tree.xpath('//a/@href'):
    if "sayfa" in i:
        print i
Output:
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa3/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa4/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa2/
http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/sayfa7/
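The XPath returns every matching href in document order, so a link that appears more than once on the page is printed more than once. If the duplicates are unwanted but the order should be kept, one option is dict.fromkeys. This is a sketch assuming Python 3.7 or later, where dicts keep insertion order. The list below copies the first five lines of the output above.

```python
BASE = "http://www.forumsokagi.com/peygamber-ve-evliyalarin-hayatlari/"

# First five hrefs from the output above, in document order.
hrefs = [BASE + page for page in ("sayfa2/", "sayfa3/", "sayfa4/", "sayfa2/", "sayfa7/")]

# dict keys are unique and keep first-seen order, so this drops repeats only.
unique_hrefs = list(dict.fromkeys(hrefs))
for href in unique_hrefs:
    print(href)
```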