Extracting text from an href on a website using Python and BeautifulSoup

Question

I am trying to scrape data from a website, and I need the text of the titles.

[<a href="http://www.thegolfcourses.net/golfcourses/TX/38468.htm" rel="bookmark">Feather Bay  Golf  Course and Resort</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/AZ/174830.htm" rel="bookmark">Paradise Valley Country Club</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/IL/129935.htm" rel="bookmark">The Golf Club at Waters Edge</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/NY/10630.htm" rel="bookmark">1000 Acres Ranch Resort</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/VA/995731.htm" rel="bookmark">1757 Golf Club, 1757 Golf Club Front 9 Golf Course</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/WI/320815.htm" rel="bookmark">27 Pines Golf Course</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/WY/823145.htm" rel="bookmark">3 Creek Ranch Golf Club</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/CA/18431.htm" rel="bookmark">3 Par At Four Points</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/AZ/470720.htm" rel="bookmark">3 Parks Fairways</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/IA/074920.htm" rel="bookmark">3-30 Golf &amp; Country Club</a>]

I am using the code below to get at them, but I am having a hard time pulling the text out of the links. Any good ideas on how to do this?

import csv
import requests
from bs4 import BeautifulSoup

courses_list = []

for i in range(1):
    url = "http://www.thegolfcourses.net/page/{}?ls&location=California&orderby=title&radius=6750#038;location=California&orderby=title&radius=6750".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

g_data2 = soup.find_all("article")

for item in g_data2:
    try:
        # find_all returns a list of <a> tags, which is why whole tags get printed
        name = item.contents[5].find_all("a")
        print(name)
    except:
        name = ''

Answer 1

Use the string attribute:

name= item.contents[5].find_all("a")[0].string

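Remember that find_all returns a list (a ResultSet object), so if you know there is only one match, just take index 0 of that list.

Alternatively, if you know that only one result is of interest, you can use find instead: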

name= item.contents[5].find("a").string
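As a minimal sketch of how the two calls relate, the snippet below parses two of the links from the question from an inline string instead of fetching the live page. The `html` sample, the `<article>` wrapping, and the variable names are assumptions made for illustration; `html.parser` is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Sample markup built from the question's output (an assumed stand-in for the live page).
html = """
<article>
  <a href="http://www.thegolfcourses.net/golfcourses/TX/38468.htm" rel="bookmark">Feather Bay  Golf  Course and Resort</a>
</article>
<article>
  <a href="http://www.thegolfcourses.net/golfcourses/IA/074920.htm" rel="bookmark">3-30 Golf &amp; Country Club</a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for article in soup.find_all("article"):
    links = article.find_all("a")   # ResultSet (list-like) of Tag objects
    first = article.find("a")       # first matching Tag, or None if nothing matches
    # Both spellings reach the same Tag, so they yield the same text.
    assert links[0].string == first.string
    # .string is a NavigableString; str() turns it into a plain string.
    names.append(str(first.string))

print(names)
```

Keep in mind that find returns None when nothing matches, so calling `.string` on its result raises an AttributeError in that case.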

Answer 2

If I understand correctly, this might work. See the related question: Is there an InnerText equivalent in BeautifulSoup / Python?

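Basically, try the ".text" attribute. It has to be read from a single tag, not from the list that find_all returns, so use find (or index into the list):

name = item.contents[5].find("a").text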
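The difference between `.string` and `.text` only shows up when a link contains nested markup. The sketch below uses made-up anchors (not taken from the site) to illustrate that, and also checks that the list returned by find_all has no `.text` of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor wraps part of its text in <b>.
html = '<p><a href="#1">Plain name</a> <a href="#2">Bold <b>name</b></a></p>'
soup = BeautifulSoup(html, "html.parser")

anchors = soup.find_all("a")

# .string is None when a tag has more than one child; .text joins all nested text.
strings = [a.string for a in anchors]
texts = [a.text for a in anchors]

# find_all returns a ResultSet; asking it for .text raises AttributeError.
try:
    anchors.text
    resultset_has_text = True
except AttributeError:
    resultset_has_text = False

print(strings)
print(texts)
print(resultset_has_text)
```

For the plain links in the question either attribute gives the same result; `.text` (or the equivalent `get_text()` method) is the safer choice if the markup might contain nested tags.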

Edit: sorry, I could not get the formatting right. Try this; it is ugly, but it works:

x = "<a> text </a>"
y = x.split(">")[1]
z = y.split("<")[0]
print(z)
 text
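Manual splitting stops working once the anchor contains nested tags, and it leaves entities such as `&amp;` undecoded. Likewise, `contents[5]` depends on the exact layout of each article. One possible alternative, sketched here under the assumption that every title link carries `rel="bookmark"` as in the question's output, selects the anchors by that attribute. The sample markup and the `course_titles` helper are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup

def course_titles(html):
    """Return (title, url) pairs for every <a rel="bookmark"> in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.find_all("a", rel="bookmark"):
        # get_text(strip=True) strips whitespace around each text fragment.
        pairs.append((a.get_text(strip=True), a.get("href")))
    return pairs

# Sample markup assumed for illustration; the real page layout may differ.
sample = """
<article>
  <h2><a href="http://www.thegolfcourses.net/golfcourses/CA/18431.htm" rel="bookmark">3 Par At Four Points</a></h2>
  <a href="http://www.thegolfcourses.net/">Home</a>
</article>
"""

print(course_titles(sample))
```

Filtering on the attribute skips unrelated links (the "Home" anchor here) without relying on a fixed child position.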


Answer 1: score 2, accepted, 2015-06-29 22:44:37

Answer 2: score 0, 2015-06-29 22:29:38
