Iterating over non-href links in XML with BeautifulSoup in Python and retrieving specific information

Question

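I'm a Python beginner and have just started learning to scrape websites with BeautifulSoup.

I'm trying to extract contact information (address, company name) from all of the individual links on this website.

In general, I know how to retrieve a list of hrefs from typical HTML source, but since this is XML, I have only been able to isolate the links in this form:

[u'http://www.agenzia-interinale.it/milano']

So far my code gives me all of the company links in that format, but I don't know how to go through each link and extract the relevant information.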

from bs4 import BeautifulSoup
import requests
import re

resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'

html = requests.get(url1).text
bs = BeautifulSoup(html)
# find the links to companies
company_menu = bs.find_all('loc')
for company in company_menu:
    print company.contents

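From that list of links, I first need to determine whether each page has contact information and, if it does (as in this example), extract the address and company name.

I believe the final information I'm looking for can be isolated with this div filter: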

bs.find_all("div",{'style':'vertical-align:middle;'})

I have tried nested loops, but I couldn't get them to work.

Any input is really appreciated!

Answer 1

There is no need to use BeautifulSoup. The site returns perfectly valid XML, which can be parsed with the tools that ship with Python:

import requests
import xml.etree.ElementTree as et

req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for url_entry in root:
    # each child of the root is a <url> entry; its first child is <loc>
    print(url_entry[0].text)
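
The loop above relies on <loc> being the first child of every entry. A minimal sketch of a lookup that does not depend on child order, assuming the file uses the standard sitemap namespace (not verified against the live site); `extract_locs` and `SAMPLE` are hypothetical names, and the sample XML is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed to apply to this site too).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_bytes):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    # './/sm:loc' finds <loc> at any depth, so it does not depend on
    # <loc> being the first child of each <url> entry.
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Tiny made-up sitemap for illustration; the second URL is hypothetical.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><lastmod>2013-12-01</lastmod><loc>http://www.agenzia-interinale.it/milano</loc></url>
  <url><loc> http://www.agenzia-interinale.it/roma </loc></url>
</urlset>"""

for link in extract_locs(SAMPLE):
    print(link)
```

With the real file, the bytes in `req.content` could be passed to `extract_locs` in place of `SAMPLE`.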

Answer 2

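As I understand your requirement, you want to get the URLs from the XML, but you are searching for the CSS-styled tags that format the page, so that is the wrong approach.

Try this: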

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2 
from BeautifulSoup import BeautifulSoup

url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'

f = urllib2.urlopen(url1)

bs = BeautifulSoup(f)

for url in bs.findAll("loc"):
    print url.string

Note that I'm using the findAll() method and looking for the "loc" tag, which contains the data you want to retrieve.
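
Collecting the "loc" values covers only the first half of the question; each page still has to be checked for a contact block. A minimal sketch of that check, assuming the div filter from the question really marks the contact details (not verified against the live site). It swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages; `StyledDivText`, `contact_blocks`, and the sample fragments are hypothetical:

```python
from html.parser import HTMLParser

# Style value taken from the question's find_all() filter.
CONTACT_STYLE = "vertical-align:middle;"


class StyledDivText(HTMLParser):
    """Collect the text of every <div> whose style attribute equals `style`."""

    def __init__(self, style):
        super().__init__()
        self.style = style
        self.depth = 0    # nesting level inside a matching div (0 = outside)
        self.blocks = []  # one text string per matching div
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a match
        elif dict(attrs).get("style") == self.style:
            self.depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append(" ".join(self._parts))

    def handle_data(self, data):
        if self.depth and data.strip():
            self._parts.append(data.strip())


def contact_blocks(html):
    """Return the text of each matching div; an empty list means no contact info."""
    parser = StyledDivText(CONTACT_STYLE)
    parser.feed(html)
    parser.close()
    return parser.blocks


# Made-up page fragments for illustration only.
with_contact = '<div style="vertical-align:middle;"><b>Example Srl</b> Via Roma 1, Milano</div>'
without_contact = "<div><p>No contact details here</p></div>"

print(contact_blocks(with_contact))
print(contact_blocks(without_contact))
```

For real pages, the HTML of each URL (for example from `requests.get(url).text`) would be passed to `contact_blocks`, and pages that return an empty list skipped. With bs4 installed, the question's own `bs.find_all("div", {'style': 'vertical-align:middle;'})` call would play the same role as the parser class here.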

2 answers

Answer 1
2, accepted, 2013-12-18 22:56:59

Answer 2
2, 2013-12-18 23:30:24
