Problem parsing HTML files (to CSV) with ElementTree XPath in Python

Question

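I am trying to parse thousands of HTML files and dump the variables into a CSV file (an Excel spreadsheet). I have hit several obstacles; the first few were (thankfully) solved here a few days ago. The (hopefully) final obstacle is that I cannot parse the files correctly with XPath. Below are a brief explanation, the Python code, and a sample of the HTML.

The trouble starts here: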

for node in tree.iter():
            name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
            if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
            category=node.text

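It runs, but it does not parse anything. I get no traceback.

I think I am misunderstanding the logic of parsing with ElementTree.

Several of the headers are identical, so it is hard to find a unique ID or header. Here is a sample of the HTML: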

<span class="s1">Business: Give Back to the Community and Save Money 
on Equipment, Technology, Promotional Products, and Market<span 
class="Apple-converted-space">&nbsp;</span></span>

The XPath for it is:

    /html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span

I want to grab the text from this span (and others) and put it in an Excel spreadsheet.

You can see an example of a similar page here.

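Anyway, because many of the spans/headers have no unique identifier, I think I should use XPath. However, I have not been able to work out how to use XPath expressions successfully with ElementTree. The answer to this (and the logic behind it) has eluded me while searching the documentation. I have read http://lxml.de/parsing.html as well as this site, but have not found anything that works.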

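So far the code iterates over all the files (in Dropbox) just fine. It also creates the CSV file and writes the headers (although not in separate columns, only as a single semicolon-separated line, but that should be easy to fix).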
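
A note on the single-column header: one likely explanation, assuming Excel on that machine expects commas as the list separator, is that the writer is configured with `delimiter=';'`. The minimal sketch below uses the standard-library `csv` module under Python 3 and an in-memory buffer instead of `output.csv`; the helper name `header_line` is made up for illustration.

```python
import csv
import io

# Same column names as the header row written in the question's script.
HEADER = ['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle']


def header_line(delimiter):
    """Return the header row exactly as csv.writer emits it with this delimiter."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter).writerow(HEADER)
    return buf.getvalue()


semicolon_line = header_line(';')  # what the script currently produces
comma_line = header_line(',')      # what a comma-expecting spreadsheet splits into columns

print(semicolon_line)
print(comma_line)
```

If the spreadsheet program is set up for semicolons instead, the original delimiter is already the right one and the file only needs to be imported with that separator selected.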

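In short, I want it to parse the text from different lines in each file (web page) and dump it into an Excel file.

Any input would be greatly appreciated.

Python code: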

import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
import lxml.html
# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])

 # redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"

allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"

for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = lxml.html.parse(HTML_PATH+"/"+file)
        print '===================='
        print 'Parsing file: '+file
        print '===================='
        for node in tree.iter():
            name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')

            if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
            print 'Category:'
            category=node.text

f.close()

June 14, 2015 (latest change): I just changed this part

        for node in tree.iter():
            name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')

            if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
            print 'Category:'
            category=node.text

to this:

    for node in tree.iter():
            row = dict.fromkeys(cols)
            Category_name = tree.xpath('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
            row['category'] = Category_name[0].text_content().encode('utf-8')

It still runs, but it does not parse anything.

Answer 1

Try the following code:

from lxml import etree 
import requests
from StringIO import StringIO

data = requests.get('http://www.usprwire.com/Detailed/Banking_Finance_Investment/Confused.com_reveals_that_Life_Insurance_is_more_than_a_form_of_future_protection_284764.shtml').content
parser = etree.HTMLParser()
root = etree.parse(StringIO(data), parser)
category = root.xpath('//table/td/font/text()')
print category[0]

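It uses the requests library to download the page's HTML. You can use whatever method suits your needs. The important part is the XPath, which matches any <font> that is a direct child of a <td> that is itself a direct child of a <table>, and returns a list of two text nodes. The first contains the text; the second is just whitespace.

Running it produces the sentence you are looking for: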
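
To make the path semantics concrete, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree` (Python 3) rather than lxml, so it runs without extra installs. The markup is invented for illustration and is deliberately well-formed, since ElementTree cannot cope with real-world HTML the way `etree.HTMLParser` does. ElementTree's path subset has no `text()` step, so the text is read from `.text` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page (not the real markup).
SAMPLE = """
<html><body>
  <table><td><font>Banking: headline one</font></td></table>
  <table><tr><td><font>nested deeper, not matched</font></td></tr></table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# './/table/td/font': any <table>, then a direct <td> child, then a direct <font> child.
direct = [el.text for el in root.findall('.//table/td/font')]

# './/table//font': also accepts <font> elements at any depth below a <table>.
any_depth = [el.text for el in root.findall('.//table//font')]

print(direct)
print(any_depth)
```

Only the first table contributes to `direct`, because in the second one a `<tr>` sits between `<table>` and `<td>`. A single `/` is a strict parent-child step, which is why a long absolute path breaks as soon as one level differs from the actual file.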

Banking, Finance & Investment: Confused.com reveals that Life Insurance is more than a form of future protection
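
Relating this back to the loop in the question: `node.attrib` is a plain mapping of an element's attributes, so `node.attrib.get('<some XPath>')` just looks up an attribute with that odd name and returns `None`, which would explain the missing traceback. A path has to be handed to a query method on the tree instead. Separately, XPaths copied from a browser's inspector often contain `tbody` steps that the browser added itself and that are absent from the file on disk; whether that applies to these particular pages is an assumption. The sketch below illustrates both points with the standard library (Python 3) and an invented, well-formed snippet; with lxml the equivalent query call is `tree.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: note that the source text contains no <tbody>.
SAMPLE = (
    "<html><body><table><tr><td><p>"
    "<span class='s1'>Business: Give Back</span>"
    "</p></td></tr></table></body></html>"
)

root = ET.fromstring(SAMPLE)
span = root.find('.//span')

# attrib only holds attributes, so an XPath string is just an unknown key.
print(span.attrib)                                        # {'class': 's1'}
missing = span.attrib.get('/html/body/table/tr/td/p/span')
print(missing)                                            # None, no exception

# A browser-style path with tbody finds nothing in this tree...
with_tbody = root.findall('./body/table/tbody/tr/td/p/span')
# ...while the path that mirrors the actual file finds the span.
without_tbody = root.findall('./body/table/tr/td/p/span')

# Query the tree once per document rather than once per node.
category = without_tbody[0].text if without_tbody else ''
print(with_tbody, category)
```

Guarding the `[0]` lookup matters when thousands of files are processed, since any page whose layout differs would otherwise stop the run with an `IndexError`.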

Solution 1
0 2015-06-14 22:55:27
