
Get xpath from html file using LXML - Python

I am learning how to parse documents using lxml. To do so, I'm trying to parse my LinkedIn page. It has plenty of information and I thought it would be good practice.

Enough context. Here is what I'm doing:

  1. Going to the URL: https://www.linkedin.com/in/NAME/
  2. Opening the page source and saving it as "linkedin.html"
  3. Since I'm trying to extract my current job, I'm running the following:
from io import StringIO, BytesIO
from lxml import html, etree

# read file
filename = 'linkedin.html'
file = open(filename).read()

# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)

# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)

The tree variable's type is

But the XPath query always returns an empty list for my variable title.

I've been trying all day but still don't understand what I'm doing wrong.

I found the answer to my problem: adding an encoding parameter to the open() call.

Here is what I did:

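from io import StringIO
from lxml import etree


def parse_html_file(filename):
    # Decode the file as UTF-8 explicitly instead of relying on the platform default
    with open(filename, encoding="utf8") as f:
        source = f.read()
    parser = etree.HTMLParser()
    return etree.parse(StringIO(source), parser)


tree = parse_html_file('linkedin.html')
# Select by class attribute rather than by an absolute path from the document root
name = tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())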
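
For comparison, here is a minimal sketch of a variant that hands lxml the raw bytes and lets the parser do the decoding, so no encoding argument is needed on open(). The SAMPLE markup, the person's name in it, and the helper names parse_html_bytes and first_text are made up for illustration; the real LinkedIn markup will differ. The sketch also guards against an empty result list and matches the class with contains(), which is a plain substring test and may over-match on other pages.

```python
from lxml import etree

# Hypothetical stand-in for the saved page; the real markup will differ.
SAMPLE = (
    '<html><head><meta charset="utf-8"></head><body>'
    '<ul><li class="inline t-24 t-black t-normal break-words"> Zoé Müller </li></ul>'
    '</body></html>'
).encode("utf-8")


def parse_html_bytes(data, encoding="utf-8"):
    """Parse raw HTML bytes, letting lxml decode them with the given encoding."""
    parser = etree.HTMLParser(encoding=encoding)
    return etree.fromstring(data, parser)


def first_text(root, xpath_expr):
    """Return the stripped text of the first match, or None if nothing matched."""
    matches = root.xpath(xpath_expr)
    if not matches:
        return None
    return (matches[0].text or "").strip()


root = parse_html_bytes(SAMPLE)

# Attribute-based query: survives changes in how deeply the element is nested.
name = first_text(root, '//li[contains(@class, "break-words")]')

# Absolute path copied from a browser: matches nothing in this document.
missing = first_text(root, '/html/body/div[6]/h2')

print(name)
print(missing)
```

With a saved page, the same helper could be fed the file contents read in binary mode, for example open('linkedin.html', 'rb').read(), assuming the file really is UTF-8.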
