
Get xpath from html file using LXML - Python

I am learning how to parse documents using lxml. To do so, I'm trying to parse my LinkedIn page. It has plenty of information, and I thought it would be good practice.

Enough context. Here is what I'm doing:

  1. going to the url: https://www.linkedin.com/in/NAME/
  2. opening the page source and saving it as "linkedin.html"
  3. trying to extract my current job with the following code:
from io import StringIO, BytesIO
from lxml import html, etree

# read file
filename = 'linkedin.html'
file = open(filename).read()

# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)

# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)

The tree variable's type is lxml.etree._ElementTree.

But it always returns an empty list for my variable title.

I've been trying all day but still don't understand what I'm doing wrong.
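
One common cause of an empty result like this, offered as an assumption rather than a confirmed diagnosis for this page: an absolute path copied from the browser's developer tools describes the DOM after scripts have run, and the saved source may not have the same nesting. The sketch below uses a small hypothetical HTML snippet (the markup, the class name "headline" and the job title are made up for illustration) to show an absolute path missing while a relative query still matches.

```python
from io import StringIO
from lxml import etree

# Hypothetical markup standing in for the saved page source.
SAVED_SOURCE = """<html><body>
<div id="profile"><div class="headline"><h2>Data Engineer</h2></div></div>
</body></html>"""

tree = etree.parse(StringIO(SAVED_SOURCE), etree.HTMLParser())

# Absolute path that expects a second <div> under <body>, which is not there.
absolute = tree.xpath('/html/body/div[2]/div/h2')

# Relative query anchored on an attribute instead of on position.
relative = tree.xpath('//div[@class="headline"]/h2')

print(absolute)
print([h.text for h in relative])
```

If the absolute path comes back empty, shortening it step by step (for example checking `/html/body/div` first) can show where the saved file diverges from what the browser displayed.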

I found the answer to my problem: adding an encoding parameter to the open() call.

Here is what I did:

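from io import StringIO
from lxml import etree


def parse_html_file(filename):
    # Read the file as UTF-8 text instead of relying on the platform default encoding
    with open(filename, encoding="utf8") as f:
        content = f.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(content), parser)
    return tree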


tree = parse_html_file('linkedin.html')
name = tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())
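
As a possible variation on the same idea, the file contents can be handed to lxml as bytes, with the encoding stated on the parser rather than in open(). This is only a sketch: the sample markup and the name in it are hypothetical stand-ins for the saved page, and `parse_html_bytes` is an illustrative helper, not part of lxml.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the bytes of the saved page.
SAMPLE = """<html><head><meta charset="utf-8"></head><body>
<ul><li class="inline t-24 t-black t-normal break-words">
  Zo\u00eb Example
</li></ul>
</body></html>""".encode("utf-8")


def parse_html_bytes(data):
    # Tell the parser which encoding the bytes use instead of decoding them first.
    parser = etree.HTMLParser(encoding="utf-8")
    return etree.parse(BytesIO(data), parser)


tree = parse_html_bytes(SAMPLE)

# contains() matches one class token, so the query survives changes to the other classes.
items = tree.xpath('//li[contains(@class, "break-words")]')

# Guard against an empty result before indexing.
name = items[0].text.strip() if items else None
print(name)
```

For a real file, the bytes would come from `open(filename, "rb").read()`. Checking that the list is non-empty before indexing avoids an IndexError when the query matches nothing.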
