How do I use lxml for web scraping?

Question

I want to write a Python script that fetches my current reputation on Stack Overflow: https://stackoverflow.com/users/14483205/raunanza?tab=profile

This is the code I have written.

from lxml import html 
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content)

Now, what should I do to fetch my reputation? (I can't understand how to use XPath, even after googling it.)

Answer 1

Simple solution using lxml and BeautifulSoup:

from bs4 import BeautifulSoup
import requests

# The 'lxml' parser only needs the lxml package installed; no direct import is required.
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile').text
tree = BeautifulSoup(page, 'lxml')
# Strip surrounding whitespace so the message is spaced consistently.
name = tree.find("div", {'class': 'grid--cell fw-bold'}).text.strip()
title = tree.find("div", {'class': 'grid--cell fs-title fc-dark'}).text.strip()
print("Stackoverflow reputation of {} is: {}".format(name, title))
# output: Stackoverflow reputation of Raunanza is: 3

Answer 2

If you don't mind using BeautifulSoup, you can directly extract the text from the tag that contains your reputation. Of course, you need to check the page structure first.

from bs4 import BeautifulSoup
import requests

page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
soup = BeautifulSoup(page.content, features='lxml')

# Print the text of every <strong> tag carrying the reputation classes.
for tag in soup.find_all('strong', {'class': 'ml6 fc-medium'}):
    print(tag.text)
# this prints 3
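
Since the class names depend on the page structure, a lookup can come back empty when the markup changes. The sketch below is one way to guard against that. It runs on a made-up inline snippet rather than the live page, and the helper name `extract_text` is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def extract_text(markup, tag, css_class):
    """Return the stripped text of the first matching tag, or None if nothing matches."""
    # 'html.parser' is the standard-library parser; 'lxml' would work here too.
    soup = BeautifulSoup(markup, "html.parser")
    node = soup.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


# Illustrative markup only; the real profile page may look different.
sample = '<strong class="ml6 fc-medium"> 3 </strong>'

found = extract_text(sample, "strong", "ml6 fc-medium")
missing = extract_text(sample, "strong", "no-such-class")

print(found)    # text of the matching tag
print(missing)  # None, instead of an AttributeError on .text
```

Returning None lets the caller decide how to handle a changed layout instead of crashing on `.text`.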

Answer 3

You need to make some modifications to your code to use XPath. Below is the code:

from lxml import html
import requests

page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content)
# xpath() returns a list of the matching text nodes.
title = tree.xpath('//*[@id="avatar-card"]/div[2]/div/div[1]/text()')
print(title)  # the reputation, 3 when this answer was written

You can easily get the XPath of an element in Chrome's developer tools: inspect the element, right-click it in the Elements panel, and choose Copy > Copy XPath.

To learn more about XPath, you can refer to: https://www.w3schools.com/xml/xpath_examples.asp
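
To make the XPath part more concrete, here is a minimal offline sketch. The HTML in `SAMPLE` is invented for illustration and is not the real profile markup, so the class names and nesting are assumptions. It compares a positional path like the one above with an attribute-based path, and shows that `xpath()` returns a list unless the expression itself evaluates to a string.

```python
from lxml import html

# Hypothetical markup for illustration only; the real profile page differs.
SAMPLE = """
<div id="avatar-card">
  <div class="name">Raunanza</div>
  <div class="stats">
    <div class="reputation"> 3 </div>
    <div class="badges">1</div>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Positional path: second child div of the card, then its first child div.
positional = tree.xpath('//*[@id="avatar-card"]/div[2]/div[1]/text()')

# Attribute-based path: less sensitive to layout changes than div[2]/div[1].
by_class = tree.xpath('//div[@class="reputation"]/text()')

# normalize-space(...) makes the expression evaluate to one trimmed string
# rather than a list of text nodes.
reputation = tree.xpath('normalize-space(//div[@class="reputation"])')

print(positional)
print(by_class)
print(reputation)
```

A path copied from the browser is usually positional, so it can break when the page layout changes; selecting by `id` or `class` tends to hold up better.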

Answer 1
0, 2020-10-22 07:02:33

Answer 2
0, 2020-10-22 07:06:34

Answer 3
0, accepted, 2020-10-22 07:18:34
