Using XPath in Python, how can I select only a subpart of elements?
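Sorry if this is a super simple question; my Python skills are not very advanced yet. For context, I am trying to scrape the issue pages of the Journal of Communication (https://academic.oup.com/joc/issue/67/1). What I want to do is basically get the title of each article together with its authors. While getting the titles and the authors is generally no problem, I am struggling to match the authors to the titles. I tried to create a list of lists in which each title has a list containing all of its authors. However, everything I try ends the same way: I get a single list containing all authors. Since the articles have different numbers of authors, it is impossible to match authors to articles afterwards.
My last attempt looked like this: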
import os
import time
import requests
from lxml import html
import numpy as np
import pandas as pd
from datetime import datetime

url = 'https://academic.oup.com/joc/issue/67/1'

# needed for accessing the website
headers = requests.utils.default_headers()  # this and the next line are needed, otherwise there is a connection error
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
response = requests.get(url, headers=headers)
tree = html.fromstring(response.text)  # create the HTML document

# titles
titles_pre = tree.xpath('//div[@class = "al-article-item-wrap al-normal"]/div/h5/a/text()')
titles = [str(f) for f in titles_pre]

# authors
Authors = []
amount_articles = len(tree.xpath('//div[contains(@class, "al-article-item-wrap al-normal")]'))  # number of articles on the page
for i in range(0, amount_articles):
    num = i + 1
    # get only the num-th div element where the author names are stored and extract the author names
    authors_art = tree.xpath('(//div[@class= "al-authors-list"])[num]/span/a/text()')
    Authors.append(authors_art)
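My expectation was that I would go through each div element separately, extract the author information, append it to a list, and then move on to the next div element for the next set of authors. What happens instead is that on every iteration "i" of the for loop, the author names of the whole page are scraped and appended (rather than only the authors belonging to the i-th article). Can anyone help me and explain what I am doing wrong, and what I need to do instead to select only the specific authors I am interested in each time?
Try something simpler, like this: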
targets = tree.xpath('//div[@class="al-article-items"]')
for target in targets:
    # the leading "." keeps each query inside the current article block
    print('Title:', target.xpath('.//h5/a')[0].text)
    print('Author(s):', [author.text for author in target.xpath('.//div/span/a')])
    print('-----')
Output:
Title: Convergent News? A Longitudinal Study of Similarity and Dissimilarity in the Domestic and Global Coverage of the Israeli-Palestinian Conflict
Author(s): ['Christian Baden', 'Keren Tenenboim-Weinblatt']
-----
Title: Online Privacy Concerns and Privacy Management: A Meta-Analytical Review
Author(s): ['Lemi Baruh', 'Ekin Secinti', 'Zeynep Cemalcilar']
and so on.
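To end up with the list of lists the question asks for, the same relative-XPath idea can collect the results instead of printing them. The following is only a minimal sketch: SAMPLE is made-up markup that imitates the class names mentioned above (the real page may differ), and authors_by_article and authors_at are hypothetical helper names. The second helper shows one way to keep the indexed approach from the question, passing the Python value as an lxml XPath variable ($n) rather than writing the name num inside the string, where it is not substituted.

```python
from lxml import html

# Hypothetical markup imitating the class names used in this thread;
# the structure of the real page is an assumption.
SAMPLE = """
<html><body>
<div class="al-article-items">
  <h5><a href="#">First article</a></h5>
  <div class="al-authors-list">
    <span><a href="#">Author A</a></span>
    <span><a href="#">Author B</a></span>
  </div>
</div>
<div class="al-article-items">
  <h5><a href="#">Second article</a></h5>
  <div class="al-authors-list">
    <span><a href="#">Author C</a></span>
  </div>
</div>
</body></html>
"""


def authors_by_article(tree):
    """Return a list of (title, [authors]) pairs, one per article block."""
    pairs = []
    for item in tree.xpath('//div[@class="al-article-items"]'):
        # The leading "." makes both queries relative to this article block.
        title = item.xpath('string(.//h5/a)').strip()
        authors = [
            name.strip()
            for name in item.xpath('.//div[@class="al-authors-list"]/span/a/text()')
        ]
        pairs.append((title, authors))
    return pairs


def authors_at(tree, n):
    """Authors of the n-th (1-based) author list, using an XPath variable."""
    return tree.xpath(
        '(//div[@class="al-authors-list"])[position() = $n]/span/a/text()',
        n=n,
    )


tree = html.fromstring(SAMPLE)
result = authors_by_article(tree)
for title, authors in result:
    print(title, authors)
```

Keeping the title and its authors in one pair means articles with different numbers of authors never get out of step, which is the problem with building two separate flat lists.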