Using XPath in Python, how can I select only a sub-part of an element?

Question

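Sorry if this is a very basic question; my Python skills are not very advanced yet. For context, I am trying to scrape an issue page of the Journal of Communication (https://academic.oup.com/joc/issue/67/1). What I want is the title of each article together with its authors. Getting the titles and the authors separately is no problem, but I am struggling to match the authors to the titles. I am trying to build a list of lists in which each title has its own list of authors. However, everything I try ends the same way: I get one single list containing all the authors. Because the articles have different numbers of authors, it is impossible to match authors to articles afterwards.

My latest attempt looks like this: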

import os
import time
import requests
from lxml import html
import numpy as np
import pandas as pd
from datetime import datetime
url = 'https://academic.oup.com/joc/issue/67/1'

#needed for accessing the website
headers = requests.utils.default_headers() #need this and next line because otherwise connection error
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'


response = requests.get(url, headers=headers)
tree = html.fromstring(response.text) #create html document

#title
titles_pre = tree.xpath('//div[@class = "al-article-item-wrap al-normal"]/div/h5/a/text()')
titles = [str(f) for f in titles_pre]

##authors
Authors = [] 
amount_articles = len(tree.xpath('//div[contains(@class, "al-article-item-wrap al-normal")]')) #amount of articles on the website
for i in range(0,amount_articles):
    num = i + 1
    #get only the numth div element where the author names are stored and extract the author names
    authors_art = tree.xpath('(//div[@class= "al-authors-list"])[num]/span/a/text()')
    Authors.append(authors_art)

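My expectation was that I would go through each div element separately, extract its author information, append it to a list, and then move on to the next div element for the next set of authors. Instead, on every iteration "i" of the for loop, the author names of the entire website are scraped and appended (rather than only the authors belonging to the i-th article). Can anybody help me and explain what I am doing wrong, and what I need to do instead to select only the specific authors I am interested in each time?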

Answer 1

Try something simpler, like this:

targets = tree.xpath('//div[@class="al-article-items"]')
for target in targets:
    print("Title: ",target.xpath('.//h5/a')[0].text)
    print('Author(s): ',[author.text for author in target.xpath('.//div/span/a')])
    print('-----')

Output:

Title:  Convergent News? A Longitudinal Study of Similarity and Dissimilarity in the Domestic and Global Coverage of the Israeli-Palestinian Conflict
Author(s):  ['Christian Baden', 'Keren Tenenboim-Weinblatt']
-----
Title:  Online Privacy Concerns and Privacy Management: A Meta-Analytical Review
Author(s):  ['Lemi Baruh', 'Ekin Secinti', 'Zeynep Cemalcilar']

and so on.
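
Two points seem worth spelling out. First, in the question's loop the text num sits inside a string literal, so the Python variable is never substituted; XPath receives the literal name num and reads it as a child element name, not as a position. Second, a path that starts with // always searches the whole document, whereas a path that starts with . is evaluated relative to the element it is called on, which is what keeps each article's authors together. The following is only a minimal sketch of the list-of-lists the question asks for. It assumes a simplified, made-up markup (the real page may be structured differently), the helper name titles_with_authors is hypothetical, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies. For the real page, parsing with lxml's html.fromstring as in the question would still be needed, and the same relative paths can be passed to each element's xpath() method.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for illustration only.
# The outer wrapper class and the sample titles/authors are made up.
SAMPLE = """
<div class="al-article-list-group">
  <div class="al-article-items">
    <h5><a>First title</a></h5>
    <div class="al-authors-list">
      <span><a>Author A</a></span>
      <span><a>Author B</a></span>
    </div>
  </div>
  <div class="al-article-items">
    <h5><a>Second title</a></h5>
    <div class="al-authors-list">
      <span><a>Author C</a></span>
    </div>
  </div>
</div>
""".strip()


def titles_with_authors(root):
    """Return [[title, [author, ...]], ...], one entry per article element."""
    records = []
    # One element per article; every later query starts from this element.
    for article in root.findall(".//div[@class='al-article-items']"):
        # The leading "." scopes the search to this article only.
        title = article.find(".//h5/a").text
        authors = [
            a.text
            for a in article.findall(".//div[@class='al-authors-list']/span/a")
        ]
        records.append([title, authors])
    return records


root = ET.fromstring(SAMPLE)
records = titles_with_authors(root)
for title, authors in records:
    print(title, authors)
```

Because each author list is collected from its own article element, articles with different numbers of authors stay correctly paired with their titles, and no positional index is needed at all.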

Answer 1: score 0, accepted, 2020-07-12 17:30:37