Using xpath in python, how can I select only a subpart of elements?

Question

Sorry if this is a super easy question; my Python skills are not very advanced yet. For context, I am trying to scrape an issue page of the Journal of Communication ( https://academic.oup.com/joc/issue/67/1 ). Essentially, I want to get the title of each article together with its authors. Getting the titles and the authors separately is no problem, but I am struggling to match the authors to the titles. I tried to create a list of lists, with one inner list of authors per title. However, everything I tried ended the same way: I get a single flat list with all authors. Since the articles have different numbers of authors, it is not possible to match the authors to the articles afterwards.

My last try was this:

import os
import time
import requests
from lxml import html
import numpy as np
import pandas as pd
from datetime import datetime
url = 'https://academic.oup.com/joc/issue/67/1'

#needed for accessing the website
headers = requests.utils.default_headers() #need this and next line because otherwise connection error
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'


response = requests.get(url, headers=headers)
tree = html.fromstring(response.text) #create html document

#title
titles_pre = tree.xpath('//div[@class = "al-article-item-wrap al-normal"]/div/h5/a/text()')
titles = [str(f) for f in titles_pre]

##authors
Authors = []
amount_articles = len(tree.xpath('//div[contains(@class, "al-article-item-wrap al-normal")]')) #amount of articles on the website
for i in range(0,amount_articles):
    num = i + 1
    authors_art = tree.xpath('(//div[@class= "al-authors-list"])[num]/span/a/text()')
    #get only the numth div element where the author names are stored and extract the author names
    Authors.append(authors_art)

My expectation was that I would go through each of the div elements separately, extract the author information, append it to a list, and then move on to the next div element with the next article's authors. However, the outcome was that in loop iteration "i" all author names from the whole page were grabbed and appended (instead of only the authors belonging to the i-th article). Can anyone explain what I am doing wrong, and what I need to do instead so that each iteration selects only the authors of one specific article?

Answer 1

Try something simpler, along the lines of:

targets = tree.xpath('//div[@class="al-article-items"]')
for target in targets:
    print("Title: ",target.xpath('.//h5/a')[0].text)
    print('Author(s): ',[author.text for author in target.xpath('.//div/span/a')])
    print('-----')

Output:

Title:  Convergent News? A Longitudinal Study of Similarity and Dissimilarity in the Domestic and Global Coverage of the Israeli-Palestinian Conflict
Author(s):  ['Christian Baden', 'Keren Tenenboim-Weinblatt']
-----
Title:  Online Privacy Concerns and Privacy Management: A Meta-Analytical Review
Author(s):  ['Lemi Baruh', 'Ekin Secinti', 'Zeynep Cemalcilar']

etc.
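
If you want the list of lists from the question rather than printed output, the same idea can be sketched as below. Two things are worth noting. First, in the question's query the text `[num]` sits inside a plain string literal, so Python never substitutes the loop variable; XPath reads `num` as the name of a child element instead of a position. Second, a path that starts with `.` is evaluated relative to the element you call `xpath()` on, which is what keeps each article's authors separate. The `SAMPLE` markup and the helper `authors_by_index` are made up for illustration and only imitate the structure described above; the real page may nest things differently.

```python
from lxml import html

# Hypothetical, simplified markup standing in for the real page.
SAMPLE = """
<html><body>
<div class="al-article-items">
  <h5><a>First title</a></h5>
  <div class="al-authors-list">
    <span><a>Author A</a></span><span><a>Author B</a></span>
  </div>
</div>
<div class="al-article-items">
  <h5><a>Second title</a></h5>
  <div class="al-authors-list"><span><a>Author C</a></span></div>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

titles = []
authors = []  # one inner list of author names per article
for article in tree.xpath('//div[@class="al-article-items"]'):
    # The leading "." restricts each query to the current article element.
    titles.append(article.xpath('string(.//h5/a)').strip())
    names = article.xpath('.//div[@class="al-authors-list"]/span/a/text()')
    authors.append([str(name) for name in names])

# Titles and author lists now line up by position.
pairs = list(zip(titles, authors))


def authors_by_index(tree, num):
    """Index-based variant: insert the 1-based position with an f-string."""
    query = f'(//div[@class="al-authors-list"])[{num}]/span/a/text()'
    return [str(name) for name in tree.xpath(query)]
```

The relative-path loop is usually the easier one to keep correct, because it does not depend on the author blocks and the article blocks appearing in the same order or in the same number.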

solution1
0 ACCEPTED 2020-07-12 17:30:37
