Selenium python: get all the <li> text of all the <ul> from a <div>

I would like to collect every word pair listed as dutch word = english word from several pages.

Examining the HTML, this means I need the text of every li in every ul under the child div of #mw-content-text.

Here is my code:

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window
driver = webdriver.Chrome(chrome_options=options)

listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Here is the output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

I don't understand why some li texts are not retrieved even though their XPath is the same (I double-checked several of them with Copy XPath in the developer console).

Try waiting for the page to load fully before parsing it. One way is to use time.sleep():

from time import sleep
...

for url in listURL:
    driver.get(url)
    sleep(5)
    ...
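A fixed sleep either waits longer than needed or not long enough. As a rough sketch of an alternative, the helper below polls a condition until it holds or a timeout expires. The name poll_until and its parameters are made up for this illustration and are not part of Selenium:

```python
import time


def poll_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Hypothetical usage inside the loop, instead of sleep(5):
# texts = poll_until(
#     lambda: [li.text for li in each_ul.find_elements_by_tag_name("li") if li.text]
# )
```

The predicate decides what "loaded" means, so the same helper works for any check you can express as a function.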

EDIT: Using BeautifulSoup:

import requests
from bs4 import BeautifulSoup


listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    
    for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
        print(tag.text)
        print()
        print(tag.find_next("ul").text)
        print("-" * 80)
    print()

Output (truncated):

Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
Lesson 1

man = man
vrouw = woman
jongen = boy
ik = I
ben = am
een = a/an
en = and
--------------------------------------------------------------------------------
Lesson 2

meisje = girl
kind = child/kid
hij = he
ze = she (unstressed)
is = is
of = or
--------------------------------------------------------------------------------
Lesson 3

appel = apple

... and so on

If you want the output as a list:

for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
    print("-" * 80)
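Since the goal is dutch word = english word pairs, the collected strings can then be split into tuples. This is only a sketch: split_pairs is a hypothetical helper name, and it assumes every vocabulary entry uses " = " as the separator, so blank strings and grammar notes are skipped:

```python
def split_pairs(lines):
    """Split 'dutch = english' strings into (dutch, english) tuples.

    Lines without ' = ' (blank strings, grammar notes) are skipped.
    """
    pairs = []
    for line in lines:
        dutch, sep, english = line.partition(" = ")
        if sep and dutch.strip() and english.strip():
            pairs.append((dutch.strip(), english.strip()))
    return pairs


# Sample entries taken from the page output
sample = ["man = man", "een = a/an", "", "Zijn (to be)", "ze = she (unstressed)"]
print(split_pairs(sample))
# [('man', 'man'), ('een', 'a/an'), ('ze', 'she (unstressed)')]
```

Passing the result to dict() would give a lookup table, but repeated Dutch words such as ze would then keep only their last translation.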

Your script looks fine, but I'd add an explicit or implicit wait. Try waiting until the elements on the page are visible:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

You can also add driver.implicitly_wait(15) right after you create the driver.

Output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', 'meisje = girl', 'kind = child/kid', 'hij = he', 'ze = she (unstressed)', 'is = is', 'of = or', 'appel = apple', 'melk = milk', 'drinkt = drinks (2nd and 3rd person singular)', 'drink = drink (1st person singular)', 'eet = eat(s) (singular)', 'de = the', 'sap = juice', 'water = water', 'brood = bread', 'het = it, the', 'je = you (singular informal, unstressed)', 'bent = are (2nd person singular)', 'Zijn (to be)', 'Hebben (to have)', 'Mogen (to be allowed to)', 'Willen (to want)', 'Kunnen (to be able to)', 'Zullen ("will")', 'boterham = sandwich', 'rijst = rice', 'we = we (unstressed)', 'jullie = you (plural informal)', 'eten = eat (plural)', 'drinken = drink (plural)', 'vrouwen = women', 'mannen = men', 'meisjes = girls', 'krant = newspaper', 'lezen = read (plural)', 'jongens = boys', 'menu = menu', 'dat = that', 'zijn = are (plural)', 'ze = they (unstressed)', 'heb = have (1st person singular)', 'heeft = has (3rd person singular)', 'hebt = have (2nd person singular)', 'hebben = have (plural)', 'boek = book', 'lees = read (1st person singular)', 'leest = read(s) (2nd and 3rd person singular)', 'kinderen = children', 'spreken = speak (plural)', 'spreek = speak (1st person singular)', 'spreekt = speak(s) (2nd and 3rd person singular)', 'hallo = hello', 'bedankt = thanks', 'doei = bye', 'dag = goodbye', 'tot ziens = see you later', 'hoi = hi', 'goedemorgen = good morning', 'goededag = good day', 'goedenavond = good evening', 'goedenacht = good night', 'welterusten = good night', 'ja = yes', 'dank je wel = thank you very much', 'alsjeblieft = please', 'sorry = sorry', 'het spijt me = I am sorry', 'oké = okay', 'pardon = excuse me', 'hoe gaat het = how are you', 'goed = good, fine, well', 'dank je = thank you', '(een) beetje = (a) bit of', 'Engels = English', 'Nederlands = Dutch', 'Geen: negating indefinite nouns (you can think of it as "no" things or "none of" a thing if that helps). Geen replaces the indefinite pronoun in question.', 'Niet: negating a verb, adjective or definite nouns. Niet comes at the end of a sentence or directly after the verb zijn.', 'nee = no', 'niet = not', 'geen = not']

Update: I found a more reliable way using CSS selectors. Please try it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
driver.implicitly_wait(15)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    wait = WebDriverWait(driver, 15)
    # Wait for the ad iframes first, since they take the longest to load
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe']")))
    # Then wait for the word lists themselves
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.mw-parser-output>ul')))
    elem = driver.find_elements_by_css_selector('.mw-parser-output>ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_css_selector("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Update 2: While looking for the cause, I found that the ads take up most of the loading time. So I added wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe']"))), which waits until the ads have loaded.

I also changed the selector in the second wait to .mw-parser-output>ul, dropping the trailing li, which I don't think is necessary. You can also try removing the second wait entirely and see whether that helps.

After

WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))

you need to add a short sleep (I guess time.sleep(1) will be enough), and only after that call

elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')

Your problem comes from a misunderstanding of what visibility_of_all_elements_located does.
It does not wait for every element that will eventually match your locator to become visible, because it has no way of knowing how many such elements to expect.
As soon as at least one matching element is present and every match currently in the DOM is visible, it returns the list of elements found so far and the program moves on.
See more details about those methods here and in the official documentation.
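Because the built-in condition cannot know how many elements to expect, one option is to pass WebDriverWait.until() a custom condition that does. The sketch below is an assumption-laden example, not part of Selenium: at_least_n_with_text is a hypothetical name, and the minimum count is something you would have to choose per page:

```python
def at_least_n_with_text(locator, minimum):
    """Build a condition usable with WebDriverWait.until().

    The condition returns the matching elements that have non-empty text
    once there are at least `minimum` of them; otherwise it returns False
    so the wait keeps polling.
    """
    def _condition(driver):
        elements = driver.find_elements(*locator)
        with_text = [el for el in elements if el.text.strip()]
        if len(with_text) >= minimum:
            return with_text
        return False
    return _condition


# Hypothetical usage (the threshold of 20 is an arbitrary assumption):
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.wait import WebDriverWait
# items = WebDriverWait(driver, 15).until(
#     at_least_n_with_text((By.CSS_SELECTOR, ".mw-parser-output > ul li"), 20)
# )
# list_text.extend(item.text for item in items)
```

If the threshold is never reached, until() raises TimeoutException after the timeout, which is at least a clearer failure than a list full of empty strings.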
