
Python: Scraping a table / getting a specific column when the first column is not always equal

I am trying to extract the second column of the following table, i.e. the names of the muscles: http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html

Here's my code so far:

    import requests
    import time
    from bs4 import BeautifulSoup as soup

    url = "http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html"
    links = []
    time.sleep(1)
    print(url)
    page = requests.get(url)
    text = soup(page.text, 'html.parser')
    table = text.select('table')[1]
    rows = table.find_all('tr')[2:]

    names = []
    for row in rows:
        names.append(row.find_all('td')[1].text.replace('\n', ''))

    print(names)

The problem is that it sometimes gets the second column of the row and sometimes the third, depending on whether the first-column cell extends over two rows. That makes sense, but I can't figure out how to solve it.

Thankful for any ideas!

Try this:

    import pandas as pd

    url = 'http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html'

    tables = pd.read_html(url)
    print(tables[1][1])

Output is the column headed 'Muskel - muscle (Terminologia anatomica)'.
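
If pulling in pandas is not desirable, roughly the same result can be sketched with BeautifulSoup alone by tracking how many rows the first-column cell still covers. This is only a sketch: it assumes the first column uses a `rowspan` attribute when it extends over two rows, which is a guess about the page's markup. `SAMPLE_HTML` and `second_column` are made-up names, and the sample table is a hypothetical miniature, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table (not the real page): the first
# cell of the first row spans two rows, so the second row has one <td> less.
SAMPLE_HTML = """
<table>
  <tr><td rowspan="2">A</td><td>M. abductor digiti minimi</td><td>x</td></tr>
  <tr><td>M. abductor hallucis</td><td>x</td></tr>
  <tr><td>B</td><td>M. biceps brachii</td><td>x</td></tr>
</table>
"""

def second_column(html):
    """Return the text of the logical second column, honouring rowspan
    on the first column (assumed to be the only spanning column)."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    carry = 0  # rows still covered by a first-column cell from above
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # e.g. header rows that only contain <th>
        if carry > 0:
            # The first column is occupied by an earlier cell, so this
            # row's first <td> is already the second column.
            carry -= 1
            index = 0
        else:
            carry = int(cells[0].get("rowspan", 1)) - 1
            index = 1
        if index < len(cells):
            names.append(cells[index].get_text(strip=True))
    return names

print(second_column(SAMPLE_HTML))
```

On the real page you would pass `page.text` and probably restrict the loop to the rows of the second table, as in the question.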

You can exploit the fact that the cells in the second column always have a specific width: width="15%". Try selecting, in each row, the cells that have this width. Be careful: cells in the last columns sometimes have the same attribute, so take the first matching element in each row.
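
A minimal sketch of that idea, using a CSS attribute selector per row and keeping only the first match. The HTML below is a made-up stand-in for the real table (the `width` values on the other cells are assumptions), and `names_by_width` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: the name cell has width="15%",
# and so does a later cell in the same row.
SAMPLE_HTML = """
<table>
  <tr><td rowspan="2" width="5%">A</td><td width="15%">M. anconeus</td><td width="15%">note</td></tr>
  <tr><td width="15%">M. adductor brevis</td><td width="15%">note</td></tr>
  <tr><td width="5%">B</td><td width="15%">M. brachialis</td><td width="15%">note</td></tr>
</table>
"""

def names_by_width(html, width="15%"):
    """Collect, for each row, the first <td> whose width attribute matches."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find_all("tr"):
        # select_one returns the first match in document order, or None
        cell = row.select_one(f'td[width="{width}"]')
        if cell is not None:
            names.append(cell.get_text(strip=True))
    return names

print(names_by_width(SAMPLE_HTML))
```

Because only the first matching cell per row is taken, it does not matter whether the row starts with its own first-column cell or inherits one from the row above.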

You can combine an attribute selector with a type selector to target the a elements that have a name attribute. This is more lightweight than pandas, especially if you only want the muscle names.

    from bs4 import BeautifulSoup as bs
    import requests

    r = requests.get('http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html')
    soup = bs(r.content, 'lxml')
    muscles = [a['name'] for a in soup.select('a[name]')]
    print(muscles)
