I am trying to extract the second column of the following table, i.e. the names of the muscles: http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html
Here's my code so far:
import requests
import time
from bs4 import BeautifulSoup as soup
url = "http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html"
links = []
time.sleep(1)
print(url)
page = requests.get(url)
text = soup(page.text, 'html.parser')
table = text.select('table')[1]
rows = table.find_all('tr')[2:]
names = []
for row in rows:
    names.append(row.find_all('td')[1].text.replace('\n', ''))
print(names)
The problem is that it sometimes gets me the second column of the row and sometimes the third, depending on whether the first-column cell spans two rows or not. When it does, the following row has one cell fewer, so index 1 points at a different column. Makes sense, but I can't figure out how to solve it.
Thankful for any ideas!
Try this:
import pandas as pd
url = 'http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html'
tables = pd.read_html(url)
print(tables[1][1])
Output is the column headed 'Muskel - muscle (Terminologia anatomica)'.
You could take advantage of the fact that the cells in the second column always have a specific width: width="15%". In each row, select the cells that have this width. Be careful: cells in the later columns sometimes have the same attribute, so take only the first match.
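The width-based idea above can be sketched as follows. This is only a minimal illustration, not tested against the live page: SAMPLE_HTML is a made-up, simplified stand-in for the real table, and the helper name muscle_names is hypothetical. It assumes, as described above, that the muscle-name cell is the first cell in each row carrying width="15%".

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real table. The first cell of
# the first row spans two rows, so the second row has one cell fewer and
# a fixed index such as [1] would pick the wrong column there.
SAMPLE_HTML = """
<table>
  <tr>
    <td rowspan="2" width="5%">A</td>
    <td width="15%">Musculus abductor digiti minimi</td>
    <td width="15%">abductor of little finger</td>
  </tr>
  <tr>
    <td width="15%">Musculus abductor hallucis</td>
    <td width="15%">abductor of big toe</td>
  </tr>
</table>
"""


def muscle_names(html):
    """Return the text of the first width="15%" cell of every row."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for row in soup.find_all("tr"):
        # select_one returns the first matching cell, or None if the row
        # has no such cell (e.g. a header row).
        cell = row.select_one('td[width="15%"]')
        if cell is not None:
            names.append(cell.get_text(strip=True))
    return names


print(muscle_names(SAMPLE_HTML))
```

To try it on the real page, pass page.text from the question's code instead of SAMPLE_HTML. Header rows or nested tables on the actual page may need extra filtering.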
You can use an attribute selector in combination with a type selector to target the a elements that have a name attribute. This is more lightweight than pandas, especially if you just want the muscle names.
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('http://www.drjastrow.de/WAI/Vokabular/Muskeln-A1.html')
soup = bs(r.content,'lxml')
muscles = [a['name'] for a in soup.select('a[name]')]
print(muscles)