I am new to Python and still figuring out Beautiful Soup. I am trying to scrape a website and pull the five elements that immediately follow the tag I have found in my code.
I have tried next_element, which only returns the text of the tag I matched with soup.find, and I have tried next_sibling, which comes back blank.
There are a number of 'first' and 'last' classes on the page, so I have to specify which line I want by its text. Here is what I am trying to scrape:
<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>
This is what I am trying:
for x,y in zip(make, model):
    url = ('https://URL with variables goes here')
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    search = requests.get(url, headers = headers)
    html = search.text
    soup = BeautifulSoup(html, 'lxml')
    search_results = soup.find('li', class_ = 'first', text = re.compile('Maintenance'))
    try:
        d = search_results.next_element
        print(d)
    except:
        print('pass')
The ultimate goal will be to append the array of number1:number5 into a list, but with the code above, the output is just 'Maintenance'. Where am I going wrong? Also, since I am so new, if you are able to provide context as well, I would be very appreciative.
Given your example, the simplest way would be to append every li element
that has no class attribute to the results list.
from bs4 import BeautifulSoup
html = """ <li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>"""
soup = BeautifulSoup(html, 'lxml')
start = soup.find('li', class_ = 'first').parent
result = []
for ele in start.find_all('li'):
    if not ele.get('class'):
        result.append(ele.text)
print(result)
Outputs:
['$number1', '$number2', '$number3', '$number4', '$number5']
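As for why the original attempts did not work: next_element returns the next node the parser saw, which is the string inside the matched tag (hence 'Maintenance'), and next_sibling returns the whitespace text node sitting between that li and the next one (hence the blank). A minimal sketch of an alternative that starts from the 'Maintenance' tag itself is below. It assumes the page has the same shape as the snippet in the question, uses the built-in 'html.parser' only to avoid the lxml dependency, and uses string=, which is the current name for the older text= argument.

```python
from bs4 import BeautifulSoup

html = """<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the row label: an <li class="first"> whose text is "Maintenance".
label = soup.find("li", class_="first", string="Maintenance")

# next_element is the next parsed node: the string inside the tag itself.
inside = label.next_element
# next_sibling is the whitespace text node between this <li> and the next one.
gap = label.next_sibling

# find_next_siblings skips text nodes and returns only the following <li> tags;
# limit=5 stops before the <li class="last"> total.
following = label.find_next_siblings("li", limit=5)
numbers = [li.get_text(strip=True) for li in following]
print(numbers)
```

Because the search is anchored on the 'Maintenance' tag, other rows on the page that also use the 'first' and 'last' classes should not be picked up.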
You could use an XPath expression with tree.xpath, something like:
//li[@class='first' and text()='Maintenance']/following-sibling::li[not(@class)]
For example:
from lxml.html import fromstring
# url = ''
# tree = fromstring(requests.get(url).content)  # needs: import requests
h = '''
<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//li[@class='first' and text()='Maintenance']/following-sibling::li[not(@class)]")]
print(items)
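If the rows on the real page do not reliably lack a class attribute, a variation is to take the first five following siblings by position instead. This is only a sketch under the assumption that the five values always come directly after the 'Maintenance' li, as in the snippet from the question:

```python
from lxml.html import fromstring

h = '''
<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>
'''

tree = fromstring(h)

# following-sibling::li walks forward from the label, so position() <= 5
# keeps the five <li> elements that come right after it.
xpath = ("//li[@class='first' and text()='Maintenance']"
         "/following-sibling::li[position() <= 5]")
items = [li.text_content().strip() for li in tree.xpath(xpath)]
print(items)
```

The position test and the not(@class) test select the same elements on this snippet; which one is safer depends on how the real page is marked up.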
Something along the lines of QHarr's answer, but somewhat different:
h = '''
<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>
'''
from lxml import etree
doc = etree.fromstring(h)
# //ul/li skips the outer wrapper <li>, which also has no class attribute
for cost in doc.xpath('//ul/li'):
    if 'class' not in cost.attrib:
        print(cost.text)
Output:
$number1
$number2
$number3
$number4
$number5