Using Python to extract strings from the title attribute in HTML

Question

I'm trying to extract the text inside every title=" something" attribute in the .html file below using Python.

<a class="BoxA" href="https://www.somethingsomething1.com" title=" AppleJuce 50x 122L">
...
</a></td>
<a class="BoxA" href="https://www.somethingsomething2.com" title=" AppleJam 100x 300L ">
...
</a></td>
and so on

Based on my search I think I should use

from lxml import html
import requests
import re

with open(r'C:\Users\Me\Desktop\1.html', "rb") as f:
    page = f.read()
tree = html.fromstring(page)
Titles= tree.xpath(...)

but I'm having trouble working out what to put inside Titles = tree.xpath(...).

Or is there any other way to do this? Thank you.

Also, I'd like to store the names (e.g. AppleJuce 50x) and their sizes (e.g. 122L) in two separate lists, but I don't know how to find the number that follows the last whitespace in a string.

This is what I have so far for splitting the strings:

for title in Titles:
    number = re.search('\d', title)
    Apple= [title[:number.start()]]  #?????Is this right?
    size = [title[number.start():]]  #?????Is this right?

Answer 1

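import re

# Read the file as text (not "rb"): a str pattern cannot search bytes.
with open(r'C:\Users\Me\Desktop\1.html') as f:
    page = f.read()

# Capture whatever sits between the quotes of each title="..." attribute.
titleRegEx = r'title="([a-zA-Z0-9.\'\s]*)"'
findList = re.findall(titleRegEx, page)
appleList = []
sizeList = []
for item in findList:
    # strip() drops the surrounding spaces; split() breaks on any whitespace.
    processedItemList = item.strip().split()
    appleList.append(processedItemList[0] + " " + processedItemList[1])
    sizeList.append(processedItemList[2])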
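
If you would rather not match HTML with a regex, here is a rough sketch of another way, using only the standard library's html.parser. The names TitleCollector and split_title are made up for this example, and the sample string is just the markup from the question. It assumes every title ends with a size preceded by whitespace; a title with a single word would make the unpacking in split_title fail. If you stay with lxml, an expression along the lines of tree.xpath('//a[@class="BoxA"]/@title') should give you the same attribute values, though I have not run it against your real file.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a class="BoxA"> tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "BoxA" and attrs.get("title"):
            self.titles.append(attrs["title"])


def split_title(title):
    """Split a title such as 'AppleJuce 50x 122L' at its last whitespace."""
    # rsplit with maxsplit=1 cuts only at the final run of whitespace,
    # so everything before it stays together as the name.
    name, size = title.strip().rsplit(None, 1)
    return name, size


# Sample markup copied from the question; replace it with the file contents.
sample = '''
<a class="BoxA" href="https://www.somethingsomething1.com" title=" AppleJuce 50x 122L">
...
</a></td>
<a class="BoxA" href="https://www.somethingsomething2.com" title=" AppleJam 100x 300L ">
...
</a></td>
'''

parser = TitleCollector()
parser.feed(sample)

apple_list = []
size_list = []
for title in parser.titles:
    name, size = split_title(title)
    apple_list.append(name)
    size_list.append(size)

print(apple_list)
print(size_list)
```

Splitting from the right means the name can contain any number of words, which the fixed indexes [0], [1], [2] in the regex version cannot handle.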
