I'm trying to extract every value that appears
inside title=" something"
in the .html file below, using Python.
<a class="BoxA" href="https://www.somethingsomething1.com" title=" AppleJuce 50x 122L">
...
</a></td>
<a class="BoxA" href="https://www.somethingsomething2.com" title=" AppleJam 100x 300L ">
...
</a></td>
and so on
Based on my search I think I should use
from lxml import html
import requests
import re
with open(r'C:\Users\Me\Desktop\1.html', "rb") as f:
    page = f.read()
tree = html.fromstring(page)
Titles= tree.xpath(...)
but I'm having trouble working out what to put in place of ...somecode
inside Titles= tree.xpath(...somecode)
Or is there any other way to do this? Thank you.
Also, I'd like to have the names (e.g. AppleJuce 50x)
and their sizes (e.g. 122L)
stored in two different lists, but I don't know how to find the number that follows the last whitespace in a string.
This is what I have so far for splitting the strings:
for title in Titles:
    number = re.search('\d', title)
    Apple= [title[:number.start()]] #?????Is this right?
    size = [title[number.start():]] #?????Is this right?
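# re.findall with a str pattern needs a str to search; page was read in
# binary mode ("rb"), so decode it first (UTF-8 is an assumption about the file).
pageText = page.decode('utf-8')
titleRegEx = r'title=\"([a-z\.\'A-Z0-9\s]*)\"'
findList = re.findall(titleRegEx, pageText)
appleList = []
sizeList = []
for item in findList:
    # Drop leading/trailing whitespace, then split on runs of whitespace
    processedItemList = item.strip().split()
    # The first two words are the name, the third is the size
    appleList.append(processedItemList[0] + " " + processedItemList[1])
    sizeList.append(processedItemList[2])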
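If you would rather stay with lxml, as in the question, here is a minimal sketch of what could go inside tree.xpath(...). It selects the title attribute of every <a class="BoxA"> element, then splits each title once at its last run of whitespace. SAMPLE and split_titles are illustrative names, not from the original post, and the sketch assumes every title ends with the size as its last whitespace-separated word.

```python
from lxml import html

# Illustrative markup modelled on the snippet in the question; in practice
# this would be the contents of the .html file.
SAMPLE = """
<div>
<a class="BoxA" href="https://www.somethingsomething1.com" title=" AppleJuce 50x 122L">...</a>
<a class="BoxA" href="https://www.somethingsomething2.com" title=" AppleJam 100x 300L ">...</a>
</div>
"""


def split_titles(markup):
    """Return (names, sizes) taken from the title attributes of a.BoxA links."""
    tree = html.fromstring(markup)
    # "/@title" selects the attribute value itself, so this yields strings
    titles = tree.xpath('//a[@class="BoxA"]/@title')
    names, sizes = [], []
    for title in titles:
        # rsplit(None, 1) splits once, at the last run of whitespace.
        # Assumption: the title has at least two words; otherwise the
        # unpacking below raises ValueError.
        name, size = title.strip().rsplit(None, 1)
        names.append(name)
        sizes.append(size)
    return names, sizes


names, sizes = split_titles(SAMPLE)
print(names)
print(sizes)
```

Unlike re.search('\d', title), which stops at the first digit (the 5 in 50x), splitting from the right keeps "AppleJuce 50x" together and puts only "122L" in the size list.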