
Using Python to extract strings from the title attribute in HTML

I'm trying to extract every something inside title=" something" from the .html file below, using Python.

<a class="BoxA" href="https://www.somethingsomething1.com" title=" AppleJuce 50x 122L">
...
</a></td>
<a class="BoxA" href="https://www.somethingsomething2.com" title=" AppleJam 100x 300L ">
...
</a></td>
and so on

Based on my search, I think I should use something like this:

from lxml import html
import requests
import re

with open(r'C:\Users\Me\Desktop\1.html', "rb") as f:
    page = f.read()
tree = html.fromstring(page)
Titles= tree.xpath(...)

but I'm having trouble with the ...somecode part inside Titles= tree.xpath(...somecode).

Or is there any other way to do this? Thank you.

Also, I'd like to store the name (AppleJuce 50x) and the size (122L) in two separate lists, but I don't know how to find the number that follows the last whitespace in a string.

This is what I have so far for splitting the strings:

for title in Titles:
    number = re.search('\d', title)
    Apple= [title[:number.start()]]  #?????Is this right?
    size = [title[number.start():]]  #?????Is this right?
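
For the XPath part of the question, one possible expression is `//a[@class="BoxA"]/@title`, which selects the title attribute of every matching link. The sketch below is a minimal example under assumptions: the `SAMPLE` string and the `extract_titles` helper are made up for illustration, the markup is modeled on the snippet in the question, and the real file may need a different selector.

```python
from lxml import html

# Hypothetical sample modeled on the markup shown in the question.
SAMPLE = (
    '<div>'
    '<a class="BoxA" href="https://www.somethingsomething1.com" title=" AppleJuce 50x 122L">...</a>'
    '<a class="BoxA" href="https://www.somethingsomething2.com" title=" AppleJam 100x 300L ">...</a>'
    '</div>'
)

def extract_titles(markup):
    """Return the stripped title attribute of every <a class="BoxA"> element."""
    tree = html.fromstring(markup)
    # The trailing /@title selects the attribute value, not the element itself.
    raw_titles = tree.xpath('//a[@class="BoxA"]/@title')
    # The values in the question carry stray leading/trailing spaces.
    return [t.strip() for t in raw_titles]

titles = extract_titles(SAMPLE)
print(titles)
```

With the file from the question, the same function could be called on the bytes returned by `f.read()`, since `html.fromstring` accepts both bytes and str.
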
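
# Alternative without lxml: match title="..." directly in the raw HTML text.
# The file was opened with "rb", so page is bytes; decode it before applying
# a str pattern (UTF-8 is an assumption about the file's encoding).
text = page.decode('utf-8')
titleRegEx = r'title=\"([a-z\.\'A-Z0-9\s]*)\"'
findList = re.findall(titleRegEx, text)
appleList = []
sizeList = []
for item in findList:
    # split() with no argument ignores leading, trailing and repeated whitespace
    processedItemList = item.split()
    # Assumes exactly three words: name, count, size
    appleList.append(processedItemList[0] + " " + processedItemList[1])
    sizeList.append(processedItemList[2])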
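
The index-based split above assumes every title has exactly three words. To separate the size from whatever precedes it, it may be simpler to split once at the last run of whitespace with `str.rsplit`. This is a sketch, not a definitive solution: `split_title` is a hypothetical helper, and it assumes each title contains at least one space, with the size as the final word.

```python
def split_title(title):
    """Split ' AppleJuce 50x 122L' into ('AppleJuce 50x', '122L')."""
    # strip() removes the stray outer spaces; rsplit(None, 1) then splits
    # exactly once, at the last run of whitespace.
    name, size = title.strip().rsplit(None, 1)
    return name, size

sample_titles = [" AppleJuce 50x 122L", " AppleJam 100x 300L "]

names = []
sizes = []
for title in sample_titles:
    name, size = split_title(title)
    names.append(name)
    sizes.append(size)

print(names)
print(sizes)
```

A title made of a single word would raise a ValueError on unpacking, so real data may need a guard for that case.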
