Hi, I have a copy of HTML code in a text file and I need to extract some information from it. I managed to find the relevant lines like this, but I can't work out a specific pattern to extract the text.
Position : 27
<a href="https://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&noCS=1&adword=google/PLA&pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>
Position : 28
<a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" id="vplap27" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Ultra B Complex for €21.90 from vitaminexpress.org" data-nt-icon-id="planti27" data-title-id="vplaurlt27" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>
Position : 29
<a href="https://www.narayana-verlag.de/Vitalstoff-Komplex-von-Robert-Franz-90-Kapseln/b22970" id="vplap28" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Vitalstoff-Komplex - von Robert Franz - 90 Kapseln for €26.00 from Narayana Verlag" data-nt-icon-id="planti28" data-title-id="vplaurlt28" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>
I want to extract the URL after "href", and the name of the product after the text "aria-label". How can I do that in Python?
Currently I'm using the script below to find the lines of interest:
import psycopg2

try:
    filePath = '/Users/lins/Downloads/pladiv.txt'
    with open(filePath, 'r') as file:
        print('entered loop')
        cnt = 1
        for line in file:
            if 'pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="' in line:
                print('Position : ' + str(cnt))
                cnt = cnt + 1
            if 'href="' in line:
                print(line)
                fields = line.split(";")
                # print(fields[0] + ' as URL')
except (Exception, psycopg2.Error) as error:
    quit()
Note: I was inserting the results into my PostgreSQL DB; that code has been removed from the sample above.
You can either use regex, like this:

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
print(urls)
# ['http://example.com', 'http://example2.com']
Or you can parse the text as HTML:

>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(url, 'html.parser')  # or Soup(url, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']
Either approach works. EDIT: to get the entire value of href you can use this,
url = """<a href="http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&noCS=1&adword=google/PLA&pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>"""
findall = re.findall(r'(https?://[^\s"]+)', url)
print(findall)
['http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&noCS=1&adword=google/PLA&pk_campaign=google/PLA']
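To pull the product name as well, a single regex can capture both the href and the aria-label values in one pass. A sketch, assuming the attribute values never contain an escaped double quote (true for the sample lines above):

```python
import re

line = ('<a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" '
        'aria-label="Ultra B Complex for €21.90 from vitaminexpress.org"></a>')

# href comes before aria-label in these tags; capture each quoted value.
match = re.search(r'href="([^"]+)".*?aria-label="([^"]+)"', line)
if match:
    url, label = match.groups()
    print(url)    # https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln
    print(label)  # Ultra B Complex for €21.90 from vitaminexpress.org
```

The `[^"]+` character class stops each capture at the closing quote, which avoids the trailing-quote problem that `[^\s]+` runs into.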