
Extracting particular string from a text file in Python

Hi, I have a copy of some HTML code in a text file, and I need to extract a few pieces of information from it. I managed to get as far as the script shown further down, but I can't find a reliable pattern for extracting the text.

Position : 27
        <a href="https://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&amp;noCS=1&amp;adword=google/PLA&amp;pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>

Position : 28
        <a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" id="vplap27" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Ultra B Complex for €21.90 from vitaminexpress.org" data-nt-icon-id="planti27" data-title-id="vplaurlt27" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>

Position : 29
        <a href="https://www.narayana-verlag.de/Vitalstoff-Komplex-von-Robert-Franz-90-Kapseln/b22970" id="vplap28" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Vitalstoff-Komplex - von Robert Franz - 90 Kapseln for €26.00 from Narayana Verlag" data-nt-icon-id="planti28" data-title-id="vplaurlt28" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>

I want to extract the URL that follows "href" and the product name that follows "aria-label". How can I do that in Python?

Currently I'm using the script below to find the lines that are of interest to me:

import psycopg2  # only needed for the database insert, which is omitted here

file_path = '/Users/lins/Downloads/pladiv.txt'
# Marker text that identifies the product links of interest
marker = 'pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="'

try:
    with open(file_path, 'r') as file:
        print('entered loop')
        cnt = 1
        for line in file:
            if marker in line:
                print('Position : ' + str(cnt))
                cnt = cnt + 1
                if 'href="' in line:
                    print(line)
                    fields = line.split(";")
                    # print(fields[0] + '  as URL')
except (Exception, psycopg2.Error) as error:
    # Report the problem instead of exiting silently
    print(error)
    quit()

Note: I was inserting the results into my PostgreSQL DB; that code has been removed from the sample above.

You can either use a regex, like this:

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print(urls)
['http://example.com', 'http://example2.com']

Or you can parse the file as HTML

>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(url, 'html.parser')         # url is the HTML string from the regex example; use Soup(url, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']
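If installing BeautifulSoup is not an option, the same "parse it as HTML" idea can be sketched with the standard library's html.parser module, and it can pick up aria-label at the same time as href. This is only a minimal sketch: the names AnchorCollector and extract_links are made up for illustration, and the sample string is a shortened version of one entry from the question.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, aria-label) pairs from every <a> tag that has both."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attrs is a list of (name, value) tuples
        href = attr_map.get("href")
        label = attr_map.get("aria-label")
        if href and label:
            self.links.append((href, label))


def extract_links(text):
    """Return a list of (href, aria-label) tuples found in text."""
    parser = AnchorCollector()
    parser.feed(text)
    parser.close()
    return parser.links


# Shortened sample based on the question's data (some attributes omitted).
sample = '''Position : 28
        <a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" id="vplap27" class="plantl pla-unit-single-clickable-target clickable-card" aria-label="Ultra B Complex for €21.90 from vitaminexpress.org"></a>
'''

for href, label in extract_links(sample):
    print(href, "|", label)
```

The parser ignores the plain "Position : N" lines, so the whole file contents can be passed in as one string. It also decodes entities in attribute values, so an href containing &amp; comes back with a plain &.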

Either approach works. EDIT: to get the entire value of href, you can use this:

url = """<a href="http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&amp;noCS=1&amp;adword=google/PLA&amp;pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>"""

# Stop at whitespace or a double quote so the closing quote of the attribute is not captured
findall = re.findall(r'(https?://[^\s"]+)', url)

print(findall)

['http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&amp;noCS=1&amp;adword=google/PLA&amp;pk_campaign=google/PLA']
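Since the question also asks for the product name, here is a rough regex sketch that pulls href and aria-label from one line and then trims the label down to the name. It rests on assumptions taken from the sample data only: each tag sits on a single line, href appears before aria-label, and labels follow the shape "<product> for €<price> from <seller>". The names ANCHOR_RE, LABEL_RE and parse_line are made up for this example.

```python
import html
import re

# Assumes href comes before aria-label inside the same tag, as in the sample.
ANCHOR_RE = re.compile(r'href="([^"]*)"[^>]*?aria-label="([^"]*)"')
# Assumes labels look like "<product> for €<price> from <seller>".
LABEL_RE = re.compile(r'^(?P<name>.*) for €(?P<price>[\d.,]+) from (?P<seller>.*)$')


def parse_line(line):
    """Return (url, product_name) for a line containing a product link, else None."""
    match = ANCHOR_RE.search(line)
    if match is None:
        return None
    # Turn entities such as &amp; back into plain characters.
    url = html.unescape(match.group(1))
    label = html.unescape(match.group(2))
    parts = LABEL_RE.match(label)
    # Fall back to the whole label if it does not follow the assumed shape.
    name = parts.group("name") if parts else label
    return url, name


line = '<a href="https://shop.example/p?x=1&amp;y=2" id="a1" aria-label="Ultra B Complex for €21.90 from vitaminexpress.org"></a>'
print(parse_line(line))
```

In the file loop from the question, parse_line(line) could be called on each line and the None results skipped. If the attribute order or label wording varies in the real data, a proper HTML parser is the safer choice.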


 