Extracting specific strings from a text file in Python

Question

Hi, I have a copy of some HTML code in a text file, and I need to extract a few pieces of information from it. I managed to locate the relevant lines as shown below, but I can't find a reliable pattern for extracting the text itself.

Position : 27
        <a href="https://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&amp;noCS=1&amp;adword=google/PLA&amp;pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>

Position : 28
        <a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" id="vplap27" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Ultra B Complex for €21.90 from vitaminexpress.org" data-nt-icon-id="planti27" data-title-id="vplaurlt27" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>

Position : 29
        <a href="https://www.narayana-verlag.de/Vitalstoff-Komplex-von-Robert-Franz-90-Kapseln/b22970" id="vplap28" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Vitalstoff-Komplex - von Robert Franz - 90 Kapseln for €26.00 from Narayana Verlag" data-nt-icon-id="planti28" data-title-id="vplaurlt28" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>

I want to extract the URL that follows "href" and the product name that follows "aria-label". How can I do that in Python?

Currently I'm using the script below to find the lines that are of interest to me:

import psycopg2

file_path = '/Users/lins/Downloads/pladiv.txt'

try:
    with open(file_path, 'r') as file:
        print('entered loop')
        cnt = 1
        for line in file:
            # Only lines containing a product link card are of interest
            if 'pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="' in line:
                print('Position : ' + str(cnt))
                cnt = cnt + 1
                if 'href="' in line:
                    print(line)
                    fields = line.split(";")
                    #print(fields[0] + '  as URL')

except (Exception, psycopg2.Error) as error:
    # Report the problem instead of exiting silently
    print(error)
    quit()

Note: I was inserting the results into my PostgreSQL DB; that code has been removed from the sample above.

Answer 1

You can either use a regex, like this:

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'

# Raw string so that \w and \d reach the regex engine unchanged
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)

>>> print(urls)
['http://example.com', 'http://example2.com']

Or you can parse the file as HTML:

>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(url, 'html.parser')         # Soup(url, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']
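
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module, collecting the aria-label next to each href as the question asks. The class name LinkCollector and the shortened sample tag are made up for illustration; this is one possible sketch, not the only way to do it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, aria-label) pairs from every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attrs is a list of (name, value) pairs
        if "href" in attr_map:
            # aria-label may be absent, in which case None is stored
            self.links.append((attr_map["href"], attr_map.get("aria-label")))


# Shortened version of one line from the question (most attributes omitted)
sample = (
    '<a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" '
    'id="vplap27" aria-label="Ultra B Complex for €21.90 from vitaminexpress.org"></a>'
)

collector = LinkCollector()
collector.feed(sample)
for href, label in collector.links:
    print(href, "|", label)
```

The parser can be fed the whole file content at once, so the "Position" lines between the tags do not need to be filtered out first.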

Either approach works. EDIT: the regex above stops at the first character that is not part of the host name, so to get the entire value of href you can use this:

url = """<a href="http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&amp;noCS=1&amp;adword=google/PLA&amp;pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>"""

# Stop at whitespace or at the closing double quote so the quote is not captured
findall = re.findall(r'(https?://[^\s"]+)', url)

print(findall)

['http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&amp;noCS=1&amp;adword=google/PLA&amp;pk_campaign=google/PLA']
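
The question also asks for the product name after aria-label, and the captured URL still contains "&amp;" entities. A minimal regex sketch covering both is shown below. It assumes that href appears before aria-label inside the same tag, and that each label follows the "<name> for <price> from <seller>" shape seen in the three sample lines; the names ANCHOR_RE, extract_links and split_label, and the example.com URL, are hypothetical.

```python
import html
import re

# Assumption: href comes before aria-label inside the same <a ...> tag,
# as in the sample lines from the question.
ANCHOR_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*?aria-label="([^"]*)"')


def extract_links(text):
    """Return a list of (url, label) tuples with HTML entities decoded."""
    return [
        (html.unescape(url), html.unescape(label))
        for url, label in ANCHOR_RE.findall(text)
    ]


def split_label(label):
    """Split '<name> for <price> from <seller>' into its three parts.

    Assumption: the label follows that shape; other labels give empty parts.
    """
    rest, _, seller = label.rpartition(" from ")
    name, _, price = rest.rpartition(" for ")
    return name, price, seller


line = (
    '<a href="https://example.com/item?p=1&amp;noCS=1" id="vplap0" '
    'aria-label="Ultra B Complex for €21.90 from vitaminexpress.org"></a>'
)

links = extract_links(line)
for url, label in links:
    print(url)
    print(split_label(label))
```

Splitting from the right with rpartition keeps product names that themselves contain " for " or " from " intact as long as the price and seller come last.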

Answer 1
0 2020-01-09 11:18:05