Extracting a particular string from a text file in Python
Hi, I have a copy of some HTML code in a text file, and I need to extract a few pieces of information from it. I managed to get as far as the output below, but I can't work out a specific pattern for extracting the text.
Position : 27
<a href="https://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&noCS=1&adword=google/PLA&pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>
Position : 28
<a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" id="vplap27" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Ultra B Complex for €21.90 from vitaminexpress.org" data-nt-icon-id="planti27" data-title-id="vplaurlt27" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>
Position : 29
<a href="https://www.narayana-verlag.de/Vitalstoff-Komplex-von-Robert-Franz-90-Kapseln/b22970" id="vplap28" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="Vitalstoff-Komplex - von Robert Franz - 90 Kapseln for €26.00 from Narayana Verlag" data-nt-icon-id="planti28" data-title-id="vplaurlt28" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>
I want to extract the URL after "href" and the product name after the text "aria-label". How can I do that in Python?
Currently I'm using the script below to find the lines that are of interest to me:
import psycopg2

try:
    filePath = '/Users/lins/Downloads/pladiv.txt'
    with open(filePath, 'r') as file:
        print('entered loop')
        cnt = 1
        for line in file:
            # Only lines containing the product-card anchor are of interest
            if 'pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="' in line:
                print('Position : ' + str(cnt))
                cnt = cnt + 1
                if 'href="' in line:
                    print(line)
                    fields = line.split(";")
                    #print(fields[0] + ' as URL')
except (Exception, psycopg2.Error) as error:
    quit()
Note: I was inserting the results into my PostgreSQL DB; that code has been removed from the sample above.
You can either use a regex, like this:
import re
url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
print(urls)
['http://example.com', 'http://example2.com']
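The pattern above only finds the scheme and host part of each URL. For the question's data, where both the link and the label are needed, a minimal sketch could capture the quoted value of each attribute instead. The sample line is shortened from the question, and the helper name extract_link_and_label is made up for illustration; it assumes attribute values are always wrapped in double quotes.

```python
import re

# Sample line shortened from the question's data (most attributes dropped).
line = ('<a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" '
        'id="vplap27" class="plantl" '
        'aria-label="Ultra B Complex for €21.90 from vitaminexpress.org"></a>')

# Capture everything between the double quotes of each attribute.
# The leading \s keeps attributes such as data-href from matching.
HREF_RE = re.compile(r'\shref="([^"]*)"')
LABEL_RE = re.compile(r'\saria-label="([^"]*)"')

def extract_link_and_label(text):
    """Return (href, aria_label), or None if either attribute is missing."""
    href = HREF_RE.search(text)
    label = LABEL_RE.search(text)
    if href is None or label is None:
        return None
    return href.group(1), label.group(1)

result = extract_link_and_label(line)
print(result)
```

Calling the helper on each line of the file inside the existing loop would give one (URL, label) pair per matching line.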
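Or you can parse the file as HTML:
>>> from bs4 import BeautifulSoup as Soup
>>> s = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
>>> html = Soup(s, 'html.parser')  # or Soup(s, 'lxml') if lxml is installed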
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']
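If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration: the class name AnchorCollector is made up here, and the sample string is shortened from the question's data to just the two attributes of interest.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, aria-label) pairs from every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)  # attrs arrives as a list of (name, value) tuples
        href = attributes.get("href")
        if href is not None:
            # aria-label may be absent, in which case None is stored
            self.links.append((href, attributes.get("aria-label")))

# Two anchors shortened from the question's data.
sample = (
    '<a href="https://www.vitaminexpress.org/de/ultra-b-complex-vitamin-b-kapseln" '
    'aria-label="Ultra B Complex for €21.90 from vitaminexpress.org"></a>'
    '<a href="https://www.narayana-verlag.de/Vitalstoff-Komplex-von-Robert-Franz-90-Kapseln/b22970" '
    'aria-label="Vitalstoff-Komplex - von Robert Franz - 90 Kapseln for €26.00 from Narayana Verlag"></a>'
)

collector = AnchorCollector()
collector.feed(sample)
collector.close()
links = collector.links
for href, label in links:
    print(href, '|', label)
```

With the real file, the whole file content could be passed to feed() in one go rather than line by line.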
Either approach works.
EDIT: to get the entire value of href, you can use this:
url = """<a href="http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&noCS=1&adword=google/PLA&pk_campaign=google/PLA" id="vplap26" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" class="plantl pla-unit-single-clickable-target clickable-card" rel="noopener noreferrer" target="_blank" aria-label="0,12 € / 1,00 St. Doppelherz Folsäure 800+B-Vitamine Tabletten 40 St. for €4.64 from fliegende-pillen.de" data-nt-icon-id="planti26" data-title-id="vplaurlt26" jsaction="mouseover:pla.sntiut;mouseout:pla.hntiut"></a>"""
# Exclude the double quote as well as whitespace, so the closing quote of the attribute is not captured
findall = re.findall(r'(https?://[^\s"]+)', url)
print(findall)
['http://www.fliegende-pillen.de/product/doppelherz-folsaeure-800-b-vitamine-tabletten.230477.html?p=466453&noCS=1&adword=google/PLA&pk_campaign=google/PLA']
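For the product name, the three aria-label values in the question all look like "<name> for €<price> from <seller>". Assuming that layout holds for every line (it is only inferred from those three samples), a rough sketch for splitting an already extracted label could look like this. The function name split_label is hypothetical.

```python
import re

# Assumed layout, inferred from the question's samples:
#   "<name> for €<price> from <seller>"
LABEL_RE = re.compile(r'^(?P<name>.+) for €(?P<price>[\d.,]+) from (?P<seller>.+)$')

def split_label(label):
    """Return (name, price, seller), or None if the label has a different layout."""
    match = LABEL_RE.match(label)
    if match is None:
        return None
    return match.group("name"), match.group("price"), match.group("seller")

parts = split_label("Ultra B Complex for €21.90 from vitaminexpress.org")
print(parts)
```

Note that a unit-price prefix, such as the "0,12 € / 1,00 St." in the first sample, would stay part of the name with this pattern.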