
Extract parts of text (html) file based on characters before & after with python

I am trying to build a script that extracts specific parts (namely each link and its related description) from an HTML file and returns the results one per line.

I'm trying to build it using Python lists, but I'm making a mistake somewhere.

This is what I've done so far, but my values list comes back empty:


import re

def subtext (data, first_link, last_link, first_descr, last_descr):
    values = []
    
    link = re.search('''"first_link"(.+?)"last_link"''', data)
    values.append(link)
    descr = re.search('''"first_descr"(.+?)"last_descr"''', data)
    values.append(descr)
    while values:
        print(values)


html_file = input ("Type filepath: ")
html_code = open (html_file, "r")
html_data = html_code.read()


subtext (html_data, '''11px;"><a href=''', ''' target="_blank"  ''', '''  title="Relative document">''', '''</a></td><td style="font-''')


html_code.close()

Python has a built-in HTML parser. But if you want to keep your own code, you need to fix these mistakes:

link = re.search('''"first_link"(.+?)"last_link"''', data)
values.append(link)

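First of all, your regex searches for the literal strings "first_link" and "last_link" (quotes included) instead of the values passed as function arguments. Build the pattern from the arguments, for example with .format, and wrap each argument in re.escape so that any regex metacharacters in the delimiters are matched literally. Also, in the code above link is a re.Match object (or None if nothing matched), not a string. Use group(1) to get the captured text from the object, after making sure that something was found. The same applies to the next re.search.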
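A minimal sketch of the function with both fixes applied might look like the following. The `sample` string and the `delims` tuple are made up for illustration and only mimic the markup implied by the delimiters in the question; returning the list instead of printing it is also my own choice.

```python
import re

def subtext(data, first_link, last_link, first_descr, last_descr):
    """Return [link, description]; an entry is None if its delimiters are not found."""
    values = []
    for first, last in ((first_link, last_link), (first_descr, last_descr)):
        # Build the pattern from the arguments; re.escape makes the
        # delimiters match literally even if they contain regex metacharacters.
        pattern = "{}(.+?){}".format(re.escape(first), re.escape(last))
        match = re.search(pattern, data)
        # re.search returns None when nothing matches, so check before group().
        values.append(match.group(1) if match else None)
    return values

# Hypothetical input, shaped like the delimiters used in the question.
sample = ('<td style="font-size:11px;"><a href="doc.html" target="_blank"  '
          'title="Relative document">My doc</a></td><td style="font-weight:bold">')
delims = ('11px;"><a href=', ' target="_blank"  ',
          '  title="Relative document">', '</a></td><td style="font-')

print(subtext(sample, *delims))
```

Note that re.search only finds the first occurrence; if the file contains several links, re.findall or re.finditer with the same pattern would be the natural next step.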

   while values:
      print(values)

Here you get an infinite loop of prints: values is never emptied inside the loop, so the condition stays true forever. Simply call print(values) without any loop.
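As for the HTML parser mentioned at the top of this answer, here is a rough sketch of that route, assuming the standard library's html.parser module is what was meant. The LinkCollector class name and the sample markup are hypothetical, and the description is taken to be the text inside each <a> element.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkCollector()
parser.feed('<td><a href="doc.html" target="_blank" '
            'title="Relative document">My doc</a></td>')
for href, text in parser.links:
    print(href, text)
```

This avoids depending on the exact spacing and attribute order around each link, which the delimiter-based regex is sensitive to.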
