Extracting parts of a text (html) file with Python based on the preceding and following characters

Question

I am trying to build a script that will extract specific parts (namely the link and its related description) out of an html file and return the result per line.

I'm trying to build it using lists in Python, yet I'm making a mistake somehow!

This is what I've done so far, but it returns my values list blank:


import re

def subtext (data, first_link, last_link, first_descr, last_descr):
    values = []
    
    link = re.search('''"first_link"(.+?)"last_link"''', data)
    values.append(link)
    descr = re.search('''"first_descr"(.+?)"last_descr"''', data)
    values.append(descr)
    while values:
        print(values)


html_file = input ("Type filepath: ")
html_code = open (html_file, "r")
html_data = html_code.read()


subtext (html_data, '''11px;"><a href=''', ''' target="_blank"  ''', '''  title="Relative document">''', '''</a></td><td style="font-''')


html_code.close()

Answer 1

There is an HTML parser for Python. But if you want to use your own code, you need to fix these mistakes:

link = re.search('''"first_link"(.+?)"last_link"''', data)
values.append(link)

First of all, your regex searches for the literal strings "first_link" and "last_link" instead of the values passed in as function arguments. Use .format to build the pattern string from the arguments. Also, in the code above, link will be a re.Match object, not a string. Use group() to pull the string out of that object; just make sure the search actually found something first. The same applies to the next re.search.
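As a rough sketch of the two fixes above (not the asker's final code): the helper name extract_between and the sample string are made up for illustration. It also wraps the markers in re.escape, which the answer does not mention; that is an extra assumption on my part, added so that characters like `.` or `?` inside the markers are matched literally.

```python
import re

def extract_between(data, first, last):
    """Return the text between `first` and `last`, or None if not found."""
    # Build the pattern from the arguments instead of hard-coding their names.
    # re.escape (an addition, not from the answer) keeps the markers literal.
    pattern = "{}(.+?){}".format(re.escape(first), re.escape(last))
    match = re.search(pattern, data)
    # re.search returns None when nothing matches, so check before group().
    return match.group(1) if match else None

# Hypothetical sample row, shaped like the markers used in the question.
sample = ('<td style="font-size:11px;"><a href="doc.html" target="_blank"  '
          'title="Relative document">My doc</a></td><td style="font-')

link = extract_between(sample, '11px;"><a href=', ' target="_blank"  ')
descr = extract_between(sample, '  title="Relative document">',
                        '</a></td><td style="font-')
print([link, descr])
```

Returning None for a missing match leaves it to the caller to decide whether to skip the entry or report it.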

   while values:
      print(values)

Here you will get stuck in an infinite loop of prints, because values is never emptied. Simply call print(values) without any loop.
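For the parser route mentioned at the top, here is a minimal sketch, assuming the parser meant is the standard library's html.parser module and that the description you want is the text inside each `<a>` tag. The class name LinkCollector and the helper extract_links are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

row = '<td><a href="doc.html" target="_blank" title="Relative document">My doc</a></td>'
for href, text in extract_links(row):
    print(href, text)
```

This does not depend on the exact spacing or attribute order around the link, which is where the marker-based regex is most fragile.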

Answer 1
0 2021-02-12 18:46:53
