Use regular expressions to extract text from HTML code in Python
I have a body of HTML code scraped from a website using BeautifulSoup. I want to use regular expressions in Python to extract a portion of a URL from the HTML code. Here is a portion of the HTML:
<link rel="stylesheet" type="text/css" href="/include/xbrlViewerStyle.css">
<style type="text/css">li.octave {border-top: 1px solid black;}</style>
<!--[if lt IE 8]>
<style type="text/css">
li.accordion a {display:inline-block;}
li.accordion a {display:block;}
</style>
<![endif]-->
<script type="text/javascript" language="javascript">
var InstanceReportXslt = "/include/InstanceReport.xslt";
var reports = new Array(161);
reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
reports[1+1] = "/Archives/edgar/data/49196/000004919618000008/R2.htm";
reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";
reports[5+1] = "/Archives/edgar/data/49196/000004919618000008/R6.htm";
reports[6+1] = "/Archives/edgar/data/49196/000004919618000008/R7.htm";
reports[7+1] = "/Archives/edgar/data/49196/000004919618000008/R8.htm";
reports[8+1] = "/Archives/edgar/data/49196/000004919618000008/R9.htm";
reports[9+1] = "/Archives/edgar/data/49196/000004919618000008/R10.htm";
reports[10+1] = "/Archives/edgar/data/49196/000004919618000008/R11.htm"
I want to use regular expressions to identify "R4" and extract "/Archives/edgar/data/49196/000004919618000008/R4.htm".
You can use this expression:
>>> import re
>>> s = '''reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
... reports[1+1] = "/Archives/edgar/data/49196/000004919618000008/R2.htm";
... reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
... reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
... reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";
... reports[5+1] = "/Archives/edgar/data/49196/000004919618000008/R6.htm";
... reports[6+1] = "/Archives/edgar/data/49196/000004919618000008/R7.htm";
... reports[7+1] = "/Archives/edgar/data/49196/000004919618000008/R8.htm";'''
>>> for i in re.findall(r'([\w./]+R4[\w./]+)', s):
...     print(i)
...
/Archives/edgar/data/49196/000004919618000008/R4.htm
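One caveat with the pattern above: because it allows any characters after "R4", it would also match paths such as R40.htm or R41.htm if the full page contains them (the array is declared with 161 entries, so this seems likely). A stricter variant, as a minimal sketch: the helper name `find_report_url` is hypothetical, and the R40/R41 lines in the sample are made up to follow the same pattern as the lines in the question.

```python
import re

# Sample text; the R40 and R41 lines are invented for illustration only.
html = '''reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[39+1] = "/Archives/edgar/data/49196/000004919618000008/R40.htm";
reports[40+1] = "/Archives/edgar/data/49196/000004919618000008/R41.htm";'''


def find_report_url(text, report_name):
    """Return the first double-quoted path ending in '/<report_name>.htm', or None."""
    # Anchor on the surrounding quotes and on '.htm' so that 'R4' cannot match 'R40'.
    pattern = r'"([^"]*/' + re.escape(report_name) + r'\.htm)"'
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(find_report_url(html, "R4"))
# /Archives/edgar/data/49196/000004919618000008/R4.htm
```

Anchoring on the quotes and the file extension keeps the match to exactly one report, and `re.escape` guards against report names that contain regex metacharacters.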