How can I parse HTML code with “html written” URL in Python?
I am starting to program in Python, and I have been reading a couple of posts that say I should use an HTML parser, rather than regular expressions (re), to get a URL from a text.
I have the source code, which I got from page.read() using urllib and urlopen.
Now, my problem is that the parser is removing the URL part from the text.
Also, if I have read correctly, after var = page.read(), is var stored as a string?
How can I tell it to give me the text between 2 "tags"? The URL is always in between 'flv=' and ';', so it doesn't start with href, which is what the parsers look for, and it doesn't contain http:// either.
I have read many posts, but it seems they all look for href in the code.
Do I have it all completely wrong?
Thank you!
You could consider implementing your own search / grab. In pseudocode, it would look a little like this:
find location of 'flv=' in HTML, then add the length of 'flv=' = location_start
find location of the first ';' in HTML after location_start = location_end
grab everything in between: HTML[location_start : location_end]
You should be able to implement this in Python.
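The pseudocode above could be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function name extract_between and the sample string are made up for illustration, and it assumes the page decodes as UTF-8. It skips past the 'flv=' marker itself and only looks for ';' after that point, so an earlier ';' in the page does not cut the result short.

```python
def extract_between(html, start_marker="flv=", end_marker=";"):
    """Return the text between start_marker and the next end_marker, or None."""
    # On Python 3, page.read() returns bytes, so decode before using str methods.
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")

    start = html.find(start_marker)
    if start == -1:
        return None  # start marker not present
    # Skip past the marker itself so it is not part of the result.
    start += len(start_marker)

    # Search for the end marker only after the start position.
    end = html.find(end_marker, start)
    if end == -1:
        return None  # no end marker after the start marker
    return html[start:end]


# Hypothetical sample text standing in for the result of page.read().
sample = '<param name="flashvars" value="id=7;flv=videos/clip01.flv;autoplay=1">'
print(extract_between(sample))  # videos/clip01.flv
```

Returning None when either marker is missing lets the caller tell "not found" apart from an empty match.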
Good luck!