
Parsing HTML in python3, re, html.parser, or something else?

I'm trying to get a list of craigslist states and their associated URLs. Don't worry, I have no intention of spamming; if you're wondering what this is for, see the * below.

What I'm trying to extract begins on the line after 'us states' and is the next 50 <li>s. I read through html.parser's docs and it seemed too low-level for this, aimed more at building a DOM parser or doing syntax highlighting/formatting in an IDE than at searching, which makes me think my best bet is using re's. I would like to keep myself to what's in the standard library, just for the sake of learning. I'm not asking for help writing a regular expression, I'll figure that out on my own; I just want to make sure there isn't a better way to do this before spending the time on that.
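(For reference, here is a minimal, hedged sketch of how html.parser could be used for this kind of searching while staying in the standard library: subclass HTMLParser and collect the href of every <a> found inside an <li>. The URL and the idea of slicing the 50 US-state entries out of the result afterwards are assumptions, not part of the original post.)

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LiLinkParser(HTMLParser):
        """Collect (text, href) pairs for <a> tags that appear inside <li> elements."""

        def __init__(self):
            super().__init__()
            self.inside_li = False
            self.current_href = None
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "li":
                self.inside_li = True
            elif tag == "a" and self.inside_li:
                self.current_href = dict(attrs).get("href")

        def handle_endtag(self, tag):
            if tag == "li":
                self.inside_li = False
            elif tag == "a":
                self.current_href = None

        def handle_data(self, data):
            # record the link text together with its href while inside <li><a>...</a>
            if self.current_href and data.strip():
                self.links.append((data.strip(), self.current_href))

    # hypothetical URL; filtering down to the block after 'us states' is left out here
    page = urlopen("https://www.craigslist.org/about/sites").read().decode("utf-8")
    parser = LiLinkParser()
    parser.feed(page)
    print(parser.links[:5])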

*This is my first program, or anything beyond simple Python scripts. I'm making a C++ program to manage my posts and remind me when they've expired in case I want to repost them, and a Python script to download a list of all of the US states and cities/areas in order to populate a combobox in the GUI. I don't really need it, but I'm aiming to make this 'production ready'/feature complete, both as a learning exercise and to build a portfolio that might help me get a job. I don't know if I'll make the program publicly available or not; there's obvious potential for misuse, and it's probably against their ToS anyway.

There is xml.etree, an XML parser, available in the Python standard library itself. You should not use regex for parsing XML/HTML. Go to the particular node where you find the information and extract the links from it.
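(A rough illustration of that suggestion, not from the original answer: xml.etree.ElementTree can pull the hrefs out of a node, but only from well-formed markup, which real-world HTML often is not. The snippet below is invented for the example.)

    import xml.etree.ElementTree as ET

    # invented, well-formed snippet standing in for the relevant part of the page
    snippet = """
    <ul>
      <li><a href="https://auburn.craigslist.org">auburn</a></li>
      <li><a href="https://bham.craigslist.org">birmingham</a></li>
    </ul>
    """

    root = ET.fromstring(snippet)
    # walk to the <a> elements inside each <li> and pull out text and href
    links = [(a.text, a.get("href")) for a in root.findall("./li/a")]
    print(links)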

Use lxml.html. It's the best Python HTML parser, and it supports XPath!
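(A short sketch of what that might look like. lxml is a third-party package, not part of the standard library, and the URL and XPath here are assumptions about the page structure.)

    from urllib.request import urlopen
    from lxml import html

    # hypothetical URL; the XPath assumes the state links sit in <li><a> elements
    tree = html.parse(urlopen("https://www.craigslist.org/about/sites"))
    for a in tree.xpath("//li/a")[:5]:
        print(a.text_content(), a.get("href"))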
