Parsing HTML in Python 3: re, html.parser, or something else?
I'm trying to get a list of craigslist states and their associated URLs. Don't worry, I have no intention of spamming; if you're wondering what this is for, see the * below.
What I'm trying to extract begins on the line after 'us states' and is the next 50 <li> elements. I read through the html.parser docs and it seemed too low-level for this, aimed more at building a DOM parser or syntax highlighting/formatting in an IDE than at searching, which makes me think my best bet is regular expressions (the re module). I would like to keep myself to what's in the standard library, just for the sake of learning. I'm not asking for help writing a regular expression, I'll figure that out on my own; I just want to make sure there isn't a better way to do this before spending the time on it.
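
For comparison with the regex route, here is a minimal sketch of how html.parser could be used for this kind of search with a small subclass. It assumes the page has a text node reading 'us states' followed by <li> items that each wrap an <a href="...">; the sample markup, the class and function names, and the example.org URLs are all hypothetical and not craigslist's actual HTML.

```python
from html.parser import HTMLParser


class StateLinkParser(HTMLParser):
    """Collect (name, url) pairs from <li><a href=...> items after a marker text."""

    def __init__(self, marker="us states", limit=50):
        super().__init__()
        self.marker = marker        # text that precedes the list we want
        self.limit = limit          # stop collecting after this many links
        self.seen_marker = False
        self.in_li = False
        self.current_href = None    # href of the <a> currently being read
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore everything before the marker and after the limit is reached.
        if not self.seen_marker or len(self.links) >= self.limit:
            return
        if tag == "li":
            self.in_li = True
        elif tag == "a" and self.in_li:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if not self.seen_marker:
            if data.strip().lower() == self.marker:
                self.seen_marker = True
        elif self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            name = "".join(self.current_text).strip()
            self.links.append((name, self.current_href))
            self.current_href = None
        elif tag == "li":
            self.in_li = False


# Hypothetical sample markup (an assumption, not the real page).
SAMPLE_HTML = """
<h1>us states</h1>
<ul>
  <li><a href="https://example.org/alabama">alabama</a></li>
  <li><a href="https://example.org/alaska">alaska</a></li>
</ul>
<h1>canada</h1>
<ul>
  <li><a href="https://example.org/alberta">alberta</a></li>
</ul>
"""


def extract_state_links(html, limit=50):
    """Return up to `limit` (name, url) pairs found after the marker text."""
    parser = StateLinkParser(limit=limit)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # limit=2 only because the sample has two "state" entries.
    print(extract_state_links(SAMPLE_HTML, limit=2))
```

The parser only tracks three pieces of state (marker seen, inside an <li>, inside an <a>), so it stays within the standard library and does not depend on how the page breaks its lines, which is where a regex tends to be fragile.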
*This is my first program, or at least my first project beyond simple Python scripts. I'm making a C++ program to manage my posts and remind me when they've expired in case I want to repost them, and a Python script to download a list of all the US states and cities/areas in order to populate a combobox in the GUI. I don't really need it, but I'm aiming to make this 'production ready'/feature complete, both as a learning exercise and to build a portfolio that might help me get a job. I don't know whether I'll make the program publicly available; there's obvious potential for misuse, and it's probably against their ToS anyway.