
Urllib returning html but no closing paragraph tags

I am scraping the Presidential debate transcripts. I noticed that when my scraper pulls the html elements it never pulls a paragraph-end tag ( </p> ).

e.g.:

Checking the source in the browser (from Chrome's View > Developer > View Source):

import urllib.request

url_to_scrape = 'http://www.presidency.ucsb.edu/ws/index.php?pid=119039'
req = urllib.request.Request(url_to_scrape)
resp = urllib.request.urlopen(req)
resp.read()  # raw bytes of the page as received

Python result:

I figure there are one of two things going on:

  1. urllib is somehow dropping closing tags (for just paragraphs; the rest are fine)
  2. The raw source doesn't include closing tags, and the browser is filling them in.

How do I figure out which one it is, and then correct for it?
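One way to tell the two cases apart, as a rough sketch (the regex-based tag counting below is my own approximation and not part of the original post): fetch the page with urllib, just as above, and count how many opening and closing paragraph tags appear in the raw response, before any browser-side repair can happen.

import re
import urllib.request

url_to_scrape = 'http://www.presidency.ucsb.edu/ws/index.php?pid=119039'
raw = urllib.request.urlopen(url_to_scrape).read().decode('utf-8', errors='replace')

# Count paragraph tags in the text urllib actually received.
# If </p> never shows up here, the raw source omits the closing tags.
opens = len(re.findall(r'<p[\s>]', raw, flags=re.IGNORECASE))
closes = len(re.findall(r'</p>', raw, flags=re.IGNORECASE))
print(opens, 'opening <p> tags,', closes, 'closing </p> tags')

If the second count is zero (or far smaller than the first), the server is sending unclosed paragraphs and the browser is the one filling them in.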

Can you check the actual packet that Chrome received? In some circumstances, Chrome will detect and correct small omissions like this one in order to display the page, even if they're not in the packet. My guess is that Chrome fixed this, and the actual source is bad.
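If the raw source really does omit the closing tags, one way to compensate (a minimal sketch, assuming the beautifulsoup4 and html5lib packages are installed; this is not part of the original answer) is to hand the response to a lenient HTML parser, which rebuilds the tree much as a browser does:

import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4 html5lib

url_to_scrape = 'http://www.presidency.ucsb.edu/ws/index.php?pid=119039'
html = urllib.request.urlopen(url_to_scrape).read()

# The 'html5lib' builder follows the same parsing rules a browser uses,
# so unclosed <p> elements are closed automatically while the tree is built.
soup = BeautifulSoup(html, 'html5lib')
for p in soup.find_all('p'):
    print(p.get_text())

In other words, let the parser, not the raw string, recover from the missing closing tags, which is exactly what Chrome is doing when it displays the page.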
