
Quotes Messing Up Python Scraper

I am trying to scrape all the data within a div, as shown below. However, the quotes are throwing me off.

<div id="address">
    <div class="info">14955 Shady Grove Rd.</div> 
    <div class="info">Rockville, MD 20850</div> 
    <div class="info">Suite: 300</div> 
</div>

I am trying to start with something along the lines of:

addressStart = page.find("<div id="address">")

but the quotes within the div are messing me up. Does anybody know how I can fix this?

To answer your specific question: you need to escape the inner quotes, or delimit the string itself with a different type of quote:

addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
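Both spellings produce the same string, so either works with str.find. A small sketch to show that (the `page` value here is a made-up stand-in for the downloaded HTML, not the real page):

```python
# Hypothetical stand-in for the downloaded page source.
page = '<div id="address"><div class="info">Suite: 300</div></div>'

escaped = "<div id=\"address\">"   # inner double quotes escaped with backslashes
single = '<div id="address">'      # single-quoted literal, no escaping needed

# The two literals are the same string once parsed.
assert escaped == single

# str.find returns the index of the first match, or -1 if there is none.
address_start = page.find(single)
print(address_start)
```

Note that str.find only gives you an offset into the text; you would still have to locate the matching closing tag yourself.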

But don't do that. If you are trying to "parse" HTML, let a third-party library do it. Try Beautiful Soup. You get back a nice object which you can traverse or search. You can grab attributes, values, etc. without having to worry about the complexities of parsing HTML or XML:

from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(page, 'html.parser')
for address in soup.find_all('div', id='address'):  # returns a list; use find() if you only want the first match
    for info in address.find_all('div', class_='info'):  # class is a reserved word, so the keyword is class_
        print(info.string)
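For something you can run end to end, here is a minimal sketch using the HTML from the question as the input. It assumes the `beautifulsoup4` package is installed; the helper name `extract_address_lines` is made up for illustration and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# The markup from the question, used as sample input.
page = """
<div id="address">
    <div class="info">14955 Shady Grove Rd.</div>
    <div class="info">Rockville, MD 20850</div>
    <div class="info">Suite: 300</div>
</div>
"""


def extract_address_lines(html):
    """Return the text of each div.info inside the first div#address."""
    soup = BeautifulSoup(html, "html.parser")
    address = soup.find("div", id="address")  # first match, or None
    if address is None:
        return []
    # get_text(strip=True) trims surrounding whitespace from each entry.
    return [info.get_text(strip=True)
            for info in address.find_all("div", class_="info")]


lines = extract_address_lines(page)
print(lines)
```

Returning a list instead of printing inside the loop makes it easier to join the lines into one address string or store them later.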


 