Python Data Scraper

I wrote the following code:

#!/usr/bin/python
#weather.scraper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scraper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    table = soup.find_all("table", class_="responsive airport-history-summary-table")
    tr = soup.find_all("tr")
    td = soup.find_all("td")
    print table
            

if __name__ == "__main__":
    main()

When I print the table, I get all the HTML (td, tr, span, etc.) as well. How can I print the content of the table (tr, td) without the HTML?
Thanks!

You have to use the .getText() method when you want to get the text content of an element. Since find_all returns a list of elements, you have to pick one of them first (for example td[0]) and call the method on that.
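As a minimal sketch of that point, the snippet below runs against a small inline HTML string rather than the real page; the markup in SAMPLE_HTML is made up for illustration. It uses Python 3 syntax and get_text(), which is the same method as getText() in bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = "<table><tr><td><span>Max</span></td><td>42</td></tr></table>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like ResultSet of Tag objects, not a single Tag.
cells = soup.find_all("td")

# Pick one element, then ask it for its text; the nested <span> tag is dropped.
first_text = cells[0].get_text()

# Or take the text of every element in the list.
all_text = [cell.get_text() for cell in cells]

print(first_text)
print(all_text)
```

Calling get_text() on the ResultSet itself would fail, because the method belongs to the individual elements.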

Or you can do, for example:

for tr in soup.find_all("tr"):
    print '>>>> NEW row <<<<'
    print '|'.join([x.getText() for x in tr.find_all('td')])

The loop above prints each row on its own line, with the row's cells joined by |.

Note that your code finds all td's and all tr's in the whole document, but you probably want only those inside the table.

If you want to look for elements inside the table, you have to do this:

Use table.find('tr') instead of soup.find('tr'), so that BeautifulSoup looks for tr's inside the table instead of in the whole HTML. Here table must be a single element (one item of the list returned by find_all, or the result of soup.find), not the list itself.
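The difference between searching from soup and searching from a table element can be sketched as follows. The markup and the shortened class name are assumptions made for the example, not the real page structure, and the code is Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only one of which carries the target class.
SAMPLE_HTML = """
<table class="other"><tr><td>skip</td></tr></table>
<table class="airport-history-summary-table">
  <tr><td>Max</td><td>42</td></tr>
  <tr><td>Min</td><td>17</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.find returns a single Tag (or None), so it can be searched further.
table = soup.find("table", class_="airport-history-summary-table")

rows_in_table = table.find_all("tr")  # only rows inside the chosen table
rows_in_page = soup.find_all("tr")    # every row in the document

# One string per row of the chosen table, cells separated by |.
lines = ["|".join(td.get_text() for td in tr.find_all("td"))
         for tr in rows_in_table]

print(len(rows_in_table), len(rows_in_page))
print(lines)
```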

Your code, modified (according to your comment that there are more tables):

#!/usr/bin/python
#weather.scraper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scraper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        print '>>>>>>> NEW TABLE <<<<<<<<<'

        trs = table.find_all("tr")

        for tr in trs:
            # for each row of current table, write it using | between cells
            print '|'.join([x.get_text().replace('\n','') for x in tr.find_all('td')])



if __name__ == "__main__":
    main()
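The script above is Python 2 (print statements, urllib.urlopen). Below is a rough Python 3 sketch of the same parsing loop, separated from the download so it can be tried on any HTML string. The helper name extract_tables and the sample markup are hypothetical. Unlike the script above, it skips rows that contain no td cells and strips surrounding whitespace from each cell with get_text(strip=True). In Python 3 the download itself would go through urllib.request.urlopen instead of urllib.urlopen.

```python
from bs4 import BeautifulSoup


def extract_tables(html):
    """Return every table as a list of rows, each row a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            # strip=True trims whitespace and newlines around the cell text
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # skip rows without <td>, e.g. header rows built from <th>
                rows.append(cells)
        tables.append(rows)
    return tables


# Hypothetical sample markup with two tables.
SAMPLE_HTML = """
<table><tr><th>Stat</th><th>Value</th></tr>
<tr><td> Max </td><td>
42</td></tr></table>
<table><tr><td>Rain</td><td>0.1</td></tr></table>
"""

result = extract_tables(SAMPLE_HTML)
for rows in result:
    print(">>>>>>> NEW TABLE <<<<<<<<<")
    for cells in rows:
        print("|".join(cells))
```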
