
How do I extract info from this table using python (ideally BeautifulSoup)

I'm attempting to gather information from this page: http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database#q/page=2

Particularly, I'm trying to gather information from the table using BeautifulSoup. I have the following code:

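import urllib2
from bs4 import BeautifulSoup

pagelink = 'http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database#q/page=2'
page = urllib2.urlopen(pagelink)
soup = BeautifulSoup(page, 'html.parser')
# prettify() returns a string; it does not modify soup in place
print(soup.prettify())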

When I do this the contents of the table (within the "tablebody" tag) do not show up. Why is this? How would I go about extracting information from this table?

The content you are looking for does NOT come from that URL.

When you browse a page in a modern web browser such as Chrome, what you see normally does not come entirely from the URL you requested. The whole process is: get the content from the URL you originally requested -> parse the content -> load CSS/JavaScript/images (from different URLs most of the time) -> lay out the page and make extra requests as the CSS/JavaScript asks. It might look as if everything came from the URL you typed into the address bar, but in reality the browser does a lot of work behind the scenes to fully render a web page for you.

Now, back to the page you are browsing: the content of that table is actually populated by JavaScript. The browser parses and runs the script first, and the script then makes extra requests to fetch the content and render it into the full page.
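
A related detail, shown here as a small Python 3 sketch: the "#q/page=2" part of the URL is a fragment. Fragments are handled by the browser (here, by the page's JavaScript) and are not sent to the server in the HTTP request, so a plain HTTP client only ever asks for the part before the "#". The variable names below are only illustrative.

```python
from urllib.parse import urldefrag

page_link = "http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database#q/page=2"

# urldefrag splits a URL into the part that is requested from the server
# and the fragment that stays on the client side.
requested_url, fragment = urldefrag(page_link)

print(requested_url)  # the URL the server actually sees
print(fragment)       # the page selector, only visible to JavaScript in the browser
```

This is why asking for "page=2" through the fragment has no effect on the HTML that urllib2 receives.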

You can use tools such as Fiddler or Charles to capture the whole process and analyze the traffic to find out what happens behind the scenes. In this case, it's a POST request that fetches the content for that table:

POST http://www.gatesfoundation.org/services/gfo/search.ashx HTTP/1.1
Host: www.gatesfoundation.org
Connection: keep-alive
Content-Length: 209
Accept: */*
Origin: http://www.gatesfoundation.org
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Content-Type: application/json; charset=UTF-8;
Referer: http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cookie: gfo#lang=en; ASP.NET_SessionId=bdgjkbuyxxxcmfm40ejl2j1j; s_vnum=1641950372052%26vn%3D1; s_vi=[CS]v1|2C3C15910519363E-60000611E0003318[CE]; _vwo_uuid_v2=226610E3774AD35E29B29E7C20948349|f180edd6ae6830ab3de2432cd15b0bd4; __atuvc=3%7C2; __atuvs=58782b230157ce4a002; s_cc=true; s_nr=1484270424338; s_lv=1484270424339; s_lv_s=First%20Visit; s_invisit=true; gpv_p14=Awarded%20Grants; gpv_p19=How%20We%20Work; gpv_p21=no%20value; s_ppn=Awarded%20Grants; s_ppvl=Awarded%2520Grants%2C39%2C39%2C638%2C1366%2C638%2C1366%2C768%2C1%2CP; s_sq=%5B%5BB%5D%5D; s_ppv=Awarded%2520Grants%2C67%2C67%2C638%2C1366%2C638%2C1366%2C768%2C1%2CP

{"freeTextQuery":"","fieldQueries":"(@gfomediatype==\"Grant\")","facetsToRender":["gfocategories","gfotopics","gfoyear","gforegions"],"page":"2","resultsPerPage":"12","sortBy":"gfodate","sortDirection":"desc"}

And the response is JSON formatted:

{
  "topResults": [],
  "results": [
    {
      "amount": 648140,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-19T08:00:00",
      "description": "to validate biomarkers of growth stunting and environmental enteric dysfunction for the purpose of better understanding and diagnosing these related disease states",
      "grantee": "Stanford University",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Stanford University",
      "topics": [
        "Enteric Diseases and Diarrhea"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1161946",
      "year": "2016"
    },
    {
      "amount": 550000,
      "categories": [
        "Global Development"
      ],
      "date": "2016-12-15T08:00:00",
      "description": "to provide vital life-saving and sustaining support to populations most affected by conflict in Syria",
      "grantee": "World Vision",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "World Vision",
      "topics": [
        "Emergency Response"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169747",
      "year": "2016"
    },
    {
      "amount": 3315475,
      "categories": [
        "Global Development"
      ],
      "date": "2016-12-15T08:00:00",
      "description": "to fund activities focused on generating political will and building momentum for investment in nutrition at country level and supporting the development and implementation of the nutrition...",
      "grantee": "African Development Bank",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "African Development Bank",
      "topics": [
        "Nutrition"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1158425",
      "year": "2016"
    },
    {
      "amount": 500,
      "categories": [
        "Special Projects"
      ],
      "date": "2016-12-14T08:00:00",
      "description": "to provide for general operating support",
      "grantee": "City Club",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "City Club",
      "topics": [
        "Community Grants"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169105",
      "year": "2016"
    },
    {
      "amount": 78522,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-12T08:00:00",
      "description": "to make the first description of specific histo-blood group antigens (HBGAs) in Zambian children and to assess their influence on immunogenicity of rotavirus vaccines.",
      "grantee": "CIDRZ",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "CIDRZ",
      "topics": [
        "Enteric Diseases and Diarrhea",
        "Vaccine Delivery",
        "Vaccine Development"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1162810",
      "year": "2016"
    },
    {
      "amount": 300000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-09T08:00:00",
      "description": "to provide matching i3 funds with the goal of building professional capacity through effective professional development for teacher leaders and principals to improve college ready outcomes...",
      "grantee": "Leading Educators Inc",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Leading Educators Inc",
      "topics": [
        "K-12 Education"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169456",
      "year": "2016"
    },
    {
      "amount": 85330,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-09T08:00:00",
      "description": "to collect and analyze existing data from multiple data streams from Asian and African sites to characterize early burden of rotavirus disease, which is less-well characterized than...",
      "grantee": "Emory University",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Emory University",
      "topics": [
        "Enteric Diseases and Diarrhea",
        "Vaccine Delivery",
        "Vaccine Development"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1163272",
      "year": "2016"
    },
    {
      "amount": 13000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-08T08:00:00",
      "description": "to support LearnLaunch Across Boundaries Conference",
      "grantee": "LearnLaunch Institute",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "LearnLaunch Institute",
      "topics": [
        "K-12",
        "K-12 Education"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169222",
      "year": "2016"
    },
    {
      "amount": 250000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-08T08:00:00",
      "description": "to improve outcomes for English Language Learners in Seattle and South King County",
      "grantee": "OneAmerica",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "OneAmerica",
      "topics": [
        "Community Grants"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1164859",
      "year": "2016"
    },
    {
      "amount": 85000,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-08T08:00:00",
      "description": "to fund cholera / enteric researchers (travel costs) to attend the 51st US-Japan Cholera Conference that they would otherwise not be able to afford to contribute to.",
      "grantee": "International Vaccine Institute",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "International Vaccine Institute",
      "topics": [
        "Enteric Diseases and Diarrhea"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1168711",
      "year": "2016"
    },
    {
      "amount": 6000,
      "categories": [
        "Special Projects"
      ],
      "date": "2016-12-07T08:00:00",
      "description": "to provide for general operating support",
      "grantee": "Center for US Global Leadership",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Center for US Global Leadership",
      "topics": [
        "Community Grants"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1167614",
      "year": "2016"
    },
    {
      "amount": 3000000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-07T08:00:00",
      "description": "to support the Center on Education and the Workforce's research and policy agenda to better align postsecondary education and the workforce, with an emphasis on inequalities in the...",
      "grantee": "Georgetown University",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Georgetown University",
      "topics": [
        "Postsecondary Success"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1165028",
      "year": "2016"
    }
  ],
  "facets": [
    {
      "field": "gfocategories",
      "items": [
        {
          "name": "US Program",
          "count": 5859
        },
        {
          "name": "Global Development",
          "count": 4441
        },
        {
          "name": "Global Health",
          "count": 3719
        },
        {
          "name": "Communications",
          "count": 1149
        },
        {
          "name": "Global Policy & Advocacy",
          "count": 879
        },
        {
          "name": "Special Projects",
          "count": 465
        }
      ]
    },
    {
      "field": "gfotopics",
      "items": [
        {
          "name": "Community Grants",
          "count": 2393
        },
        {
          "name": "K-12 Education",
          "count": 2007
        },
        {
          "name": "Global Policy & Advocacy",
          "count": 1507
        },
        {
          "name": "Communications",
          "count": 1246
        },
        {
          "name": "Discovery and Translational Sciences",
          "count": 1227
        },
        {
          "name": "Agricultural Development",
          "count": 866
        },
        {
          "name": "K-12",
          "count": 862
        },
        {
          "name": "HIV",
          "count": 690
        },
        {
          "name": "Global Libraries",
          "count": 671
        },
        {
          "name": "Vaccine Delivery",
          "count": 655
        },
        {
          "name": "Postsecondary Success",
          "count": 645
        },
        {
          "name": "Family Health: Family Planning",
          "count": 625
        },
        {
          "name": "Family Health: Nutrition",
          "count": 530
        },
        {
          "name": "Family Health: Maternal, Newborn, and Child Health",
          "count": 433
        },
        {
          "name": "Community Relations",
          "count": 420
        },
        {
          "name": "Vaccine Development",
          "count": 393
        },
        {
          "name": "Not Available",
          "count": 383
        },
        {
          "name": "Malaria",
          "count": 377
        },
        {
          "name": "Water, Sanitation, and Hygiene",
          "count": 374
        },
        {
          "name": "Emergency Response",
          "count": 368
        },
        {
          "name": "Enteric Diseases and Diarrhea",
          "count": 359
        },
        {
          "name": "Family Interest Grants",
          "count": 313
        },
        {
          "name": "Pneumonia",
          "count": 286
        },
        {
          "name": "Nutrition",
          "count": 284
        },
        {
          "name": "Financial Services for the Poor",
          "count": 277
        },
        {
          "name": "Tuberculosis",
          "count": 277
        },
        {
          "name": "Libraries",
          "count": 262
        },
        {
          "name": "Charitable Sector Support",
          "count": 224
        },
        {
          "name": "Pacific Northwest: Family Homelessness",
          "count": 223
        },
        {
          "name": "College Ready",
          "count": 205
        },
        {
          "name": "Research & Development",
          "count": 195
        },
        {
          "name": "Polio",
          "count": 188
        },
        {
          "name": "Pacific Northwest: Early Learning",
          "count": 182
        },
        {
          "name": "Integrated Delivery",
          "count": 172
        },
        {
          "name": "Table Sponsorships",
          "count": 164
        },
        {
          "name": "Integrated Development",
          "count": 119
        },
        {
          "name": "Strategic Partnerships",
          "count": 117
        },
        {
          "name": "India",
          "count": 116
        },
        {
          "name": "Neglected Tropical Diseases",
          "count": 115
        },
        {
          "name": "Africa",
          "count": 89
        },
        {
          "name": "Special Initiatives (Active projects are now part of other strategies)",
          "count": 67
        },
        {
          "name": "Neglected and Infectious Diseases",
          "count": 66
        },
        {
          "name": "China",
          "count": 43
        },
        {
          "name": "Scholarships",
          "count": 39
        },
        {
          "name": "Tobacco",
          "count": 33
        },
        {
          "name": "Europe",
          "count": 22
        },
        {
          "name": "Special Initiatives",
          "count": 22
        },
        {
          "name": "Philanthropic Partnerships",
          "count": 17
        },
        {
          "name": "Europe Office",
          "count": 4
        }
      ]
    },
    {
      "field": "gfoyear",
      "items": [
        {
          "name": "2009 and earlier",
          "count": 6608
        },
        {
          "name": "2015",
          "count": 1652
        },
        {
          "name": "2016",
          "count": 1546
        },
        {
          "name": "2013",
          "count": 1473
        },
        {
          "name": "2014",
          "count": 1472
        },
        {
          "name": "2012",
          "count": 1260
        },
        {
          "name": "2011",
          "count": 1240
        },
        {
          "name": "2010",
          "count": 921
        },
        {
          "name": "2017",
          "count": 3
        }
      ]
    },
    {
      "field": "gforegions",
      "items": [
        {
          "name": "North America",
          "count": 5817
        },
        {
          "name": "Sub-Saharan Africa",
          "count": 1546
        },
        {
          "name": "Asia",
          "count": 1192
        },
        {
          "name": "Middle East, North Africa, and Greater Arabia",
          "count": 223
        },
        {
          "name": "South America",
          "count": 152
        },
        {
          "name": "Europe",
          "count": 130
        },
        {
          "name": "Central America and the Caribbean",
          "count": 110
        },
        {
          "name": "Australia and Oceania",
          "count": 29
        }
      ]
    }
  ],
  "totalCount": 16175
}

With the built-in json module, you can easily extract the info you need.
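
As a rough sketch of that idea in Python 3, using only the standard library: the endpoint, headers, and request body are copied from the capture above, while the function names (build_payload, fetch_page, extract_rows, page_count) are made up for this example. It assumes the service still answers the way it did in the capture and does not require the cookies; neither assumption has been verified.

```python
import json
import math
import urllib.request

# Endpoint taken from the captured POST request.
SEARCH_URL = "http://www.gatesfoundation.org/services/gfo/search.ashx"


def build_payload(page, per_page=12):
    """Return the request body for one results page, mirroring the capture."""
    return {
        "freeTextQuery": "",
        "fieldQueries": '(@gfomediatype=="Grant")',
        "facetsToRender": ["gfocategories", "gfotopics", "gfoyear", "gforegions"],
        "page": str(page),
        "resultsPerPage": str(per_page),
        "sortBy": "gfodate",
        "sortDirection": "desc",
    }


def fetch_page(page):
    """POST the payload and return the parsed JSON (needs network access)."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(build_payload(page)).encode("utf-8"),  # data => POST
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_rows(data):
    """Turn a parsed response into (grantee, date, amount, topics) tuples."""
    return [
        (item["grantee"], item["date"][:10], item["amount"], ", ".join(item["topics"]))
        for item in data["results"]
    ]


def page_count(data, per_page=12):
    """Number of pages needed to cover totalCount results."""
    return math.ceil(data["totalCount"] / per_page)
```

Something like extract_rows(fetch_page(2)) should then give the rows of the second page, and page_count tells you how many pages to loop over (1348 for the totalCount of 16175 shown above, at 12 results per page).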

You can get it with dryscrape like so:

import dryscrape
from bs4 import BeautifulSoup

ses = dryscrape.Session()
ses.visit("http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database#q/page=2")
# ses.body() is the HTML after the page's JavaScript has run
s = BeautifulSoup(ses.body(), "html.parser")
s2 = s.select("table.table.push-bottom")[0]
print(s2)

You won't be able to use BeautifulSoup4 on its own, because the page is rendered through JavaScript.

You can use either dryscrape or selenium. Dryscrape is more user-friendly in my opinion, but it is not officially supported on Windows.

Also, check out avis' excellent answer regarding this:

https://stackoverflow.com/a/26440563/1429776

This page is rendered by JavaScript. requests and urllib cannot run the JS; they only get the HTML code. And as you can see, there is no table.

Disable JS in the browser

Use selenium, or mimic the requests this page makes.

