
Scraping a table using BeautifulSoup

I have a question that I suspect is fairly straightforward. I have the following type of page, from which I want to collect the information in the last table (if you scroll all the way down, it is the one in the box labelled "Procedure"):

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-2&language=EN

The html for the table I want to scrape looks like this:

<tbody><tr class="doc_title">
<td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="left" valign="top"><img src="/img/struct/functional/arrow_title_doc.gif" alt="" align="absmiddle" border="0" height="14" width="8"> <span style="font-weight: bold;">PROCEDURE</span></td><td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="right" valign="top">
<table border="0" cellpadding="3" cellspacing="0" width="50">
<tbody><tr><td align="center"><a href="#top"><img src="/img/struct/functional/top_doc.gif" alt="" border="0" height="16" width="16"></a></td><td align="center"><img src="/img/struct/navigation/spacer.gif" alt="" border="0" height="10" width="15"></td><td align="center"><a href="#title2"><img src="/img/struct/functional/sort_up.gif" alt="" border="0" height="10" width="15"></a></td></tr></tbody></table></td></tr>

<tr class="contents" valign="top"><td colspan="2">
<p></p><table style="border-collapse: collapse; width: 481.85pt;" align="center" cellspacing="0">
<tbody><tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Title</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Mutual assistance for the recovery of claims relating to taxes, duties and other measures</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">References</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style=""><a href="http://ec.europa.eu/prelex/liste_resultats.cfm?CL=en&amp;ReqId=0&amp;DocType=COM&amp;DocYear=2009&amp;DocNum=0028">COM(2009)0028</a> – C6-0061/2009 – <a href="/oeil/FindByProcnum.do?lang=en&amp;procnum=CNS/2009/0007">2009/0007(CNS)</a></p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date of consulting Parliament</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">16.2.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee responsible</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">ECON</p>

<p style="">19.10.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee(s) asked for opinion(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Not delivering opinions</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date of decision</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">1.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">5.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Rapporteur(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date appointed</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="3">
<p style="">Theodor Dumitru Stolojan</p>

<p style="">21.7.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Discussed in committee</span></p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">10.11.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">1.12.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">21.1.2010</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date adopted</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">27.1.2010</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Result of final vote</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 12.94%;" rowspan="1" colspan="1">
<p style="">+:</p>

<p style="">–:</p>

<p style="">0:</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 48.82%;" rowspan="1" colspan="6">
<p style="">39</p>

<p style="">0</p>

<p style="">1</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Members present for the final vote</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Burkhard Balz, Sharon Bowles, Udo Bullmann, Pascal Canfin, Nikolaos Chountis, George Sabin Cutaş, Leonardo Domenici, Derk Jan Eppink, Markus Ferber, Elisa Ferreira, Vicky Ford, José Manuel García-Margallo y Marfil, Jean-Paul Gauzès, Sylvie Goulard, Enikő Győri, Liem Hoang Ngoc, Eva Joly, Othmar Karas, Wolf Klinz, Jürgen Klute, Werner Langen, Astrid Lulling, Arlene McCarthy, Ivari Padar, Alfredo Pallone, Anni Podimata, Antolín Sánchez Presedo, Olle Schmidt, Edward Scicluna, Peter Simon, Peter Skinner, Theodor Dumitru Stolojan, Ivo Strejček, Kay Swinburne, Marianne Thyssen, Ramon Tremosa i Balcells</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Substitute(s) present for the final vote</span></p>
</td>
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Marta Andreasen, Sophie Briard Auconie, David Casa, Danuta Jazłowiecka, Arturs Krišjānis Kariņš, Philippe Lamberts, Andreas Schwab</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 38.24%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 12.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 2.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 4.71%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10.58%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 5.29%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 15.3%;" rowspan="1" colspan="1"></td>
<td style="" rowspan="1" colspan="1"></td></tr>
</tbody></table>
</td></tr>
</tbody>

The problem I am facing is that the tags for the tables do not have identifiers (as far as I can tell), so I don't know how to select this one table and scrape the information from it. I have been using BeautifulSoup so far to get other information from the website, but I am at a loss for how to scrape this one table.

If anyone can show me how to proceed I would be most grateful!

With kind regards,

Thomas

You can find elements by attributes other than an id if you're a bit clever. I took a shot at scraping your data; it probably isn't the best approach, but it gets you close.

The first thing I noticed was that the data you want comes after the second appearance of the word "PROCEDURE" (the first is the link, the second is the header). So I split on that:

data = html.split("PROCEDURE", 2)[2]
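To make the indexing concrete, here is a minimal sketch of how that split behaves. The string below is a made-up stand-in for the page source, not the real page; only the call to split mirrors the line above.

```python
# Toy stand-in for the page source (an assumption; the real page is far longer).
html = '<a href="#p">PROCEDURE</a> intro <span>PROCEDURE</span><table>rows</table>'

# maxsplit=2 means at most two splits, so at most three pieces come back.
parts = html.split("PROCEDURE", 2)

# Index 2 is everything after the second occurrence of the word.
data = parts[2]
print(data)  # </span><table>rows</table>
```

If the word appears fewer than two times, parts has fewer than three items and indexing with [2] raises an IndexError, so it may be worth checking len(parts) first.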

Then I looked for <td> tags with rowspan=1:

import BeautifulSoup  # BeautifulSoup 3 (Python 2)
bs = BeautifulSoup.BeautifulSoup(data)  # parse only the trimmed fragment
tds = bs.findAll("td", {"rowspan": 1})

Getting closer...

>>> tds[0].text
u'Title'
>>> tds[1].text
u'Mutual assistance for the recovery of claims relating to taxes, duties and other measures'
>>> tds[3].text
u'References'
>>> tds[4].text
u'COM(2009)00282009/0007(CNS)2009 a>'

Note that I skipped index 2 in tds, since each row ends with an empty spacer cell. Anyway, that's a start. The real trick I found with BeautifulSoup was to feed it only the data in the area you know you're looking for, because then there's less to skim through. It prides itself on accepting bad-looking input, too, so don't be afraid to feed it garbage.
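Another way to narrow things down, without splitting the raw string, is to walk the parsed tree from the header to the table. This is only a sketch under a few assumptions: it uses the newer bs4 package rather than the BeautifulSoup 3 module used above, it assumes the header is a <span> whose only text is "PROCEDURE" (as in the snippet in the question), and the function name find_procedure_table is made up for illustration.

```python
from bs4 import BeautifulSoup


def find_procedure_table(page_html):
    """Return the table nested in the row that follows the PROCEDURE header, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Assumption: the header cell holds a <span> whose sole text is "PROCEDURE".
    header = soup.find("span", string="PROCEDURE")
    if header is None:
        return None
    # Header row -> next sibling row ("contents") -> first table inside it.
    header_row = header.find_parent("tr")
    if header_row is None:
        return None
    contents_row = header_row.find_next_sibling("tr")
    if contents_row is None:
        return None
    return contents_row.find("table")


# Cut-down sample with the same shape as the question's markup.
sample_page = """<table>
<tr class="doc_title"><td><span>PROCEDURE</span></td><td><table><tr><td>nav</td></tr></table></td></tr>
<tr class="contents"><td colspan="2"><table id="inner"><tr><td>Title</td><td>Mutual assistance</td></tr></table></td></tr>
</table>"""

table = find_procedure_table(sample_page)
print(table.get("id"))  # inner
```

The id attribute in the sample exists only to make the result easy to check; the real page has no such identifier, which is why the lookup goes through the header text.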

I went a bit further in the list of elements, and it isn't perfect. You'll need to refine the search, since they have <td> elements within <td>s for the values.
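One possible refinement is to go row by row instead of working with a flat list of cells. The sketch below is not tested against the live page. It assumes the bs4 package, that only the markup of the inner table is passed in, and that each row's label is the first <span> in its first cell (as in the snippet in the question). The name parse_procedure_rows is hypothetical.

```python
from bs4 import BeautifulSoup


def parse_procedure_rows(table_html):
    """Map each row label to the non-empty strings found in its remaining cells."""
    soup = BeautifulSoup(table_html, "html.parser")
    result = {}
    for row in soup.find_all("tr"):
        # recursive=False keeps cells of any nested table out of this row.
        cells = row.find_all("td", recursive=False)
        if not cells:
            continue
        label_tag = cells[0].find("span")
        if label_tag is None:
            continue  # spacer row without a label
        label = label_tag.get_text(strip=True)
        values = []
        for cell in cells[1:]:
            # stripped_strings skips whitespace-only text, including &nbsp; padding.
            values.extend(cell.stripped_strings)
        result[label] = values
    return result


sample_table = """<table>
<tr><td><p><span>Title</span></p></td><td colspan="7"><p>Mutual assistance</p></td><td></td></tr>
<tr><td><p><span>Committee responsible</span></p><p>&nbsp;Date announced in plenary</p></td>
<td colspan="7"><p>ECON</p><p>19.10.2009</p></td><td></td></tr>
<tr><td></td><td></td></tr>
</table>"""

rows = parse_procedure_rows(sample_table)
print(rows["Committee responsible"])  # ['ECON', '19.10.2009']
```

Sub-labels such as "Date announced in plenary" are dropped here because only the <span> is read as the label; pairing them with the second value in each cell would need extra handling.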
