Beautiful Soup web page scraper

I am trying to scrape a web page at the following URL: https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00

I want to scrape the table with the following HTML and write it to a CSV file. I have tried a few things but have not been able to extract the table in the form I need. The <tr> tags for the data rows are not closed, so splitting the data into separate rows is the problem.

Thanks for the help --J

<table border='0' width='900' align='center' cellspacing='1' cellpadding='4'>
                <tr>
                    <td class='innertable_header1' rowspan='3'>Category of shareholder</td>
                    <td class='innertable_header1' rowspan='3'>Nos. of shareholders</td>
                    <td class='innertable_header1' rowspan='3'>No. of fully paid up equity shares held</td>
                    <td class='innertable_header1' rowspan='3'>No. of shares underlying Depository Receipts</td>
                    <td class='innertable_header1' rowspan='3'>Total nos. shares held</td>
                    <td class='innertable_header1' rowspan='3'>Shareholding as a % of total no. of shares (calculated as per SCRR, 1957)As a % of (A+B+C2)</td>
                    <td class='innertable_header1' rowspan='3'> Number of equity shares held in dematerialized form</td>
                </tr>
                <tr></tr>
                <tr></tr>
                <tr>
                    <td class='TTRow_left'>(A) Promoter & Promoter Group</td>
                    <td class='TTRow_right'>19</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'></td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'>12.90</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <tr>
                        <td class='TTRow_left'>(B) Public</td>
                        <td class='TTRow_right'>9,16,058</td>
                        <td class='TTRow_right'>1,87,81,45,362</td>
                        <td class='TTRow_right'>1,32,95,642</td>
                        <td class='TTRow_right'>1,89,14,41,004</td>
                        <td class='TTRow_right'>86.61</td>
                        <td class='TTRow_right'>1,88,74,40,959</td>
                        <tr>
                            <td class='TTRow_left'>(C1) Shares underlying DRs</td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'>0.00</td>
                            <td class='TTRow_right'></td>
                            <tr>
                                <td class='TTRow_left'>(C2) Shares held by Employee Trust</td>
                                <td class='TTRow_right'>1</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'></td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'>0.49</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <tr>
                                    <td class='TTRow_left'>(C) Non Promoter-Non Public</td>
                                    <td class='TTRow_right'>1</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'></td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'>0.49</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <tr>
                                        <td class='TTRow_left'>Grand Total</td>
                                        <td class='TTRow_right'>9,16,078</td>
                                        <td class='TTRow_right'>2,17,06,54,147</td>
                                        <td class='TTRow_right'>1,32,95,642</td>
                                        <td class='TTRow_right'>2,18,39,49,789</td>
                                        <td class='TTRow_right'>100.00</td>
                                        <td class='TTRow_right'>2,17,99,49,744</td>
                                    </tr>
            </table>

You can try this:

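import re
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00'
s = soup(urlopen(url).read(), 'lxml')
# Select every data cell by class, strip line breaks and runs of whitespace, and drop empty strings
cells = s.find_all('td', {'class': re.compile('TTRow_right|TTRow_left')})
results = list(filter(None, [re.sub(r'[\n\r]+|\s{2,}', '', i.text) for i in cells]))
print(results)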

Output (as printed by Python 2, hence the u'' prefixes):

[u'(A) Promoter & Promoter Group', u'19', u'28,17,02,889', u'28,17,02,889', u'12.90', u'28,17,02,889', u'(B) Public', u'9,16,058', u'1,87,81,45,362', u'1,32,95,642', u'1,89,14,41,004', u'86.61', u'1,88,74,40,959', u'(C1) Shares underlying DRs', u'0.00', u'(C2) Shares held by Employee Trust', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'(C) Non Promoter-Non Public', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'Grand Total', u'9,16,078', u'2,17,06,54,147', u'1,32,95,642', u'2,18,39,49,789', u'100.00', u'2,17,99,49,744']
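The flat list above leaves out the empty cells, so it cannot simply be cut into groups of seven to rebuild the rows. Since every data row in the question's HTML begins with a TTRow_left cell, one possible approach is to start a new row at each such cell and keep the empty cells in place. The following is only a minimal sketch under that assumption: SAMPLE_HTML is a shortened stand-in for the real page, table_rows and rows_to_csv are hypothetical helper names, and the built-in html.parser is used in place of lxml so that no extra parser is required.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened stand-in for the page's table; the <tr> rows are left
# unclosed on purpose, mirroring the markup in the question.
SAMPLE_HTML = """
<table>
<tr>
<td class='TTRow_left'>(B) Public</td>
<td class='TTRow_right'>9,16,058</td>
<td class='TTRow_right'>1,87,81,45,362</td>
<td class='TTRow_right'>1,32,95,642</td>
<td class='TTRow_right'>1,89,14,41,004</td>
<td class='TTRow_right'>86.61</td>
<td class='TTRow_right'>1,88,74,40,959</td>
<tr>
<td class='TTRow_left'>(C1) Shares underlying DRs</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>0.00</td>
<td class='TTRow_right'></td>
</tr>
</table>
"""


def table_rows(html):
    """Group data cells into rows, opening a new row at each TTRow_left cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # find_all walks the cells in document order, however the <tr> tags nest.
    for td in soup.find_all("td"):
        classes = td.get("class") or []
        if "TTRow_left" in classes:
            rows.append([])  # a left-aligned label cell marks the start of a row
        elif "TTRow_right" not in classes or not rows:
            continue  # skip header cells and anything before the first row
        # Keep empty cells so every value stays in its own column.
        rows[-1].append(td.get_text(strip=True))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; values containing commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

To write a file instead of a string, the same rows can be passed to csv.writer on a file opened with newline=''. The figures use Indian digit grouping (for example 1,87,81,45,362), so they are kept as text here and rely on the csv module's quoting rather than being converted to numbers.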
