Python: extract information from HTML

I have an HTML page like the one below; it is essentially the infobox on the right-hand side of the Wikipedia article about Microsoft:

<tbody>
<tr>
    <td class="logo" colspan="2" style="text-align:center">
        <a class="image" href="/wiki/File:Microsoft_logo_(2012).svg" title="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right."><img alt="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right." data-file-height="109" data-file-width="512" decoding="async" height="47" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/220px-Microsoft_logo_%282012%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/330px-Microsoft_logo_%282012%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/440px-Microsoft_logo_%282012%29.svg.png 2x" width="220" /></a>
        <div>Microsoft's logo since 2012</div>
    </td>
</tr>
<tr>
    <td class="logo" colspan="2" style="text-align:center">
        <a class="image" href="/wiki/File:Building92microsoft.jpg"><img alt="Building92microsoft.jpg" data-file-height="3456" data-file-width="5184" decoding="async" height="147" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/220px-Building92microsoft.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/330px-Building92microsoft.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/440px-Building92microsoft.jpg 2x" width="220" /></a>
        <div>Building 92 on the <a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">Microsoft Redmond campus</a> in <a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond, Washington</a></div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/List_of_legal_entity_types_by_country" title="List of legal entity types by country">Type</a></div>
    </th>
    <td class="category" style="line-height:1.35em;"><a href="/wiki/Public_company" title="Public company">Public</a></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/Ticker_symbol" title="Ticker symbol">Traded as</a></th>
    <td style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/NASDAQ" title="NASDAQ">NASDAQ</a>: <a class="external text" href="https://www.nasdaq.com/symbol/msft" rel="nofollow">MSFT</a></li>
                <li><a href="/wiki/NASDAQ-100" title="NASDAQ-100">NASDAQ-100</a> component</li>
                <li><a href="/wiki/Dow_Jones_Industrial_Average" title="Dow Jones Industrial Average">DJIA</a> component</li>
                <li><a href="/wiki/S%26P_100" title="S&amp;P 100">S&amp;P 100</a> component</li>
                <li><a class="mw-redirect" href="/wiki/S%26P_500" title="S&amp;P 500">S&amp;P 500</a> component</li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/International_Securities_Identification_Number" title="International Securities Identification Number">ISIN</a></th>
    <td style="line-height:1.35em;"><span class="plainlinks nourlexpansion"><a class="external text" href="https://tools.wmflabs.org/isin/?language=de&amp;isin=US5949181045">US5949181045</a></span></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Industry</th>
    <td class="category" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a class="mw-redirect" href="/wiki/Computer_software" title="Computer software">Computer software</a></li>
                <li><a href="/wiki/Computer_hardware" title="Computer hardware">Computer hardware</a></li>
                <li><a href="/wiki/Consumer_electronics" title="Consumer electronics">Consumer electronics</a></li>
                <li><a href="/wiki/Social_networking_service" title="Social networking service">Social networking service</a></li>
                <li><a href="/wiki/Cloud_computing" title="Cloud computing">Cloud computing</a></li>
                <li><a href="/wiki/Video_game_industry" title="Video game industry">Video games</a></li>
                <li><a href="/wiki/Internet" title="Internet">Internet</a></li>
                <li><a href="/wiki/Corporate_venture_capital" title="Corporate venture capital">Corporate venture capital</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Founded</th>
    <td style="line-height:1.35em;">April 4, 1975<span class="noprint">; 44 years ago</span><span style="display:none"> (<span class="bday dtstart published updated">1975-04-04</span>)</span> in <a href="/wiki/Albuquerque,_New_Mexico" title="Albuquerque, New Mexico">Albuquerque, New Mexico</a>, U.S.</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Founders</th>
    <td class="agent" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/Bill_Gates" title="Bill Gates">Bill Gates</a></li>
                <li><a href="/wiki/Paul_Allen" title="Paul Allen">Paul Allen</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Headquarters</th>
    <td class="label" style="line-height:1.35em;"><a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">One Microsoft Way</a>,
        <div class="locality" style="display:inline"><a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond</a>, <a href="/wiki/Washington_(state)" title="Washington (state)">Washington</a></div>,
        <div class="country-name" style="display:inline">U.S.</div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Area served</div>
    </th>
    <td style="line-height:1.35em;">Worldwide</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Key people</div>
    </th>
    <td class="agent" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/John_W._Thompson" title="John W. Thompson">John W. Thompson</a>
                    <br/>(<a class="mw-redirect" href="/wiki/Chairman" title="Chairman">Chairman</a>)</li>
                <li><a href="/wiki/Satya_Nadella" title="Satya Nadella">Satya Nadella</a>
                    <br/>(<a href="/wiki/Chief_executive_officer" title="Chief executive officer">CEO</a>)</li>
                <li><a href="/wiki/Brad_Smith_(American_lawyer)" title="Brad Smith (American lawyer)">Brad Smith</a>
                    <br/>(<a href="/wiki/President_(corporate_title)" title="President (corporate title)">President</a>)</li>
                <li>Bill Gates
                    <br/>(<a href="/wiki/Technical_advisor" title="Technical advisor">Technical Advisor</a>)</li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Products</th>
    <td style="line-height:1.35em;">
        <div class="hlist">
            <ul>
                <li><a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Windows</a></li>
                <li><a href="/wiki/Microsoft_Office" title="Microsoft Office">Office</a></li>
                <li><a href="/wiki/Microsoft_Servers" title="Microsoft Servers">Servers</a></li>
                <li><a href="/wiki/Skype" title="Skype">Skype</a></li>
                <li><a href="/wiki/Microsoft_Visual_Studio" title="Microsoft Visual Studio">Visual Studio</a></li>
                <li><a href="/wiki/Microsoft_Dynamics" title="Microsoft Dynamics">Dynamics</a></li>
                <li><a href="/wiki/Xbox" title="Xbox">Xbox</a></li>
                <li><a href="/wiki/Microsoft_Surface" title="Microsoft Surface">Surface</a></li>
                <li><a href="/wiki/Microsoft_Mobile" title="Microsoft Mobile">Mobile</a></li>
                <li><a href="/wiki/List_of_Microsoft_software" title="List of Microsoft software">List of software</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Services</th>
    <td class="category" style="line-height:1.35em;">
        <div class="hlist">
            <ul>
                <li><a href="/wiki/Microsoft_Azure" title="Microsoft Azure">Azure</a></li>
                <li><a href="/wiki/Bing_(search_engine)" title="Bing (search engine)">Bing</a></li>
                <li><a href="/wiki/LinkedIn" title="LinkedIn">LinkedIn</a></li>
                <li><a href="/wiki/Microsoft_Developer_Network" title="Microsoft Developer Network">MSDN</a></li>
                <li><a href="/wiki/Office_365" title="Office 365">Office 365</a></li>
                <li><a href="/wiki/OneDrive" title="OneDrive">OneDrive</a></li>
                <li><a href="/wiki/Outlook.com" title="Outlook.com">Outlook.com</a></li>
                <li><a href="/wiki/Microsoft_TechNet" title="Microsoft TechNet">TechNet</a></li>
                <li><a href="/wiki/Microsoft_Pay" title="Microsoft Pay">Pay</a></li>
                <li><a href="/wiki/Microsoft_Store_(digital)" title="Microsoft Store (digital)">Microsoft Store</a></li>
                <li><a href="/wiki/Windows_Update" title="Windows Update">Windows Update</a></li>
                <li><a href="/wiki/Xbox_Live" title="Xbox Live">Xbox Live</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Revenue</th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap"><a href="/wiki/United_States_dollar" title="United States dollar">US$</a>125.8 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-0"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Earnings_before_interest_and_taxes" title="Earnings before interest and taxes">Operating income</a></div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$43.0 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-1"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Net_income" title="Net income">Net income</a></div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$39.2 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-2"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Asset" title="Asset">Total assets</a></span></th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$286.55 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-3"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Equity_(finance)" title="Equity (finance)">Total equity</a></span></th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$102.33 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-4"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> 144,106<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/Subsidiary" title="Subsidiary">Subsidiaries</a></th>
    <td style="line-height:1.35em;"><a href="/wiki/List_of_mergers_and_acquisitions_by_Microsoft" title="List of mergers and acquisitions by Microsoft">List of Microsoft assets</a></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Website</th>
    <td style="line-height:1.35em;"><span class="url"><a class="external text" href="https://www.microsoft.com/" rel="nofollow">microsoft.com</a></span></td>
</tr>
</tbody>

How can I make a table like this out of this HTML code?

I tried pandas read_html, but it failed. Then I tried BeautifulSoup, but the markup has many tags, and other Wikipedia pages use tags that do not appear on the Microsoft page. Basically, I want to extract the innermost text of the tags. How can I do this in Python while allowing for potentially many more different tags?

Code:

It uses BeautifulSoup to find the first table, then the th and td in every row.

Some td cells contain li items, which need a nested loop.

# https://2.python-requests.org/en/master/
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

import requests
from bs4 import BeautifulSoup as BS

url = 'https://en.wikipedia.org/wiki/Microsoft'

r = requests.get(url)

soup = BS(r.text, 'html.parser')

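# This assumes the infobox is the first table on the page.
all_tables = soup.find_all('table')

all_rows = all_tables[0].find_all('tr')
for row in all_rows:

    # Skip rows that lack a <th> (logo, photo) or a <td>: they hold no label/value pair.
    th = row.find('th')
    td = row.find('td')
    if not th or not td:
        continue

    title = th.text

    # A cell with a list gives one line per <li>; any other cell is printed whole.
    all_li = td.find_all('li')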

    if all_li:
        for item in all_li:
            print(title, '>', item.get_text())
    else:
        print(title, '>', td.get_text())

Result:

Type > Public
Traded as > NASDAQ: MSFT
Traded as > NASDAQ-100 component
Traded as > DJIA component
Traded as > S&P 100 component
Traded as > S&P 500 component
ISIN > US5949181045
Industry > Computer software
Industry > Computer hardware
Industry > Consumer electronics
Industry > Social networking service
Industry > Cloud computing
Industry > Video games
Industry > Internet
Industry > Corporate venture capital
Founded > April 4, 1975; 44 years ago (1975-04-04) in Albuquerque, New Mexico, U.S.
Founders > Bill Gates
Founders > Paul Allen
Headquarters > One Microsoft Way, Redmond, Washington, U.S.
Area served > Worldwide
Key people > John W. Thompson(Chairman)
Key people > Satya Nadella(CEO)
Key people > Brad Smith(President)
Key people > Bill Gates(Technical Advisor)
Products > Windows
Products > Office
Products > Servers
Products > Skype
Products > Visual Studio
Products > Dynamics
Products > Xbox
Products > Surface
Products > Mobile
Products > List of software
Services > Azure
Services > Bing
Services > LinkedIn
Services > MSDN
Services > Office 365
Services > OneDrive
Services > Outlook.com
Services > TechNet
Services > Pay
Services > Microsoft Store
Services > Windows Update
Services > Xbox Live
Revenue >  US$125.8 billion[1] (2019)
Operating income >  US$43.0 billion[1] (2019)
Net income >  US$39.2 billion[1] (2019)
Total assets >  US$286.55 billion[1] (2019)
Total equity >  US$102.33 billion[1] (2019)
Number of employees >  144,106[2] (2019)
Subsidiaries > List of Microsoft assets
Website > microsoft.com
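
The Key people lines above run the name and the role together because the `<br/>` between them contributes no text, and the Founded line includes a date that the page hides with `display:none`. The following is only a sketch of one way to handle both at the DOM level before calling `get_text()`. The helper name `visible_text` and the shortened `sample` markup are made up for illustration, and the check for inline `display:none` is an assumption that will not catch elements hidden through a stylesheet.

```python
from bs4 import BeautifulSoup


def visible_text(cell):
    """Return a cell's text without inline-hidden elements, treating <br> as a space."""
    # Remove elements hidden with an inline style (hypothetical rule, inline styles only).
    for hidden in cell.find_all(style=lambda s: s and 'display:none' in s.replace(' ', '')):
        hidden.decompose()
    # A <br> has no text of its own, so swap it for a space to keep words apart.
    for br in cell.find_all('br'):
        br.replace_with(' ')
    # Collapse runs of whitespace left over from the markup.
    return ' '.join(cell.get_text().split())


# Shortened copy of two rows from the question's HTML.
sample = (
    '<table><tr><th>Founded</th><td>April 4, 1975<span class="noprint">; 44 years ago</span>'
    '<span style="display:none"> (<span class="bday">1975-04-04</span>)</span> in '
    '<a href="/wiki/Albuquerque,_New_Mexico">Albuquerque, New Mexico</a>, U.S.</td></tr>'
    '<tr><th>Key people</th><td><ul><li><a href="/wiki/Satya_Nadella">Satya Nadella</a>'
    '<br/>(<a href="/wiki/Chief_executive_officer">CEO</a>)</li></ul></td></tr></table>'
)

soup = BeautifulSoup(sample, 'html.parser')
founded, key_people = soup.find_all('td')
print(visible_text(founded))     # April 4, 1975; 44 years ago in Albuquerque, New Mexico, U.S.
print(visible_text(key_people))  # Satya Nadella (CEO)
```

Because the helper edits the parsed tree in place, it should be called before any other code reads the same cell.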

Some lines still need individual cleaning. No single rule covers all of them, so each kind of line will need its own code.
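
As one example of such a rule, the financial lines carry citation markers like [1] and a leading space. A minimal sketch, assuming the values have already been extracted as plain strings; `clean_value` is a hypothetical helper name, and it only handles numeric markers such as [1], not forms like [a] or [note 1].

```python
import re


def clean_value(text):
    """Tidy one extracted value: drop [n] citation markers and collapse whitespace."""
    text = re.sub(r'\[\d+\]', '', text)  # remove markers such as [1] or [2]
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace, including non-breaking spaces
    return text.strip()


print(clean_value(' US$125.8 billion[1] (2019)'))  # US$125.8 billion (2019)
print(clean_value(' 144,106[2] (2019)'))           # 144,106 (2019)
```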

Here is another approach that gets the same results. A little cleaning is still needed, though.

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Microsoft"

res = requests.get(URL).text
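soup = BeautifulSoup(res, 'lxml')
for items in soup.find('table', class_='vcard').find_all('tr'):
    # Remove citation links such as [1] before reading the text.
    for cite in items.select("a[href^='#cite']"):
        cite.extract()
    data = items.find_all(['th', 'td'])
    # Skip rows that lack a header/value pair (for example the logo row).
    if len(data) < 2:
        continue
    title = data[0].text
    # Normalize whitespace inside each text fragment, then join the fragments.
    product = ' '.join([' '.join(item.split()) for item in data[1].strings]).strip()
    print("{} | {}".format(title, product))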

Output:

Type | Public
Traded as | NASDAQ : MSFT NASDAQ-100 component DJIA component S&P 100 component S&P 500 component
ISIN | US5949181045
Industry | Computer software Computer hardware Consumer electronics Social networking service Cloud computing Video games Internet Corporate venture capital
Founded | April 4, 1975 ; 44 years ago ( 1975-04-04 ) in Albuquerque, New Mexico , U.S.
Founders | Bill Gates  Paul Allen
Headquarters | One Microsoft Way , Redmond , Washington , U.S.
Area served | Worldwide
Key people | John W. Thompson ( Chairman )  Satya Nadella ( CEO )  Brad Smith ( President )  Bill Gates ( Technical Advisor )
Products | Windows  Office  Servers  Skype  Visual Studio  Dynamics  Xbox  Surface  Mobile  List of software
Services | Azure  Bing  LinkedIn  MSDN  Office 365  OneDrive  Outlook.com  TechNet  Pay  Microsoft Store  Windows Update  Xbox Live
Revenue | US$ 125.8 billion (2019)
Operating income | US$43.0 billion (2019)
Net income | US$39.2 billion (2019)
Total assets | US$286.55 billion (2019)
Total equity | US$102.33 billion (2019)
Number of employees | 144,106 (2019)
Subsidiaries | List of Microsoft assets
Website | microsoft.com
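
Both answers print their results rather than building the table the question asks for. The sketch below shows one way to go from (title, value) pairs to a two-column table in CSV form, using only the standard library. The target layout is not shown in the question text, so the `Field`/`Value` headers, the `'; '` separator, and the helper names `rows_to_table` and `table_to_csv` are all assumptions.

```python
import csv
import io


def rows_to_table(pairs):
    """Group (title, value) pairs by title, keeping the order in which titles first appear."""
    table = {}
    for title, value in pairs:
        table.setdefault(title, []).append(value)
    return table


def table_to_csv(table, sep='; '):
    """Render the grouped table as CSV text with one row per title."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['Field', 'Value'])  # hypothetical column headers
    for title, values in table.items():
        writer.writerow([title, sep.join(values)])
    return buffer.getvalue()


# A few pairs in the shape printed by the first answer.
pairs = [
    ('Type', 'Public'),
    ('Founders', 'Bill Gates'),
    ('Founders', 'Paul Allen'),
    ('Website', 'microsoft.com'),
]
print(table_to_csv(rows_to_table(pairs)))
```

In the first answer's loop, appending `(title, item.get_text())` to a list instead of printing would produce `pairs` in this shape.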

