
Python extract information from html

I have an HTML page like this. It is basically the right-hand infobox from the Wikipedia page for Microsoft:

<tbody>
<tr>
    <td class="logo" colspan="2" style="text-align:center">
        <a class="image" href="/wiki/File:Microsoft_logo_(2012).svg" title="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right."><img alt="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right." data-file-height="109" data-file-width="512" decoding="async" height="47" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/220px-Microsoft_logo_%282012%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/330px-Microsoft_logo_%282012%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/440px-Microsoft_logo_%282012%29.svg.png 2x" width="220" /></a>
        <div>Microsoft's logo since 2012</div>
    </td>
</tr>
<tr>
    <td class="logo" colspan="2" style="text-align:center">
        <a class="image" href="/wiki/File:Building92microsoft.jpg"><img alt="Building92microsoft.jpg" data-file-height="3456" data-file-width="5184" decoding="async" height="147" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/220px-Building92microsoft.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/330px-Building92microsoft.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/440px-Building92microsoft.jpg 2x" width="220" /></a>
        <div>Building 92 on the <a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">Microsoft Redmond campus</a> in <a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond, Washington</a></div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/List_of_legal_entity_types_by_country" title="List of legal entity types by country">Type</a></div>
    </th>
    <td class="category" style="line-height:1.35em;"><a href="/wiki/Public_company" title="Public company">Public</a></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/Ticker_symbol" title="Ticker symbol">Traded as</a></th>
    <td style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/NASDAQ" title="NASDAQ">NASDAQ</a>: <a class="external text" href="https://www.nasdaq.com/symbol/msft" rel="nofollow">MSFT</a></li>
                <li><a href="/wiki/NASDAQ-100" title="NASDAQ-100">NASDAQ-100</a> component</li>
                <li><a href="/wiki/Dow_Jones_Industrial_Average" title="Dow Jones Industrial Average">DJIA</a> component</li>
                <li><a href="/wiki/S%26P_100" title="S&amp;P 100">S&amp;P 100</a> component</li>
                <li><a class="mw-redirect" href="/wiki/S%26P_500" title="S&amp;P 500">S&amp;P 500</a> component</li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/International_Securities_Identification_Number" title="International Securities Identification Number">ISIN</a></th>
    <td style="line-height:1.35em;"><span class="plainlinks nourlexpansion"><a class="external text" href="https://tools.wmflabs.org/isin/?language=de&amp;isin=US5949181045">US5949181045</a></span></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Industry</th>
    <td class="category" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a class="mw-redirect" href="/wiki/Computer_software" title="Computer software">Computer software</a></li>
                <li><a href="/wiki/Computer_hardware" title="Computer hardware">Computer hardware</a></li>
                <li><a href="/wiki/Consumer_electronics" title="Consumer electronics">Consumer electronics</a></li>
                <li><a href="/wiki/Social_networking_service" title="Social networking service">Social networking service</a></li>
                <li><a href="/wiki/Cloud_computing" title="Cloud computing">Cloud computing</a></li>
                <li><a href="/wiki/Video_game_industry" title="Video game industry">Video games</a></li>
                <li><a href="/wiki/Internet" title="Internet">Internet</a></li>
                <li><a href="/wiki/Corporate_venture_capital" title="Corporate venture capital">Corporate venture capital</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Founded</th>
    <td style="line-height:1.35em;">April 4, 1975<span class="noprint">; 44 years ago</span><span style="display:none"> (<span class="bday dtstart published updated">1975-04-04</span>)</span> in <a href="/wiki/Albuquerque,_New_Mexico" title="Albuquerque, New Mexico">Albuquerque, New Mexico</a>, U.S.</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Founders</th>
    <td class="agent" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/Bill_Gates" title="Bill Gates">Bill Gates</a></li>
                <li><a href="/wiki/Paul_Allen" title="Paul Allen">Paul Allen</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Headquarters</th>
    <td class="label" style="line-height:1.35em;"><a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">One Microsoft Way</a>,
        <div class="locality" style="display:inline"><a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond</a>, <a href="/wiki/Washington_(state)" title="Washington (state)">Washington</a></div>,
        <div class="country-name" style="display:inline">U.S.</div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Area served</div>
    </th>
    <td style="line-height:1.35em;">Worldwide</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Key people</div>
    </th>
    <td class="agent" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/John_W._Thompson" title="John W. Thompson">John W. Thompson</a>
                    <br/>(<a class="mw-redirect" href="/wiki/Chairman" title="Chairman">Chairman</a>)</li>
                <li><a href="/wiki/Satya_Nadella" title="Satya Nadella">Satya Nadella</a>
                    <br/>(<a href="/wiki/Chief_executive_officer" title="Chief executive officer">CEO</a>)</li>
                <li><a href="/wiki/Brad_Smith_(American_lawyer)" title="Brad Smith (American lawyer)">Brad Smith</a>
                    <br/>(<a href="/wiki/President_(corporate_title)" title="President (corporate title)">President</a>)</li>
                <li>Bill Gates
                    <br/>(<a href="/wiki/Technical_advisor" title="Technical advisor">Technical Advisor</a>)</li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Products</th>
    <td style="line-height:1.35em;">
        <div class="hlist">
            <ul>
                <li><a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Windows</a></li>
                <li><a href="/wiki/Microsoft_Office" title="Microsoft Office">Office</a></li>
                <li><a href="/wiki/Microsoft_Servers" title="Microsoft Servers">Servers</a></li>
                <li><a href="/wiki/Skype" title="Skype">Skype</a></li>
                <li><a href="/wiki/Microsoft_Visual_Studio" title="Microsoft Visual Studio">Visual Studio</a></li>
                <li><a href="/wiki/Microsoft_Dynamics" title="Microsoft Dynamics">Dynamics</a></li>
                <li><a href="/wiki/Xbox" title="Xbox">Xbox</a></li>
                <li><a href="/wiki/Microsoft_Surface" title="Microsoft Surface">Surface</a></li>
                <li><a href="/wiki/Microsoft_Mobile" title="Microsoft Mobile">Mobile</a></li>
                <li><a href="/wiki/List_of_Microsoft_software" title="List of Microsoft software">List of software</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Services</th>
    <td class="category" style="line-height:1.35em;">
        <div class="hlist">
            <ul>
                <li><a href="/wiki/Microsoft_Azure" title="Microsoft Azure">Azure</a></li>
                <li><a href="/wiki/Bing_(search_engine)" title="Bing (search engine)">Bing</a></li>
                <li><a href="/wiki/LinkedIn" title="LinkedIn">LinkedIn</a></li>
                <li><a href="/wiki/Microsoft_Developer_Network" title="Microsoft Developer Network">MSDN</a></li>
                <li><a href="/wiki/Office_365" title="Office 365">Office 365</a></li>
                <li><a href="/wiki/OneDrive" title="OneDrive">OneDrive</a></li>
                <li><a href="/wiki/Outlook.com" title="Outlook.com">Outlook.com</a></li>
                <li><a href="/wiki/Microsoft_TechNet" title="Microsoft TechNet">TechNet</a></li>
                <li><a href="/wiki/Microsoft_Pay" title="Microsoft Pay">Pay</a></li>
                <li><a href="/wiki/Microsoft_Store_(digital)" title="Microsoft Store (digital)">Microsoft Store</a></li>
                <li><a href="/wiki/Windows_Update" title="Windows Update">Windows Update</a></li>
                <li><a href="/wiki/Xbox_Live" title="Xbox Live">Xbox Live</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Revenue</th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap"><a href="/wiki/United_States_dollar" title="United States dollar">US$</a>125.8 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-0"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Earnings_before_interest_and_taxes" title="Earnings before interest and taxes">Operating income</a></div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$43.0 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-1"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Net_income" title="Net income">Net income</a></div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$39.2 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-2"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Asset" title="Asset">Total assets</a></span></th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$286.55 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-3"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Equity_(finance)" title="Equity (finance)">Total equity</a></span></th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$102.33 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-4"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> 144,106<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/Subsidiary" title="Subsidiary">Subsidiaries</a></th>
    <td style="line-height:1.35em;"><a href="/wiki/List_of_mergers_and_acquisitions_by_Microsoft" title="List of mergers and acquisitions by Microsoft">List of Microsoft assets</a></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Website</th>
    <td style="line-height:1.35em;"><span class="url"><a class="external text" href="https://www.microsoft.com/" rel="nofollow">microsoft.com</a></span></td>
</tr>
</tbody>

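How can I turn this HTML into a table?

I tried pandas read_html, but it failed. Then I used BeautifulSoup, but the markup has many tags, and other Wikipedia pages sometimes use different tags from those on the Microsoft page. Basically, I want to extract the inner text of the tags. How can I do this in Python while allowing for a potentially wider variety of tags?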

Code:

It uses BeautifulSoup to find the first table, then the th and td in each row.

Some td cells contain li items, which need an extra loop.

# https://2.python-requests.org/en/master/
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

import requests
from bs4 import BeautifulSoup as BS

url = 'https://en.wikipedia.org/wiki/Microsoft'

r = requests.get(url)

soup = BS(r.text, 'html.parser')

# the infobox is assumed to be the first table on the page
all_tables = soup.find_all('table')

all_rows = all_tables[0].find_all('tr')
for row in all_rows:

    # skip rows that lack a header cell or a data cell (e.g. the logo rows)
    th = row.find('th')
    td = row.find('td')
    if th is None or td is None:
        continue

    title = th.text

    # list-valued cells: print one line per <li>
    all_li = td.find_all('li')

    if all_li:
        for item in all_li:
            print(title, '>', item.get_text())
    else:
        print(title, '>', td.get_text())

Result:

Type > Public
Traded as > NASDAQ: MSFT
Traded as > NASDAQ-100 component
Traded as > DJIA component
Traded as > S&P 100 component
Traded as > S&P 500 component
ISIN > US5949181045
Industry > Computer software
Industry > Computer hardware
Industry > Consumer electronics
Industry > Social networking service
Industry > Cloud computing
Industry > Video games
Industry > Internet
Industry > Corporate venture capital
Founded > April 4, 1975; 44 years ago (1975-04-04) in Albuquerque, New Mexico, U.S.
Founders > Bill Gates
Founders > Paul Allen
Headquarters > One Microsoft Way, Redmond, Washington, U.S.
Area served > Worldwide
Key people > John W. Thompson(Chairman)
Key people > Satya Nadella(CEO)
Key people > Brad Smith(President)
Key people > Bill Gates(Technical Advisor)
Products > Windows
Products > Office
Products > Servers
Products > Skype
Products > Visual Studio
Products > Dynamics
Products > Xbox
Products > Surface
Products > Mobile
Products > List of software
Services > Azure
Services > Bing
Services > LinkedIn
Services > MSDN
Services > Office 365
Services > OneDrive
Services > Outlook.com
Services > TechNet
Services > Pay
Services > Microsoft Store
Services > Windows Update
Services > Xbox Live
Revenue >  US$125.8 billion[1] (2019)
Operating income >  US$43.0 billion[1] (2019)
Net income >  US$39.2 billion[1] (2019)
Total assets >  US$286.55 billion[1] (2019)
Total equity >  US$102.33 billion[1] (2019)
Number of employees >  144,106[2] (2019)
Subsidiaries > List of Microsoft assets
Website > microsoft.com

Some lines still need to be cleaned up separately. There is no single rule that covers all of them, so they will need their own code.
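As one example of such cleanup, here is a minimal sketch that strips the footnote markers (`[1]`, `[2]`) and the stray whitespace seen in the Revenue and Number of employees lines above. The helper name `clean_value` is hypothetical and not part of the original answer, and other rows may need different rules.

```python
import re

def clean_value(text):
    """Remove footnote markers such as [1] and collapse runs of whitespace."""
    text = re.sub(r'\[\d+\]', '', text)   # drop markers like [1], [2]
    text = re.sub(r'\s+', ' ', text)      # collapse spaces, tabs and newlines
    return text.strip()

print(clean_value(' US$125.8 billion[1] (2019)'))
```

It could be applied to `td.get_text()` or `item.get_text()` before printing.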

Here is another way to get the same result. It still needs some cleanup, though.

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Microsoft"

res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')
for items in soup.find('table', class_='vcard').find_all('tr'):
    # drop the citation links such as [1] before reading the text
    for cite in items.select("a[href^='#cite']"):
        cite.extract()
    data = items.find_all(['th', 'td'])
    # skip rows that lack a header/value pair (check before indexing)
    if len(data) < 2:
        continue
    title = data[0].text
    # normalise whitespace inside each text fragment, then join the fragments
    product = ' '.join(' '.join(item.split()) for item in data[1].strings).strip()
    print("{} | {}".format(title, product))

Output:

Type | Public
Traded as | NASDAQ : MSFT NASDAQ-100 component DJIA component S&P 100 component S&P 500 component
ISIN | US5949181045
Industry | Computer software Computer hardware Consumer electronics Social networking service Cloud computing Video games Internet Corporate venture capital
Founded | April 4, 1975 ; 44 years ago ( 1975-04-04 ) in Albuquerque, New Mexico , U.S.
Founders | Bill Gates  Paul Allen
Headquarters | One Microsoft Way , Redmond , Washington , U.S.
Area served | Worldwide
Key people | John W. Thompson ( Chairman )  Satya Nadella ( CEO )  Brad Smith ( President )  Bill Gates ( Technical Advisor )
Products | Windows  Office  Servers  Skype  Visual Studio  Dynamics  Xbox  Surface  Mobile  List of software
Services | Azure  Bing  LinkedIn  MSDN  Office 365  OneDrive  Outlook.com  TechNet  Pay  Microsoft Store  Windows Update  Xbox Live
Revenue | US$ 125.8 billion (2019)
Operating income | US$43.0 billion (2019)
Net income | US$39.2 billion (2019)
Total assets | US$286.55 billion (2019)
Total equity | US$102.33 billion (2019)
Number of employees | 144,106 (2019)
Subsidiaries | List of Microsoft assets
Website | microsoft.com
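Since the question asks for a table rather than printed lines, a possible next step is to collect the rows into a dictionary. The sketch below is only an illustration under assumptions: it parses a small inline sample shaped like the rows in the question instead of fetching the live page, and the names `SAMPLE` and `infobox_to_dict` are hypothetical. It combines the ideas from both answers (one entry per `li`, citation markers removed).

```python
from bs4 import BeautifulSoup

# Small inline sample shaped like the infobox rows in the question (assumption).
SAMPLE = """
<table class="infobox vcard">
<tr><td class="logo" colspan="2">logo</td></tr>
<tr><th scope="row">Founders</th>
    <td><ul><li><a href="/wiki/Bill_Gates">Bill Gates</a></li>
            <li><a href="/wiki/Paul_Allen">Paul Allen</a></li></ul></td></tr>
<tr><th scope="row">Revenue</th>
    <td><span>US$125.8 billion</span><sup class="reference"><a href="#cite_note-1">[1]</a></sup> (2019)</td></tr>
</table>
"""

def infobox_to_dict(html):
    """Map each row header to a list of cleaned text values."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    # remove footnote markers such as [1] entirely
    for sup in table.find_all('sup', class_='reference'):
        sup.decompose()
    result = {}
    for row in table.find_all('tr'):
        th = row.find('th')
        td = row.find('td')
        if th is None or td is None:
            continue  # e.g. logo rows without a header cell
        key = th.get_text(' ', strip=True)
        items = td.find_all('li')
        if items:
            result[key] = [li.get_text(' ', strip=True) for li in items]
        else:
            result[key] = [td.get_text(' ', strip=True)]
    return result

info = infobox_to_dict(SAMPLE)
for key, values in info.items():
    print(key, '|', '; '.join(values))
```

Passing a separator to `get_text` keeps adjacent fragments from running together, at the cost of occasionally adding a space before punctuation (as in the `NASDAQ : MSFT` line above), so some per-row cleanup may still be needed.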
