Python: extract information from HTML

I have an HTML page like the one below; it is the infobox on the right-hand side of Wikipedia's Microsoft article:

<tbody>
<tr>
    <td class="logo" colspan="2" style="text-align:center">
        <a class="image" href="/wiki/File:Microsoft_logo_(2012).svg" title="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right."><img alt="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right." data-file-height="109" data-file-width="512" decoding="async" height="47" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/220px-Microsoft_logo_%282012%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/330px-Microsoft_logo_%282012%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/440px-Microsoft_logo_%282012%29.svg.png 2x" width="220" /></a>
        <div>Microsoft's logo since 2012</div>
    </td>
</tr>
<tr>
    <td class="logo" colspan="2" style="text-align:center">
        <a class="image" href="/wiki/File:Building92microsoft.jpg"><img alt="Building92microsoft.jpg" data-file-height="3456" data-file-width="5184" decoding="async" height="147" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/220px-Building92microsoft.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/330px-Building92microsoft.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/440px-Building92microsoft.jpg 2x" width="220" /></a>
        <div>Building 92 on the <a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">Microsoft Redmond campus</a> in <a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond, Washington</a></div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/List_of_legal_entity_types_by_country" title="List of legal entity types by country">Type</a></div>
    </th>
    <td class="category" style="line-height:1.35em;"><a href="/wiki/Public_company" title="Public company">Public</a></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/Ticker_symbol" title="Ticker symbol">Traded as</a></th>
    <td style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/NASDAQ" title="NASDAQ">NASDAQ</a>: <a class="external text" href="https://www.nasdaq.com/symbol/msft" rel="nofollow">MSFT</a></li>
                <li><a href="/wiki/NASDAQ-100" title="NASDAQ-100">NASDAQ-100</a> component</li>
                <li><a href="/wiki/Dow_Jones_Industrial_Average" title="Dow Jones Industrial Average">DJIA</a> component</li>
                <li><a href="/wiki/S%26P_100" title="S&amp;P 100">S&amp;P 100</a> component</li>
                <li><a class="mw-redirect" href="/wiki/S%26P_500" title="S&amp;P 500">S&amp;P 500</a> component</li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/International_Securities_Identification_Number" title="International Securities Identification Number">ISIN</a></th>
    <td style="line-height:1.35em;"><span class="plainlinks nourlexpansion"><a class="external text" href="https://tools.wmflabs.org/isin/?language=de&amp;isin=US5949181045">US5949181045</a></span></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Industry</th>
    <td class="category" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a class="mw-redirect" href="/wiki/Computer_software" title="Computer software">Computer software</a></li>
                <li><a href="/wiki/Computer_hardware" title="Computer hardware">Computer hardware</a></li>
                <li><a href="/wiki/Consumer_electronics" title="Consumer electronics">Consumer electronics</a></li>
                <li><a href="/wiki/Social_networking_service" title="Social networking service">Social networking service</a></li>
                <li><a href="/wiki/Cloud_computing" title="Cloud computing">Cloud computing</a></li>
                <li><a href="/wiki/Video_game_industry" title="Video game industry">Video games</a></li>
                <li><a href="/wiki/Internet" title="Internet">Internet</a></li>
                <li><a href="/wiki/Corporate_venture_capital" title="Corporate venture capital">Corporate venture capital</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Founded</th>
    <td style="line-height:1.35em;">April 4, 1975<span class="noprint">; 44 years ago</span><span style="display:none"> (<span class="bday dtstart published updated">1975-04-04</span>)</span> in <a href="/wiki/Albuquerque,_New_Mexico" title="Albuquerque, New Mexico">Albuquerque, New Mexico</a>, U.S.</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Founders</th>
    <td class="agent" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/Bill_Gates" title="Bill Gates">Bill Gates</a></li>
                <li><a href="/wiki/Paul_Allen" title="Paul Allen">Paul Allen</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Headquarters</th>
    <td class="label" style="line-height:1.35em;"><a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">One Microsoft Way</a>,
        <div class="locality" style="display:inline"><a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond</a>, <a href="/wiki/Washington_(state)" title="Washington (state)">Washington</a></div>,
        <div class="country-name" style="display:inline">U.S.</div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Area served</div>
    </th>
    <td style="line-height:1.35em;">Worldwide</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Key people</div>
    </th>
    <td class="agent" style="line-height:1.35em;">
        <div class="plainlist">
            <ul>
                <li><a href="/wiki/John_W._Thompson" title="John W. Thompson">John W. Thompson</a>
                    <br/>(<a class="mw-redirect" href="/wiki/Chairman" title="Chairman">Chairman</a>)</li>
                <li><a href="/wiki/Satya_Nadella" title="Satya Nadella">Satya Nadella</a>
                    <br/>(<a href="/wiki/Chief_executive_officer" title="Chief executive officer">CEO</a>)</li>
                <li><a href="/wiki/Brad_Smith_(American_lawyer)" title="Brad Smith (American lawyer)">Brad Smith</a>
                    <br/>(<a href="/wiki/President_(corporate_title)" title="President (corporate title)">President</a>)</li>
                <li>Bill Gates
                    <br/>(<a href="/wiki/Technical_advisor" title="Technical advisor">Technical Advisor</a>)</li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Products</th>
    <td style="line-height:1.35em;">
        <div class="hlist">
            <ul>
                <li><a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Windows</a></li>
                <li><a href="/wiki/Microsoft_Office" title="Microsoft Office">Office</a></li>
                <li><a href="/wiki/Microsoft_Servers" title="Microsoft Servers">Servers</a></li>
                <li><a href="/wiki/Skype" title="Skype">Skype</a></li>
                <li><a href="/wiki/Microsoft_Visual_Studio" title="Microsoft Visual Studio">Visual Studio</a></li>
                <li><a href="/wiki/Microsoft_Dynamics" title="Microsoft Dynamics">Dynamics</a></li>
                <li><a href="/wiki/Xbox" title="Xbox">Xbox</a></li>
                <li><a href="/wiki/Microsoft_Surface" title="Microsoft Surface">Surface</a></li>
                <li><a href="/wiki/Microsoft_Mobile" title="Microsoft Mobile">Mobile</a></li>
                <li><a href="/wiki/List_of_Microsoft_software" title="List of Microsoft software">List of software</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Services</th>
    <td class="category" style="line-height:1.35em;">
        <div class="hlist">
            <ul>
                <li><a href="/wiki/Microsoft_Azure" title="Microsoft Azure">Azure</a></li>
                <li><a href="/wiki/Bing_(search_engine)" title="Bing (search engine)">Bing</a></li>
                <li><a href="/wiki/LinkedIn" title="LinkedIn">LinkedIn</a></li>
                <li><a href="/wiki/Microsoft_Developer_Network" title="Microsoft Developer Network">MSDN</a></li>
                <li><a href="/wiki/Office_365" title="Office 365">Office 365</a></li>
                <li><a href="/wiki/OneDrive" title="OneDrive">OneDrive</a></li>
                <li><a href="/wiki/Outlook.com" title="Outlook.com">Outlook.com</a></li>
                <li><a href="/wiki/Microsoft_TechNet" title="Microsoft TechNet">TechNet</a></li>
                <li><a href="/wiki/Microsoft_Pay" title="Microsoft Pay">Pay</a></li>
                <li><a href="/wiki/Microsoft_Store_(digital)" title="Microsoft Store (digital)">Microsoft Store</a></li>
                <li><a href="/wiki/Windows_Update" title="Windows Update">Windows Update</a></li>
                <li><a href="/wiki/Xbox_Live" title="Xbox Live">Xbox Live</a></li>
            </ul>
        </div>
    </td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Revenue</th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap"><a href="/wiki/United_States_dollar" title="United States dollar">US$</a>125.8 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-0"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Earnings_before_interest_and_taxes" title="Earnings before interest and taxes">Operating income</a></div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$43.0 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-1"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Net_income" title="Net income">Net income</a></div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$39.2 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-2"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Asset" title="Asset">Total assets</a></span></th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$286.55 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-3"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Equity_(finance)" title="Equity (finance)">Total equity</a></span></th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$102.33 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-4"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">
        <div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
    </th>
    <td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> 144,106<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> (2019)</td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;"><a href="/wiki/Subsidiary" title="Subsidiary">Subsidiaries</a></th>
    <td style="line-height:1.35em;"><a href="/wiki/List_of_mergers_and_acquisitions_by_Microsoft" title="List of mergers and acquisitions by Microsoft">List of Microsoft assets</a></td>
</tr>
<tr>
    <th scope="row" style="padding-right:0.5em;">Website</th>
    <td style="line-height:1.35em;"><span class="url"><a class="external text" href="https://www.microsoft.com/" rel="nofollow">microsoft.com</a></span></td>
</tr>
</tbody>

How can I build a table from this HTML?

I tried pandas read_html, but it failed. Then I tried BeautifulSoup, but the markup has many tags, and other Wikipedia pages use tags that do not appear on the Microsoft page. Basically, I want to extract the innermost text of each tag. How can I do this in Python while allowing for many more kinds of tags?

Code:

It uses BeautifulSoup to find the first table, then the th and td in every row.

Some td cells contain li elements, which need an inner loop.

# https://2.python-requests.org/en/master/
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

import requests
from bs4 import BeautifulSoup as BS

url = 'https://en.wikipedia.org/wiki/Microsoft'

r = requests.get(url)

soup = BS(r.text, 'html.parser')

all_tables = soup.find_all('table')

all_rows = all_tables[0].find_all('tr')
for row in all_rows:

    th = row.find('th')
    if not th:
        continue

    title = th.text

    td = row.find('td')
    if not td:
        continue  # row has a header but no value cell
    all_li = td.find_all('li')

    if all_li:
        for item in all_li:
            print(title, '>', item.get_text())
    else:
        print(title, '>', td.get_text())

Result:

Type > Public
Traded as > NASDAQ: MSFT
Traded as > NASDAQ-100 component
Traded as > DJIA component
Traded as > S&P 100 component
Traded as > S&P 500 component
ISIN > US5949181045
Industry > Computer software
Industry > Computer hardware
Industry > Consumer electronics
Industry > Social networking service
Industry > Cloud computing
Industry > Video games
Industry > Internet
Industry > Corporate venture capital
Founded > April 4, 1975; 44 years ago (1975-04-04) in Albuquerque, New Mexico, U.S.
Founders > Bill Gates
Founders > Paul Allen
Headquarters > One Microsoft Way, Redmond, Washington, U.S.
Area served > Worldwide
Key people > John W. Thompson(Chairman)
Key people > Satya Nadella(CEO)
Key people > Brad Smith(President)
Key people > Bill Gates(Technical Advisor)
Products > Windows
Products > Office
Products > Servers
Products > Skype
Products > Visual Studio
Products > Dynamics
Products > Xbox
Products > Surface
Products > Mobile
Products > List of software
Services > Azure
Services > Bing
Services > LinkedIn
Services > MSDN
Services > Office 365
Services > OneDrive
Services > Outlook.com
Services > TechNet
Services > Pay
Services > Microsoft Store
Services > Windows Update
Services > Xbox Live
Revenue >  US$125.8 billion[1] (2019)
Operating income >  US$43.0 billion[1] (2019)
Net income >  US$39.2 billion[1] (2019)
Total assets >  US$286.55 billion[1] (2019)
Total equity >  US$102.33 billion[1] (2019)
Number of employees >  144,106[2] (2019)
Subsidiaries > List of Microsoft assets
Website > microsoft.com

Some lines still need individual cleaning. No single rule covers all of them, so each case needs its own code.
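As a rough illustration of that per-line cleanup, the sketch below applies a few regular expressions to values like the ones printed above. The helper name `clean_value` and its three rules are assumptions picked to suit this sample output; other pages will likely need different rules.

```python
import re

def clean_value(text):
    """Tidy one scraped infobox value (hypothetical helper)."""
    # Remove footnote markers such as [1] or [2].
    text = re.sub(r'\[\d+\]', '', text)
    # Add a space before a parenthesis glued to a word: "Nadella(CEO)".
    text = re.sub(r'(?<=\w)\(', ' (', text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print(clean_value(' US$125.8 billion[1] (2019)'))
print(clean_value('Satya Nadella(CEO)'))
```

The function could be called on `item.get_text()` and `td.get_text()` before printing.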

Here is another approach that gets the same results. It still needs a little cleaning, though.

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Microsoft"

res = requests.get(URL).text
soup = BeautifulSoup(res,'lxml')
for items in soup.find('table', class_='vcard').find_all('tr'):
    # Drop footnote links such as [1] before reading the text.
    for cite in items.select("a[href^='#cite']"):
        cite.extract()
    data = items.find_all(['th', 'td'])
    # Skip rows without both a label cell and a value cell.
    if len(data) < 2:
        continue
    title = data[0].text
    product = ' '.join(' '.join(item.split()) for item in data[1].strings).strip()
    print("{} | {}".format(title, product))

Output:

Type | Public
Traded as | NASDAQ : MSFT NASDAQ-100 component DJIA component S&P 100 component S&P 500 component
ISIN | US5949181045
Industry | Computer software Computer hardware Consumer electronics Social networking service Cloud computing Video games Internet Corporate venture capital
Founded | April 4, 1975 ; 44 years ago ( 1975-04-04 ) in Albuquerque, New Mexico , U.S.
Founders | Bill Gates  Paul Allen
Headquarters | One Microsoft Way , Redmond , Washington , U.S.
Area served | Worldwide
Key people | John W. Thompson ( Chairman )  Satya Nadella ( CEO )  Brad Smith ( President )  Bill Gates ( Technical Advisor )
Products | Windows  Office  Servers  Skype  Visual Studio  Dynamics  Xbox  Surface  Mobile  List of software
Services | Azure  Bing  LinkedIn  MSDN  Office 365  OneDrive  Outlook.com  TechNet  Pay  Microsoft Store  Windows Update  Xbox Live
Revenue | US$ 125.8 billion (2019)
Operating income | US$43.0 billion (2019)
Net income | US$39.2 billion (2019)
Total assets | US$286.55 billion (2019)
Total equity | US$102.33 billion (2019)
Number of employees | 144,106 (2019)
Subsidiaries | List of Microsoft assets
Website | microsoft.com
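
To get from printed pairs to an actual table, one option is to collect the (title, value) pairs in a list instead of printing them, then write them out as CSV. The sketch below assumes such a list named `pairs`. The `pairs_to_csv` helper, the column names and the "; " separator are illustrative choices, not part of either answer.

```python
import csv
import io
from collections import OrderedDict

def pairs_to_csv(pairs):
    """Group (title, value) pairs by title and return CSV text (hypothetical helper)."""
    grouped = OrderedDict()
    for title, value in pairs:
        grouped.setdefault(title, []).append(value)

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['Field', 'Value'])
    for title, values in grouped.items():
        # Several values under one title (list items) are joined with "; ".
        writer.writerow([title, '; '.join(values)])
    return buffer.getvalue()

# Sample pairs in the shape the scraping loops produce.
pairs = [
    ('Type', 'Public'),
    ('Founders', 'Bill Gates'),
    ('Founders', 'Paul Allen'),
    ('Area served', 'Worldwide'),
]
print(pairs_to_csv(pairs))
```

Grouping by title keeps one row per infobox field, which matches the layout of the Wikipedia box more closely than one row per list item.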
