
Extract Table Information from HTML (As Text File)

I am trying to extract information from a table in an HTML file. I want to process the file as local text because I can only access it through a VPN, so I have downloaded all the HTML files I need.

Specifically, I want to get the information from several tables that share the same table class, but when I try to extract it, nothing is returned. I have attached the code I was using, which has not been successful.

Below is also the HTML I have been trying to get the information from. It is quite big, so I hope that is not a problem.

Table Information

 <table class="region-table"> <thead> <tr> <th>Region</th> <th>Type</th> <th>From</th> <th>To</th> <th colspan="2">Most similar known cluster</th> <th>Similarity</th> </tr> </thead> <tbody> <tr class="linked-row odd" data-anchor="#r1c1"> <td class="regbutton NRPS-like r1c1"> <a href="#r1c1">Region&nbsp;1.1</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a> </td> <td class="digits">21,469</td> <td class="digits table-split-left">62,957</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001740/1" target="_blank">phthoxazolin</a></td> <td>NRP + Polyketide</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 4%, #ffffff00 4%)">4%</td> </tr> <tr class="linked-row even" data-anchor="#r1c2"> <td class="regbutton NRPS r1c2"> <a href="#r1c2">Region&nbsp;1.2</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a> </td> <td class="digits">74,163</td> <td class="digits table-split-left">124,963</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001709/1" target="_blank">nystatin</a></td> <td>Polyketide</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 10%, #ffffff00 10%)">10%</td> </tr> </tbody> </table> <table class="region-table"> <thead> <tr> <th>Region</th> <th>Type</th> <th>From</th> <th>To</th> <th colspan="2">Most similar known cluster</th> <th>Similarity</th> </tr> </thead> <tbody> <tr class="linked-row odd" data-anchor="#r2c1"> <td class="regbutton terpene r2c1"> <a href="#r2c1">Region&nbsp;2.1</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a> </td> <td 
class="digits">3,800</td> <td class="digits table-split-left">23,263</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001580/1" target="_blank">ebelactone</a></td> <td>Polyketide</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 5%, #ffffff00 5%)">5%</td> </tr> <tr class="linked-row even" data-anchor="#r2c2"> <td class="regbutton NRPS-like r2c2"> <a href="#r2c2">Region&nbsp;2.2</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps-like" target="_blank">NRPS-like</a> </td> <td class="digits">55,320</td> <td class="digits table-split-left">97,088</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000727/1" target="_blank">indigoidine</a></td> <td>Saccharide</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 17%, #ffffff00 17%)">17%</td> </tr> <tr class="linked-row odd" data-anchor="#r2c3"> <td class="regbutton NRPS r2c3"> <a href="#r2c3">Region&nbsp;2.3</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a> </td> <td class="digits">144,740</td> <td class="digits table-split-left">193,599</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0000368/1" target="_blank">streptobactin</a></td> <td>NRP</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(210, 105, 30, 0.3), rgba(210, 105, 30, 0.3) 70%, #ffffff00 70%)">70%</td> </tr> <tr class="linked-row even" data-anchor="#r2c4"> <td class="regbutton siderophore r2c4"> <a href="#r2c4">Region&nbsp;2.4</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#siderophore" target="_blank">siderophore</a> </td> <td 
class="digits">347,862</td> <td class="digits table-split-left">362,833</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001593/1" target="_blank">ficellomycin</a></td> <td>NRP</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 3%, #ffffff00 3%)">3%</td> </tr> <tr class="linked-row odd" data-anchor="#r2c5"> <td class="regbutton lassopeptide r2c5"> <a href="#r2c5">Region&nbsp;2.5</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#lassopeptide" target="_blank">lassopeptide</a> </td> <td class="digits">548,017</td> <td class="digits table-split-left">570,561</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001435/1" target="_blank">ikarugamycin</a></td> <td>NRP + Polyketide:Iterative type I</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td> </tr> <tr class="linked-row even" data-anchor="#r2c6"> <td class="regbutton NRPS r2c6"> <a href="#r2c6">Region&nbsp;2.6</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a> </td> <td class="digits">628,834</td> <td class="digits table-split-left">683,050</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0001117/1" target="_blank">himastatin</a></td> <td>NRP</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 12%, #ffffff00 12%)">12%</td> </tr> <tr class="linked-row odd" data-anchor="#r2c7"> <td class="regbutton NRPS,terpene hybrid r2c7"> <a href="#r2c7">Region&nbsp;2.7</a> </td> <td> <a class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#nrps" target="_blank">NRPS</a>,<a 
class="external-link" href="https://docs.antismash.secondarymetabolites.org/glossary/#terpene" target="_blank">terpene</a> </td> <td class="digits">1,043,511</td> <td class="digits table-split-left">1,104,786</td> <td><a class="external-link" href="https://mibig.secondarymetabolites.org/go/BGC0002024/1" target="_blank">nargenicin</a></td> <td>Polyketide</td> <td class="digits similarity-text" style="background-image: linear-gradient(to left, rgba(205, 92, 92, 0.3), rgba(205, 92, 92, 0.3) 11%, #ffffff00 11%)">11%</td> </tr> </tbody> </table>

Code Snippet

soup = BeautifulSoup(html, "lxml")
gdp_table = soup.find("table", attrs={"class": "region-table"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows
# Get all the headings of Lists
print ("Extracted {num} Region-Tables".format(num=len(gdp_table_data)))
print(gdp_table_data[0]) #print first table
print(gdp_table_data[1]) #print second table

Ideally, I would like to read in the HTML file, extract the information from all the tables, merge it into one big table, and write it out, possibly as CSV.
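For reference, the goal described above can be sketched with BeautifulSoup itself. In BeautifulSoup, `find` returns only the first matching element, while `find_all` returns every match, so looping over `find_all` reaches all the `region-table` tables. This is a minimal sketch, not a definitive implementation: the helper names `extract_region_rows` and `write_rows` and the trimmed `sample` markup are assumptions made for illustration, and the built-in `html.parser` backend is used instead of `lxml` so that nothing beyond `bs4` needs to be installed.

```python
import csv
from bs4 import BeautifulSoup


def extract_region_rows(html):
    """Return one list of cell texts per body row, across all region tables."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # find_all returns every matching table; find would stop at the first one.
    for table in soup.find_all("table", class_="region-table"):
        for tr in table.select("tbody tr"):
            # &nbsp; is parsed as the non-breaking space \xa0, so replace it
            # with a plain space to get "Region 1.1" rather than "Region\xa01.1".
            rows.append([td.get_text(strip=True).replace("\xa0", " ")
                         for td in tr.find_all("td")])
    return rows


def write_rows(rows, path):
    """Write the merged rows to a single CSV file (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Trimmed stand-in for the downloaded file: two tables, one row each.
sample = """
<table class="region-table"><tbody>
<tr><td><a href="#r1c1">Region&nbsp;1.1</a></td><td class="digits">21,469</td></tr>
</tbody></table>
<table class="region-table"><tbody>
<tr><td><a href="#r2c1">Region&nbsp;2.1</a></td><td class="digits">3,800</td></tr>
</tbody></table>
"""
rows = extract_region_rows(sample)
```

With a real file, the HTML would be read with `open(path, encoding="utf-8").read()` and passed to `extract_region_rows`, then handed to `write_rows`.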

Read the HTML data from the file and export it to a separate CSV file.

import csv
from simplified_scrapy import SimplifiedDoc, utils

name = 'test.html'
html = utils.getFileContent(name)  # Read the downloaded HTML file
doc = SimplifiedDoc(html)
rows = []
# Select every table with class "region-table", not just the first one
tables = doc.selects('table.region-table')
for table in tables:
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
# newline='' keeps the csv module from writing blank lines on Windows
with open(name + '.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(rows)

If you want to keep one file per table:

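# Reuses csv, name and html from the previous snippet
doc = SimplifiedDoc(html)
tables = doc.selects('table.region-table')
# Number the output files from 1: test.html1.csv, test.html2.csv, ...
for i, table in enumerate(tables, start=1):
    rows = []
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
    with open(name + str(i) + '.csv', 'w', newline='', encoding='utf-8') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(rows)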
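The snippets above read only the `tbody` rows, so the CSV files have no header line. One possible way to recover the column names is sketched below, using BeautifulSoup rather than simplified_scrapy. The helper name `header_cells` is hypothetical, and repeating the label of a `colspan="2"` cell so that the header has as many columns as the data rows is an assumption about how the header should be expanded.

```python
from bs4 import BeautifulSoup


def header_cells(table):
    """Return the header labels of one table, expanding colspan cells."""
    cells = []
    for th in table.select("thead th"):
        text = th.get_text(strip=True)
        # A header cell spanning N columns is repeated N times so the header
        # row lines up with the <td> cells in the body rows.
        span = int(th.get("colspan", 1))
        cells.extend([text] * span)
    return cells


# Trimmed stand-in for the header of one region table.
sample = """
<table class="region-table"><thead><tr>
<th>Region</th><th colspan="2">Most similar known cluster</th><th>Similarity</th>
</tr></thead></table>
"""
table = BeautifulSoup(sample, "html.parser").find("table")
header = header_cells(table)
```

The resulting list can be written with `csv_writer.writerow(header)` before the data rows.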

The original version is kept below for comparison.

import csv
from simplified_scrapy import SimplifiedDoc

html = ''''''  # Your HTML
doc = SimplifiedDoc(html)
rows = []
tables = doc.selects('table.region-table')
for table in tables:
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
# If every row contains '>Region.*?</a>', you can get all the rows directly in the following way
# trs = doc.getElementsByReg('>Region.*?</a>',tag='tr')
# for tr in trs:
#     rows.append([td.text for td in tr.tds])
# newline='' keeps the csv module from writing blank lines on Windows
with open('test.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(rows)

Result:

Region 1.1,NRPS-like,"21,469","62,957",phthoxazolin,NRP + Polyketide,4%
Region 1.2,NRPS,"74,163","124,963",nystatin,Polyketide,10%
Region 2.1,terpene,"3,800","23,263",ebelactone,Polyketide,5%
Region 2.2,NRPS-like,"55,320","97,088",indigoidine,Saccharide,17%
Region 2.3,NRPS,"144,740","193,599",streptobactin,NRP,70%
Region 2.4,siderophore,"347,862","362,833",ficellomycin,NRP,3%
Region 2.5,lassopeptide,"548,017","570,561",ikarugamycin,NRP + Polyketide:Iterative type I,12%
Region 2.6,NRPS,"628,834","683,050",himastatin,NRP,12%
Region 2.7,"NRPS,terpene","1,043,511","1,104,786",nargenicin,Polyketide,11%
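In the result above, the From and To values and the `NRPS,terpene` type are quoted because they contain commas. When the CSV is read back, the `csv` module removes the quotes, but the coordinates are still strings with thousands separators. A minimal sketch of converting them to numbers follows; the dictionary keys and the `parse_region_row` helper are hypothetical names chosen for illustration, and `csv_text` stands in for the file on disk.

```python
import csv
import io

# Two lines copied from the result above, standing in for the CSV file.
csv_text = '''Region 1.1,NRPS-like,"21,469","62,957",phthoxazolin,NRP + Polyketide,4%
Region 2.7,"NRPS,terpene","1,043,511","1,104,786",nargenicin,Polyketide,11%
'''


def parse_region_row(row):
    """Turn one seven-column CSV row into a dict with numeric fields."""
    region, kind, start, end, cluster, cluster_type, similarity = row
    return {
        "region": region,
        "type": kind,
        # Drop the thousands separators before converting to int.
        "from": int(start.replace(",", "")),
        "to": int(end.replace(",", "")),
        "cluster": cluster,
        "cluster_type": cluster_type,
        # "4%" becomes 4.
        "similarity_pct": int(similarity.rstrip("%")),
    }


records = [parse_region_row(row) for row in csv.reader(io.StringIO(csv_text))]
```

For a real file, `io.StringIO(csv_text)` would be replaced by a file opened with `newline=''`.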
