
Parsing specific data using Beautiful Soup

I have a webpage that contains tabular data. The following is the HTML code for the table:

    <table class="confluenceTable">
    <tbody>
       <tr>
          <th class="confluenceTh">
             <p>Prefix</p>
          </th>
          <th class="confluenceTh">
             <p>Group</p>
          </th>
          <th class="confluenceTh">
             <p>Contact</p>
          </th>
          <th class="confluenceTh">
             <p>Dev/Test Lab</p>
          </th>
          <th class="confluenceTh">
             <p>Performance</p>
          </th>
       </tr>
       <tr>
          <td class="confluenceTd">
             <p> </p>
          </td>
          <td class="confluenceTd">
             <p> </p>
          </td>
          <td class="confluenceTd">
             <p> </p>
          </td>
       </tr>
       <tr>
          <th class="confluenceTh">
             <p> </p>
          </th>
          <th class="confluenceTh">
             <p> </p>
          </th>
          <th class="confluenceTh">
             <p> </p>
          </th>
       </tr>
       <tr>
          <td class="confluenceTd">
             <p>SEF00</p>
          </td>
          <td class="confluenceTd">
             <p>APTRA Vision</p>
          </td>
          <td class="confluenceTd">
             <p> </p>
          </td>
          <td class="confluenceTd">
             <p><a href="/somepage">VCD Lab</a> , <a href="/somepage">Test Lab</a></p>
          </td>
          <td class="confluenceTd">
             <p><a href="/display">Perf Lab</a></p>
          </td>
       </tr>
       <tr>
          <td class="confluenceTd">
             <p>SEF01</p>
          </td>
          <td class="confluenceTd">
             <p>In-Person Bill Payment</p>
          </td>
          <td class="confluenceTd">
             <p>Swamy PKV</p>
          </td>

How can I write my Python code so that I get only the data under the Prefix and Group columns? So far I have tried this:

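    import requests
    from bs4 import BeautifulSoup

    # url, username and password are defined elsewhere
    datatocheck = []
    ii = 1
    data = requests.get(url, auth=(username, password))
    sample = data.content
    soup = BeautifulSoup(sample, 'html.parser')
    for row in soup.find_all('tr')[1:154]:
        datatocheck.append(row.get_text(separator='\t'))
    while ii <= 152:
        print(datatocheck[ii][0:30])
        ii += 1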

This gives me the following output:

    SEF00   APTRA Vision        VCD Lab
    SEF01   In-Person Bill Payment  S

But I want only SEF00 (prefix) and APTRA Vision (group), then SEF01 and In-Person Bill Payment, and so on. I don't want the other columns.

Also, I can't change the HTML code.

How about adding a check such as if 'SEF00' in datatocheck[ii]: before printing? That might print just the SEF00 row.

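    from bs4 import BeautifulSoup

    # html is the page source containing the table shown in the question
    soup = BeautifulSoup(html, 'lxml')

    for row in soup.find_all('tr')[3:]:   # skip the header row and the two empty rows
        tds = [i.get_text(strip=True) for i in row.find_all('td')]
        print(tds[0], tds[1])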

Output:

    SEF00 APTRA Vision
    SEF01 In-Person Bill Payment

Just get all the td elements in each row, put their text in a list, then index or slice the list to keep the columns you want.
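If the number of header and spacer rows is not known in advance, a variation on the same idea is to filter rows by their content instead of slicing at a fixed offset. The following is only a sketch under a few assumptions: SAMPLE_HTML is a trimmed copy of the table from the question, prefix_and_group is a hypothetical helper name, and a row counts as data only when both of its first two td cells contain text. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumption: the real page
# has the same row layout, just with more rows and columns).
SAMPLE_HTML = """
<table class="confluenceTable"><tbody>
<tr><th class="confluenceTh"><p>Prefix</p></th><th class="confluenceTh"><p>Group</p></th><th class="confluenceTh"><p>Contact</p></th></tr>
<tr><td class="confluenceTd"><p> </p></td><td class="confluenceTd"><p> </p></td><td class="confluenceTd"><p> </p></td></tr>
<tr><th class="confluenceTh"><p> </p></th><th class="confluenceTh"><p> </p></th><th class="confluenceTh"><p> </p></th></tr>
<tr><td class="confluenceTd"><p>SEF00</p></td><td class="confluenceTd"><p>APTRA Vision</p></td><td class="confluenceTd"><p> </p></td></tr>
<tr><td class="confluenceTd"><p>SEF01</p></td><td class="confluenceTd"><p>In-Person Bill Payment</p></td><td class="confluenceTd"><p>Swamy PKV</p></td></tr>
</tbody></table>
"""


def prefix_and_group(html):
    """Return (prefix, group) pairs from rows whose first two <td> cells have text.

    Hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Header rows contain only <th>, so `cells` is empty for them;
        # spacer rows have <td> cells whose text is empty after stripping.
        if len(cells) >= 2 and cells[0] and cells[1]:
            pairs.append((cells[0], cells[1]))
    return pairs


for prefix, group in prefix_and_group(SAMPLE_HTML):
    print(prefix, group)
```

Note that this filter also drops any row that has a prefix but an empty group cell. If such rows should be kept, relax the condition to check only cells[0].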
