Beautiful Soup: parsing table columns and stripping newlines

Question

I am using the following code to loop over every row and column of an HTML table:

data = []
table = page.find('table', attrs={'class':'table table-no-border table-hover table-striped keyword_result_table'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

For a table cell like this one:

    <td class="keyword">
     <span class="is_in_saved_list" id="is_in_saved_list_81864060">
     </span>
     <a href="javascript:void(0);">
      <b>
       what
      </b>
      <b>
       is
      </b>
      <b>
       in
      </b>
      <b>
       house
      </b>
      <b>
       paint
      </b>
     </a>
    </td>

it gives me the following output:

['\n \n\n what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint', '5756', '979', '2', 'great', ' 89', '.com\n\n\n.net\n\n\n.org']

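In the console, and in the prompt screen here, there appear to be tab characters, but they do not show up in the post. I tried calling rstrip() after strip(), but nothing changed. Is there a way to grab only the text content attached to the link?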

Answer 1

You can use .stripped_strings to get the text fragments with the surrounding whitespace (spaces, tabs, newlines) removed.

Here is the code:

import bs4 as bs

s = """
 <td class="keyword">
     <span class="is_in_saved_list" id="is_in_saved_list_81864060">
     </span>
     <a href="javascript:void(0);">
      <b>
       what
      </b>
      <b>
       is
      </b>
      <b>
       in
      </b>
      <b>
       house
      </b>
      <b>
       paint
      </b>
     </a>
    </td>
    """
soup = bs.BeautifulSoup(s, 'lxml')
t = soup.find('td')
print(list(t.stripped_strings))

['what', 'is', 'in', 'house', 'paint']
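
To connect this back to the loop in the question, here is a minimal sketch that joins each cell's stripped fragments into one string per cell. The sample HTML is a shortened, hypothetical stand-in for the real table, and html.parser is used only so the example runs without lxml; neither comes from the original post.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the table in the question.
html = """
<table><tbody>
<tr>
  <td class="keyword">
    <span class="is_in_saved_list"></span>
    <a href="javascript:void(0);">
      <b>
        what
      </b>
      <b>
        is
      </b>
    </a>
  </td>
  <td> 5756 </td>
  <td></td>
</tr>
</tbody></table>
"""

# html.parser is chosen here only to avoid the extra lxml dependency.
page = BeautifulSoup(html, "html.parser")

data = []
for row in page.find("tbody").find_all("tr"):
    # Join each cell's stripped text fragments with a single space.
    cells = [" ".join(td.stripped_strings) for td in row.find_all("td")]
    data.append([cell for cell in cells if cell])  # drop empty cells

print(data)

# Only the text inside the link of the keyword cell.
link_text = page.find("td", class_="keyword").a.get_text(" ", strip=True)
print(link_text)
```

get_text(" ", strip=True) is roughly equivalent to joining .stripped_strings with a space, so either form should work for the keyword column.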

Answer 2

Have you tried removing the '\n' characters from the string?

s = 'what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint'
s.replace('\n', '')
'what  is  in  house  paint'
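
Note that removing only '\n' leaves the double spaces seen in the result. As a possible variation (not part of the original answer), splitting on whitespace and rejoining collapses every run of spaces, tabs, and newlines into a single space:

```python
s = 'what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint'

# split() with no arguments splits on any run of whitespace
# and discards empty pieces, so tabs and newlines are handled too.
normalized = " ".join(s.split())
print(normalized)  # what is in house paint
```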


Answer 1
1 2021-10-08 05:49:28

Answer 2
0 2021-10-07 23:53:07
