
HTML parsing - getting text between all tags

I want to get the text between all the tags in a specific tr. I have looked at similar questions, but they are specific to one tag type.

If I do something like this:

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text)

That works for one particular tag, but how do I do it for the complete tr?

<tr>
   <td style="border:0px solid black;padding: 0px 5.4pt;border-color: currentColor windowtext windowtext;border-style: none solid solid;border-width: medium 0pt 0pt;background: white;" width="39">
      <p align="center" style="min-height: 8pt; padding: 0px; text-align: center;"> </p>
   </td>
   <td colspan="7" style="border:0px solid black;vertical-align: top;text-align: left;padding: 0px 5.4pt;border-color: currentColor windowtext windowtext currentColor;border-style: none solid solid none;border-width: medium 0pt 0pt medium;background: white;" width="683">
      <ol style="list-style-type: decimal;">
         <li>Process the return per standard procedures. Refer to the <a class="jive-link-wiki-small" data-containerid="2456" data-containertype="14" data-objectid="12425" data-objecttype="102" href="https://iconnect.sprint.com/docs/DOC-12425">Sprint Satisfaction Guarantee Procedure</a> for steps.</li>
         <li>RMS will reset the eligibility when doing a <strong>Sprint Monthly Installments Return</strong>. If the original transaction was performed in RMS, the system will display a message and advise that a history transaction can be performed or you can proceed with a No History Return</li>
         <li>
            To reset Monthly Installments upgrade eligibility and process the return:
            <ol>
               <li>Return the device.</li>
               <li>Re-access the account to see if the line is still <strong>upgrade-eligible for Monthly Installments</strong>.</li>
            </ol>
            <ul>
               <ul>
                  <li><strong>If so,</strong> proceed with the sale as normal.</li>
                  <li>
                     If the customer's line is showing as <strong>not upgrade-eligible</strong> for Monthly Installments:
                     <ol>
                        <li>Add a note to the customer's account stating the return transaction number and the need for eligibility reset.</li>
                        <li>Reset the customer's eligibility by using the MSA tablet or through iCare <em><strong>or</strong></em></li>
                        <li>Contact <strong>NSS</strong> to request an eligibility reset <strong>only</strong> if the reset was <strong>not successful</strong>.<strong> </strong></li>
                     </ol>
                  </li>
               </ul>
               <ul>
                  <li><span style="font-family: Arial;">Once eligibility is reset, pull up the customer's account again in RMS and process the sale.</span></li>
               </ul>
            </ul>
         </li>
      </ol>
   </td>
</tr>

The expected output is the text between all the tags.

get_text() gets all the child strings and returns them concatenated using the given separator.

text is a property that wraps the get_text() method (undocumented).

print(soup.select('tr')[0].text)
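
To make the separator behaviour concrete, here is a minimal sketch on a cut-down row. The HTML snippet and the variable names (`row`, `pieces`, `joined`) are made up for illustration, and it uses the built-in `html.parser` so it runs without lxml. It shows `get_text()` with an explicit separator next to `stripped_strings`, which yields the same text nodes one at a time.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the row in the question.
html = """
<table><tr>
  <td><p> </p></td>
  <td><ol>
    <li>Process the <a href="#">return</a> per standard procedures.</li>
    <li>RMS will reset the <strong>eligibility</strong> flag.</li>
  </ol></td>
</tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr')

# One string per text node; whitespace-only nodes are dropped.
pieces = list(row.stripped_strings)

# The same pieces joined with a chosen separator.
joined = row.get_text(separator=' ', strip=True)
print(joined)
```

Without `strip=True`, the whitespace between tags is kept as-is, which is why the plain `.text` output of a real page usually contains many blank lines.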

With indentation by nesting level:

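import bs4

# Parse the saved HTML; the context manager closes the file afterwards.
with open('h.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')

def get_text(item):
    """Return the direct text of item plus the text of its strong/span children."""
    parts = []
    for child in item.contents:
        # Exact type check, so comments (a NavigableString subclass) are skipped.
        if type(child) is bs4.element.NavigableString:
            parts.append(child.strip())
        elif child.name in ('strong', 'span'):
            parts.append(child.text.strip())
    return ' '.join(parts)

for li in soup.select('li'):
    # Nesting level = number of enclosing ol/ul ancestors, minus one.
    level = len(li.find_parents(['ol', 'ul'])) - 1
    print(' ' * level * 5, get_text(li))
    print('-' * 50)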
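
One limitation of the helper above is that it only keeps text from `strong` and `span` children, so the text of links (`a`) or `em` tags is dropped. The following is a sketch of one possible variant, not the answer's method: it keeps the text of every child tag except nested `ol`/`ul` lists, whose items are reported on their own lines. The names `own_text` and `indented_items` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def own_text(li):
    """Text of one <li>, excluding text inside nested <ol>/<ul> lists."""
    parts = []
    for child in li.children:
        if isinstance(child, Tag):
            if child.name in ('ol', 'ul'):
                continue  # nested lists are reported by their own <li> items
            parts.append(child.get_text(' ', strip=True))
        elif type(child) is NavigableString:
            parts.append(child.strip())
    return ' '.join(p for p in parts if p)

def indented_items(soup):
    """Yield (level, text) for every <li>; level 0 is the outermost list."""
    for li in soup.find_all('li'):
        level = len(li.find_parents(['ol', 'ul'])) - 1
        yield level, own_text(li)

# Hypothetical sample input.
html = """
<ol>
  <li>Refer to the <a href="#">guide</a> for steps.</li>
  <li>To reset:
    <ol>
      <li>Return the <strong>device</strong> today.</li>
    </ol>
  </li>
</ol>
"""
soup = BeautifulSoup(html, 'html.parser')
items = list(indented_items(soup))
for level, text in items:
    print(' ' * (level * 5) + text)
```

Returning `(level, text)` pairs instead of printing directly keeps the indentation width a display decision made by the caller.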
