
Extract text from HTML tags and plain text (not wrapped in tags)

<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a> 
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a> 
from one's 
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>

I am trying to reconstruct the sentence "to pay charges from one's bank account", which is split across the HTML above. My problem is that one part of the sentence is not wrapped in HTML tags. When I try to use:

BeautifulSoup.find_all()

I only get the text between the link tags, and when I try to use

BeautifulSoup.contents

I only get "from one's", but not the text between the link tags.
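
To see why each call returns only part of the sentence, here is a minimal sketch. The shortened `href="#"` values and the variable names are placeholders of mine, not taken from the dictionary site, and `html.parser` is just one parser choice.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Simplified copy of the example paragraph (hrefs shortened for illustration)
html = (
    '<p class="qotCJE">'
    '<a href="#">to pay</a> '
    '<a href="#">charges</a> '
    "from one's "
    '<a href="#">bank account</a>'
    '</p>'
)
p = BeautifulSoup(html, "html.parser").p

# find_all("a") visits only the <a> tags, so the bare text is skipped
link_texts = [a.get_text() for a in p.find_all("a")]

# .contents lists every direct child; keep only the bare strings here
bare_strings = [str(c) for c in p.contents if isinstance(c, NavigableString)]

print(link_texts)
print(bare_strings)
```

Neither list alone holds the whole sentence, which is why the two kinds of children have to be combined in document order.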

Is there a way to go through this code and reconstruct the sentence?

Edit: The above code is just an example. I am trying to scrape a dictionary, so the order of the strings, and which parts are inside or outside tags, will be arbitrary.

Edit: After digging into the dictionary website a bit, I came up with the following solution. Under each <p> tag of a sentence, we could do the following:

from bs4.element import NavigableString, Tag


res = []

for segment in p.contents:
    if isinstance(segment, NavigableString):
        res.append(segment)
    elif isinstance(segment, Tag):
        res.append(segment.text)

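# [:-2] drops the last two segments, presumably trailing nodes on the dictionary page;
# for the example <p> above, ''.join(res) keeps the whole sentence
final_sentence = ''.join(res[:-2])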

Hope it helps.
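
For reference, a self-contained sketch of the loop above that runs against the example `<p>`. The function name `sentence_from`, the `html.parser` choice, and the whitespace normalization are my own additions. The `[:-2]` slice is left out because the example paragraph has no trailing nodes to discard.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def sentence_from(p):
    """Join the direct children of a <p> tag into one sentence (hypothetical helper)."""
    parts = []
    for segment in p.contents:
        if isinstance(segment, NavigableString):
            # bare text that is not wrapped in any tag
            parts.append(str(segment))
        elif isinstance(segment, Tag):
            # text inside a child tag such as <a>
            parts.append(segment.get_text())
    # collapse newlines and repeated spaces into single spaces
    return " ".join("".join(parts).split())


html = """<p class="qotCJE">
<a href="#">to pay</a>
<a href="#">charges</a>
from one's
<a href="#">bank account</a>
</p>"""

p = BeautifulSoup(html, "html.parser").p
print(sentence_from(p))
```

Because the children are visited in document order, this should work regardless of which parts happen to be inside or outside tags.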


If you just want to extract the text from the title attributes, you could do:

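from bs4 import BeautifulSoup

# assuming text is the HTML given above
soup = BeautifulSoup(text, 'html5lib')
a_tags = soup.select('a')
# each title looks like "to payの意味"; drop the "の意味" suffix
a_strs = [a['title'].replace('の意味', '') for a in a_tags]
# unpack the list so each placeholder receives one string
final_sentence = "{} {} from one's {}".format(*a_strs)

from bs4 import BeautifulSoup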

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

print(soup.text)
# to pay
# charges
# from one's
# bank account

# strip() removes the leading and trailing spaces left by the outer newlines
print(soup.text.replace('\n', ' ').strip())
# to pay charges from one's bank account

Another approach to achieve the same:

from bs4 import BeautifulSoup

content = """
<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>
"""
soup = BeautifulSoup(content,"lxml")
print(soup.get_text(" ",strip=True))

Output:

to pay charges from one's bank account
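
When a page holds several sentences, the same `get_text` call can be scoped to each `<p>` rather than the whole document. This is only a sketch: the second paragraph, the wrapping `<div>`, and the shortened hrefs are invented for illustration. It also assumes every sentence paragraph carries the `qotCJE` class from the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with two sentence paragraphs
page = """
<div>
<p class="qotCJE"><a href="#">to pay</a> <a href="#">charges</a> from one's <a href="#">bank account</a></p>
<p class="qotCJE">to open a <a href="#">bank account</a></p>
</div>
"""

soup = BeautifulSoup(page, "html.parser")

# one get_text call per paragraph keeps the sentences separate
sentences = [p.get_text(" ", strip=True) for p in soup.select("p.qotCJE")]

for sentence in sentences:
    print(sentence)
```

One caveat: the `" "` separator is inserted between every text node, so punctuation that directly follows a tag (for example `</a>.`) would come out with a space in front of it.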

