How to filter out correctly spaced words from an lxml data string using BeautifulSoup
Hi guys, here I'm getting a string that contains lots of HTML data (in a single string):
from bs4 import BeautifulSoup
import requests
import bs4

url = "any random url"
# download the page and parse it with the lxml parser
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# extract all the text of the page as one string
web_page = soup.get_text().strip()
print(web_page.lower())
and some of the words are coming out in the output like:
conditionstravel
for conditions
& travel
conditionstravel
for conditions
& travel
vaccinationstreatment
for vaccination
& treatment
vaccinationstreatment
for vaccination
& treatment
The web page scraping itself is correct, but this output is not expected, because some tags end with the text "conditions" and the next tag starts its text with "travel", so they come out joined as "conditionstravel".

Here I want to scrape the web page tag by tag and build the result as a web_page_data_list, so is there any way to scrape all the tags' texts in a separated state like above? The problem is that we can't supply a specific dictionary of words for this. Is that possible with Beautiful Soup, or will any other package help to extract this properly?
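For example, a minimal made-up snippet (not the actual page I'm scraping, just an assumed illustration) shows the same joining behaviour:

from bs4 import BeautifulSoup

# two adjacent tags: one ends with "conditions", the next starts with "travel"
txt = '<ul><li>for conditions</li><li>travel advice</li></ul>'
soup = BeautifulSoup(txt, 'html.parser')

# with plain get_text() the fragments are glued together
print(soup.get_text().lower())   # for conditionstravel advice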
Use the separator=' ' parameter of the .get_text() method. You can also pass strip=True to automatically strip whitespace characters from every separated piece of text.
For example:
import bs4
from bs4 import BeautifulSoup

txt = '''<div>Hello<span>World</span></div>'''
soup = BeautifulSoup(txt, 'html.parser')
# separator=' ' puts a space between the text of adjacent tags,
# strip=True removes surrounding whitespace from each piece
web_page = soup.get_text(strip=True, separator=' ')
print(web_page.lower())
print(bs4.__version__)
Prints:
hello world
4.9.1
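If you want a list of per-tag text fragments rather than one joined string (the web_page_data_list you mentioned), one possible approach is the .stripped_strings generator; this sketch assumes the same placeholder URL as in your question:

from bs4 import BeautifulSoup
import requests

url = "any random url"  # placeholder from the question
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

# .stripped_strings yields each stretch of text separately, already stripped
web_page_data_list = [s.lower() for s in soup.stripped_strings]
print(web_page_data_list)

Each entry corresponds to one stretch of text between tags, so nothing gets merged across tag boundaries.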