How to filter out correctly spaced words from an lxml data string using BeautifulSoup
Hi guys, here I'm getting a string that contains lots of HTML data (in a single string):
from bs4 import BeautifulSoup
import requests
import bs4

url = "any random url"
# download the page and parse it with the lxml parser
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# extract all the text of the page as one string
web_page = soup.get_text().strip()
print(web_page.lower())
and some of the words are coming out in the output like:
conditionstravel
for conditions
& travel
conditionstravel
for conditions
& travel
vaccinationstreatment
for vaccination
& treatment
vaccinationstreatment
for vaccination
& treatment
The web page scraping itself is correct, but this output is not expected, because some tags end with the text "conditions" and the next tag starts its text with "travel", so they come out joined as "conditionstravel".

Here I want to scrape the web page tag by tag and build the result as a web_page_data_list, so is there any way to scrape all the tags' texts in a separated state like above? The problem is that we can't supply a specific dictionary of words for this. Is that possible with Beautiful Soup, or will any other package help to extract this properly?
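For example, a minimal made-up snippet (not the actual page I'm scraping, just an assumed illustration) shows the same joining behaviour:

from bs4 import BeautifulSoup

# two adjacent tags: one ends with "conditions", the next starts with "travel"
txt = '<ul><li>for conditions</li><li>travel advice</li></ul>'
soup = BeautifulSoup(txt, 'html.parser')

# with plain get_text() the fragments are glued together
print(soup.get_text().lower())   # for conditionstravel advice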
Use the separator=' ' parameter of the .get_text() method. You can also pass strip=True to automatically strip whitespace characters from every separated piece of text.
For example:
import bs4
from bs4 import BeautifulSoup

txt = '''<div>Hello<span>World</span></div>'''
soup = BeautifulSoup(txt, 'html.parser')
# separator=' ' puts a space between the text of adjacent tags,
# strip=True removes surrounding whitespace from each piece
web_page = soup.get_text(strip=True, separator=' ')
print(web_page.lower())
print(bs4.__version__)
Prints:
hello world
4.9.1
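If you want a list of per-tag text fragments rather than one joined string (the web_page_data_list you mentioned), one possible approach is the .stripped_strings generator; this sketch assumes the same placeholder URL as in your question:

from bs4 import BeautifulSoup
import requests

url = "any random url"  # placeholder from the question
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

# .stripped_strings yields each stretch of text separately, already stripped
web_page_data_list = [s.lower() for s in soup.stripped_strings]
print(web_page_data_list)

Each entry corresponds to one stretch of text between tags, so nothing gets merged across tag boundaries.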