How to combine multiple lines of a paragraph in HTML into one?
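I have an HTML file that contains the headings and paragraphs of a PDF file. In this file, however, every line of a paragraph is treated as a separate paragraph, so it produces many <p> tag lines and I cannot build a single paragraph out of multiple lines. Can anyone suggest a way to solve this?
This is what I get: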
["<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
"<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
"<p>cloud-based management and services. Forti accounts are free which require a license for ",
"<p>each solution. "]
This is what I want:
['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, FortiWeb Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']
This is what I have done so far:
import json
from bs4 import BeautifulSoup

paragraphs_1 = []
local_path = "file.json"
with open(local_path) as f:
    data = json.load(f)
for x in data:
    soup = BeautifulSoup(x, 'html.parser')
    for paragraphs in soup.find_all("p"):
        paragraphs_1.append(paragraphs.get_text())
You can use the replace function to get rid of all the <p> tags, like:
yourtext.replace("<p>", "")
Try this code:
new_list = []
for text in my_list_of_text:
    # first remove <p>
    new_list.append(text.replace('<p>', ''))
# next, build one long string by joining the list elements
listToStr = ' '.join([str(elem) for elem in new_list])
# collapse any double spaces left by the join
final_text = listToStr.replace('  ', ' ')
There are more sophisticated approaches, for example using simplenlg, but for your problem this code should be enough.
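As a minimal sketch of the same idea wrapped in a function: the name merge_lines and the shortened sample list are illustrative assumptions, not part of the question. It removes the <p> tags from each fragment and then uses split() and join() to collapse any run of whitespace, so the trailing space on each fragment does not turn into a double space.

```python
def merge_lines(fragments):
    """Join HTML line fragments into one paragraph string (hypothetical helper)."""
    # Drop the opening and closing <p> tags from every fragment.
    stripped = [f.replace("<p>", "").replace("</p>", "") for f in fragments]
    # split() with no arguments splits on any whitespace run, so re-joining
    # with single spaces collapses doubled spaces and trims both ends.
    return " ".join(" ".join(stripped).split())


lines = [
    "<p>Forti accounts are free which require a license for ",
    "<p>each solution. ",
]
print(merge_lines(lines))
```

Unlike a single replace of two spaces with one, the split/join step also handles three or more consecutive spaces and embedded newlines.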
The function below helps strip the HTML tags from raw_html:
import re

def cleanhtml(raw_html):
    # non-greedy match of anything between '<' and '>'
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
If you want to combine multiple elements of a list and return them as a single paragraph, you can try .join():
paragraph = cleanhtml(str(''.join(para)))
Output:
'Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. '
Or, to return it as a list:
paragraph = [cleanhtml(str(''.join(para)))]
Output:
['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']
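Joining with an empty string works here only because every fragment already ends with a space. A slightly more defensive sketch, assuming the fragments may also contain HTML entities or irregular spacing, could look like the following. The helper name fragments_to_paragraph and the sample data are hypothetical, and regex-based tag stripping is only a rough approach suited to simple fragments like these, not arbitrary HTML.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches any single HTML tag


def fragments_to_paragraph(fragments):
    """Hypothetical helper: strip tags, decode entities, normalise spaces."""
    # Join with a space so fragments without a trailing space stay separated.
    text = TAG_RE.sub("", " ".join(fragments))
    text = html.unescape(text)      # e.g. "&amp;" becomes "&"
    return " ".join(text.split())   # collapse whitespace runs, trim the ends


para = [
    "<p>cloud-based management &amp; services. ",
    "<p>Forti accounts are free. ",
]
print([fragments_to_paragraph(para)])
```

Wrapping the call in a list, as in the last line, gives the single-element list form asked for in the question.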