
How to combine multiple lines of a paragraph in HTML into one?

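I have an HTML file that contains the headings and paragraphs of a PDF file. In this file, however, every line of a paragraph is treated as a separate paragraph, so each line gets its own <p> tag and I cannot build a single multi-line paragraph. Can anyone suggest a way to solve this?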

This is what I get:

["<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
  "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
  "<p>cloud-based management and services. Forti accounts are free which require a license for ",
  "<p>each solution. "]

I want something like this:

['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, FortiWeb Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']

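This is what I have done so far:

import json
from bs4 import BeautifulSoup

paragraphs_1 = []
local_path = "file.json"
# load the list of HTML fragments from the JSON file
with open(local_path) as f:
    data = json.load(f)
for x in data:
    soup = BeautifulSoup(x, 'html.parser')
    # collect the text of every <p> element in this fragment
    for paragraph in soup.find_all("p"):
        paragraphs_1.append(paragraph.get_text())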

You can use the replace function to get rid of all the <p> tags, like this:

yourtext.replace("<p>", "") 

Try this code:

new_list = []
for text in my_list_of_text:
    # first remove <p>
    new_list.append(text.replace('<p>', ''))
# next step: join the pieces into one long string
listToStr = ' '.join(new_list)
# remove possible double spaces
final_text = listToStr.replace('  ', ' ')

There are more sophisticated ways, for example using simplenlg, but for your problem this code should be enough.
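As a slightly more defensive variant of the replace-and-join idea, the sketch below collapses any run of whitespace rather than only double spaces. The function name `merge_lines` is made up for illustration, and the sample list is copied from the question. Note that this version also drops the trailing space that appears in the desired output.

```python
# Sample input copied from the question.
my_list_of_text = [
    "<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
    "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
    "<p>cloud-based management and services. Forti accounts are free which require a license for ",
    "<p>each solution. ",
]


def merge_lines(lines):
    """Strip <p> tags, join the lines, and collapse runs of whitespace.

    Hypothetical helper, not part of any library.
    """
    joined = ' '.join(line.replace('<p>', '') for line in lines)
    # str.split() with no arguments splits on any whitespace run and
    # discards empty pieces, so re-joining leaves single spaces only.
    return ' '.join(joined.split())


final_text = merge_lines(my_list_of_text)
print(final_text)
```

Wrapping the result in a list, `[final_text]`, gives the single-element list the question asks for.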

The function below strips the tags out of raw_html:

import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

If you want to combine multiple elements of the list and return them as a single paragraph, you can try .join():

paragraph = cleanhtml(str(''.join(para)))

Output:

'Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. '

Or, to return it as a list:

paragraph = [cleanhtml(str(''.join(para)))]

Output:

['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']
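Joining with an empty string works here only because each line in the sample already ends with a space. A minimal sketch that does not rely on that, and that also decodes HTML entities, could look like the following. The name `combine_paragraphs` and the sample fragments are assumptions for illustration, and the regex is the same simple tag pattern used in cleanhtml, so it is not a full HTML parser.

```python
import html
import re

# Same non-greedy tag pattern as in cleanhtml above.
TAG_RE = re.compile(r'<.*?>')


def combine_paragraphs(para):
    """Remove tags from each fragment and join them with single spaces.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for fragment in para:
        # Drop the tags, decode entities such as &amp;, trim the edges.
        text = html.unescape(TAG_RE.sub('', fragment)).strip()
        if text:  # skip fragments that held only markup or whitespace
            pieces.append(text)
    return ' '.join(pieces)


# Made-up fragments: no trailing spaces, one entity, one empty paragraph.
para = ["<p>first line", "<p>second &amp; third</p>", "<p> "]
print(combine_paragraphs(para))  # prints: first line second & third
```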
