How can I dynamically split a text into sections based on multiple subtitles found in the given text?

Question

I have an original text that looks like this:

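We are AMS. We are a global total workforce solutions firm; we enable organisations to thrive in an age of constant change by building, re-shaping, and optimising workforces. Our Contingent Workforce Solutions (CWS) is one of our service offerings; we act as an extension of our clients' recruitment team and provide professional interim and temporary resources.

We are currently working with our client, Royal London.

Royal London is a financial services company with a difference. As the UK's largest mutual life, pensions and investment company, we are owned by our members and work for their benefit, not that of shareholders. We are growing fast and are recognised as one of the UK's best-rated places to work.

Today, Royal London manages over £114 billion in funds and has around 3,500 employees across six offices in the UK and Ireland. We strive to be experts in our specialist markets and to build a trusted brand, and our teams have won many awards for it. Whatever team you are interested in joining, and whatever your role, we will help you make a difference.

We are looking for a Business Analyst for a 6-month contract based in London.

Purpose of the Role:

You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades

As the Business Analyst, you will be responsible for:

Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.

The skills, attributes and capabilities we are seeking from you include:

Strong communication both verbal and written
Strong teamworking within the scrum team and with other BAs and directly with business users
Significant asset management experience
Working knowledge of the key data sets that are used by an asset manager
Experience of Master Data Management tools, ideally IHS Markit EDM
Agile working experience
Ability to write user stories to detail the requirements that both the development team and the QA team will use
Strong SQL skills, ideally using Microsoft SQL Server
Experience of managing data interface mapping documentation
Familiarity with data modelling concepts
Project experience based on ETL and Data Warehousing advantageous
Technical (development) background advantageous
Have an asset management background.
Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable. This client will only accept workers operating via an engagement model.

If you are interested in applying for this position and meet the criteria outlined above, please click the link to apply and speak to one of our sourcing specialists now.

AMS, a Recruitment Process Outsourcing Company, may in the delivery of some of its services be deemed to operate as an Employment Agency or an Employment Business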

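I have used the approach below to split and extract the text by subtitle from the original HTML using Beautiful Soup. Basically, the goal is to:

Separate the HTML extract by its bold text.
From this list of bold text, extract the items that are both bold and contain a ":" to indicate that they are legitimate subtitles.
Then find the positions of the first and last legitimate subtitles within the list of bold text. This helps with splitting the text if there is other bold text, lacking a ":", below the text of the last subtitle.
Split on the condition that the last subtitle is indeed the last element in the list of bold text; if it is not, split the text further to separate the subtitle's text from the other text.

The code below demonstrates this: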

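import re

import requests
from bs4 import BeautifulSoup as BS
from fake_useragent import UserAgent

def build_headers():
    # Build request headers with a randomly chosen Chrome User-Agent string.
    ua = UserAgent()
    chrome_header = ua.chrome
    return {'User-Agent': chrome_header}

# Renamed the function so that this variable no longer shadows it.
headers = build_headers()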

r5 = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)

soup_description = BS(r5.text, 'html.parser')
j_description = soup_description.find('span', {'itemprop':'description'})
j_description_subtitles = [j.text for j in j_description.find_all('strong')]
sub_titles_in_description = [el for el in j_description_subtitles if ":" in el]

total_length_of_sub_titles = len(sub_titles_in_description)
total_length_of_strong_tags = len(j_description_subtitles)
Position_of_first_sub_title = j_description_subtitles.index(sub_titles_in_description[0])
Position_of_last_sub_title = j_description_subtitles.index(sub_titles_in_description[-1])

# If the last subtitle is not the last <strong> element (index total_length_of_strong_tags - 1), also split on the bold text that follows it.
if Position_of_last_sub_title != total_length_of_strong_tags - 1:
    text_after_sub_t= re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}| {j_description_subtitles[Position_of_last_sub_title+1]}',j_description.text)[1:Position_of_last_sub_title]
else:
    text_after_sub_t= re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}',j_description.text)[1:]

final_dict_with_sub_t_n_prec_txt= {
    sub_titles_in_description[0]: text_after_sub_t[0],
    sub_titles_in_description[1]: text_after_sub_t[1],
    sub_titles_in_description[2]: text_after_sub_t[2]
    
}

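The problem is the splitting of the text based on the subtitles. It is too manual, and the other approaches I have tried were to no avail. How would I make this part dynamic, given that the number of subtitles will vary in future texts?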

Answer 1

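You could simplify this, or make it more generic, by using CSS selectors to select the elements. For example, p:has(strong:-soup-contains(":")) selects every <p> that contains a <strong> whose text includes a ":". Use find_next_sibling() to get the additional information: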

dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))

Note: | is passed as the separator to get_text(), so in this case you can split the value into list elements later. You could also replace it with a space: get_text(' ',strip=True)
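
As a minimal sketch of that later split: the dictionary below is a hypothetical, shortened stand-in shaped like the result of the one-liner above (the keys and values are assumptions for illustration, not the scraped data). Each "|"-joined value is turned into a list of its fragments.

```python
# Hypothetical result shaped like the dictionary built by the one-liner
# (values shortened for illustration).
data = {
    "Purpose of the Role:": "You will be working with the internal data squad.",
    "The skills we are seeking:": "Strong SQL skills|Agile working experience|Familiarity with data modelling concepts",
}

# Split each "|"-joined value into a list of its individual text fragments.
data_as_lists = {title: text.split("|") for title, text in data.items()}

for title, items in data_as_lists.items():
    print(title, items)
```

This only works cleanly if the scraped text never contains a literal "|" itself; if it might, a separator that cannot occur in the text would be a safer choice.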

Example

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)

soup = BeautifulSoup(r.text, 'html.parser')

data = dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))

print(data)

Output

{'Purpose of the Role:': 'You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades',
 'As the Business Analyst, you will be responsible for:': 'Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.',
 'The skills, attributes and capabilities we are seeking from you include:': 'Strong communication both verbal and written|Strong teamworking within the scrum team and with other BAs and directly with business users|Significant asset management experience|Working knowledge of the key data sets that are used by an asset manager|Experience of Master Data Management tools, ideally IHS Markit EDM|Agile working experience|Ability to write user stories to detail the requirements that both the development team and the QA team will use|Strong SQL skills, ideally using Microsoft SQL Server|Experience of managing data interface mapping documentation|Familiarity with data modelling concepts|Project experience based on ETL and Data Warehousing advantageous|Technical (development) background advantageous|Have an asset management background.|Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable.'}
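
If only the plain text and a list of subtitles are available, the regex idea from the question can also be made independent of the number of subtitles. The following is a rough sketch of that alternative, not part of the answer above; split_by_subtitles, sample_text and sample_subtitles are hypothetical names and data introduced for illustration.

```python
import re


def split_by_subtitles(text, subtitles):
    """Map each subtitle to the text that follows it, up to the next subtitle."""
    if not subtitles:
        return {}
    # Try longer subtitles first so a short one cannot shadow a longer one,
    # and escape them so characters such as "(" or "?" match literally.
    ordered = sorted(subtitles, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in ordered) + ")"
    # With a capturing group, re.split keeps the matched subtitles:
    # [preamble, subtitle, body, subtitle, body, ...]
    parts = re.split(pattern, text)
    return {title: body.strip() for title, body in zip(parts[1::2], parts[2::2])}


# Hypothetical sample data, loosely modelled on the job description.
sample_text = (
    "We are looking for a Business Analyst. "
    "Purpose of the Role: Work with the internal data squad. "
    "You will be responsible for: Analysing data sets. "
    "Skills we are seeking: Strong SQL skills"
)
sample_subtitles = [
    "Purpose of the Role:",
    "You will be responsible for:",
    "Skills we are seeking:",
]

sections = split_by_subtitles(sample_text, sample_subtitles)
for title, body in sections.items():
    print(title, "->", body)
```

In this sketch, everything after the last subtitle is attributed to that subtitle. If trailing bold text without a ":" should end the last section, one option is to pass it in as an additional delimiter and discard its entry afterwards.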

Solution 1
1 Accepted 2022-05-15 05:13:40
