How can I append the content of multiple text files to a new text file based on specific conditions in Python?
How can I dynamically split text based on multiple sub-titles within a given text, for every new text?
I have a raw text that looks like this:
We are AMS. We are a global total workforce solutions firm; we enable organisations to thrive in an age of constant change by building, reshaping, and optimising workforces. Our Contingent Workforce Solutions (CWS) is one of our service offerings; we act as an extension of our clients' recruitment teams and provide professional contingent and temporary resources.
We are currently working with our client, Royal London.
Royal London is a financial services company with a difference. As the UK's largest mutual life, pensions and investment company, we are owned by our members and work for their benefit, not for shareholder profit. We have grown rapidly and are recognised as one of the UK's top-rated places to work.
Today, Royal London manages over £114 billion of funds and has around 3,500 employees across six offices in the UK and Ireland. We work hard to be experts in our chosen markets and to build a trusted brand, and our teams have won plenty of awards for it. Whatever team you are interested in joining and whatever role you play, we will help you make a difference.
We are looking for a Business Analyst for a 6-month contract in London.
Purpose of the Role:
You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of the project will involve system upgrades.
As the Business Analyst, you will be responsible for:
Looking at data sets, extracting the information and being able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.
The skills, attributes and capabilities we are seeking from you include:
If you are interested in applying for this position and meet the criteria above, please click the link to apply and speak with one of our sourcing specialists now.
AMS is a Recruitment Process Outsourcing company and may, in the delivery of some of its services, be deemed to operate as an Employment Agency or an Employment Business.
I have used the approach below, with Beautiful Soup, to split and extract the text from the raw HTML based on the sub-titles. The code below demonstrates this:
import re

import requests
from bs4 import BeautifulSoup as BS
from fake_useragent import UserAgent

def headers():
    ua = UserAgent()
    chrome_header = ua.chrome
    headers = {'User-Agent': chrome_header}
    return headers

headers = headers()
r5 = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)
soup_description = BS(r5.text, 'html.parser')
j_description = soup_description.find('span', {'itemprop': 'description'})
j_description_subtitles = [j.text for j in j_description.find_all('strong')]
sub_titles_in_description = [el for el in j_description_subtitles if ":" in el]
total_length_of_sub_titles = len(sub_titles_in_description)
total_length_of_strong_tags = len(j_description_subtitles)
Position_of_first_sub_title = j_description_subtitles.index(sub_titles_in_description[0])
Position_of_last_sub_title = j_description_subtitles.index(sub_titles_in_description[-1])

# If the last sub-title is not the final <strong> tag, also split on the next
# <strong> text so the last chunk does not run past it.
if Position_of_last_sub_title != total_length_of_strong_tags:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}|{j_description_subtitles[Position_of_last_sub_title + 1]}', j_description.text)[1:Position_of_last_sub_title]
else:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}', j_description.text)[1:]

final_dict_with_sub_t_n_prec_txt = {
    sub_titles_in_description[0]: text_after_sub_t[0],
    sub_titles_in_description[1]: text_after_sub_t[1],
    sub_titles_in_description[2]: text_after_sub_t[2],
}
The problem is the splitting of the text based on the sub-titles. It is far too manual, and the other approaches I have tried have not helped. How would I make this part dynamic, given that the number of sub-titles will vary in future texts?
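For context, the manual part above can be generalised by building the split pattern from however many sub-titles were found, instead of hard-coding indices 0, 1 and -1. A minimal sketch, using a hypothetical sample string in place of `j_description.text` and `re.escape` so that characters such as `(` or `?` in a sub-title are treated literally:

```python
import re

# Hypothetical sample standing in for j_description.text.
description = (
    "Intro paragraph. Purpose of the Role: do analysis. "
    "Responsibilities: look at data. Skills: SQL."
)
# Sub-titles as extracted from the <strong> tags (assumed already collected).
sub_titles = ["Purpose of the Role:", "Responsibilities:", "Skills:"]

# Build one alternation pattern from all sub-titles, however many there are,
# escaping each so regex metacharacters are matched literally.
pattern = "|".join(re.escape(t) for t in sub_titles)

# The first chunk is the text before the first sub-title; drop it.
chunks = re.split(pattern, description)[1:]

# Pair each sub-title with the text that follows it.
result = dict(zip(sub_titles, (c.strip() for c in chunks)))
print(result)
```

Because the pattern and the `zip` both derive from the same `sub_titles` list, the same code handles any number of sub-titles without hard-coded indices.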
You could simplify this, or make it more generic, by selecting the elements with CSS selectors. For example, `p:has(strong:-soup-contains(":"))` will select all `<p>` tags that have a child `<strong>` containing a `:`. Use `find_next_sibling()` to pick up the accompanying information:
dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))
Note: `|` has been added as the separator in `get_text()`, so in this case you can split the list elements later. You could also replace it with a space: `get_text(' ', strip=True)`.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)
soup = BeautifulSoup(r.text, 'html.parser')
data = dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))
print(data)
{'Purpose of the Role:': 'You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades',
'As the Business Analyst, you will be responsible for:': 'Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.',
'The skills, attributes and capabilities we are seeking from you include:': 'Strong communication both verbal and written|Strong teamworking within the scrum team and with other BAs and directly with business users|Significant asset management experience|Working knowledge of the key data sets that are used by an asset manager|Experience of Master Data Management tools, ideally IHS Markit EDM|Agile working experience|Ability to write user stories to detail the requirements that both the development team and the QA team will use|Strong SQL skills, ideally using Microsoft SQL Server|Experience of managing data interface mapping documentation|Familiarity with data modelling concepts|Project experience based on ETL and Data Warehousing advantageous|Technical (development) background advantageous|Have an asset management background.|Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable.'}
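Since the values are joined with `|`, you can split them back into lists afterwards. A small follow-on sketch, using a hypothetical shortened `data` dict in place of the real scraped one (the key names are taken from the printed output above):

```python
# Hypothetical shortened sample of the scraped dict shown above.
data = {
    'Purpose of the Role:': 'You will be working with the internal data squad',
    'The skills, attributes and capabilities we are seeking from you include:':
        'Strong communication both verbal and written|Agile working experience|Strong SQL skills',
}

# Split each value on the '|' separator chosen in get_text('|', strip=True);
# single-paragraph values simply become one-element lists.
as_lists = {title: text.split('|') for title, text in data.items()}

print(as_lists['The skills, attributes and capabilities we are seeking from you include:'])
```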