Need help joining dictionary items and removing newlines, multiple spaces, and special characters

I have a dictionary with 2 URLs and their text. I need to get rid of all the multiple spaces, special characters, and newlines:

{' https://firsturl.com ': ['\n\n', '\n ', '\n \n \n ', '\n \n ', '\n \n ', '\n \n ', '\n', '\n', '\n ', '\n ', 'Home | Sam ModelInc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n\n \n\n \n \n', '\n', '\n', '\n', '\n', '\n ', '\n ', 'Skip to main content'], ' https://secondurl.com#main-content ': ['\n\n', '\n', '\n \n \n ', '\n \n ', '\n \n ', '\n\n ', '\n', '\n', '\n ', '\n ', 'Home | Going to start inc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n \n\n\n \n \n', '\n', '\n', '\n', '\n', '\n', '\n', 'Skip to main content', ' \n ', '\n \n', '\n\n ', '\n\n ', '\n \n \n \n \n ', '\n\n ', '\n ', '\n\n \n ', '\n ', '\n\n \n ', '\n ', 'Brands', '\n', 'About Us', '\n', 'Syndication', '\n', 'Direct Response']}

Expected Output: {' https://firsturl.com ': ['home sam modelInc skip to main content'], ' https://secondurl.com#main-content ': ['home going to start inc skip to main content brands about us syndication direct response']}

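Any help would be appreciated.

So, rather than just throwing some code at you, let's walk through this step by step.

The first elements we want to remove are the newlines. So we can start with the following: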

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    ex_dict[x] = new_list

If you run it, you'll see that we have now filtered out all the newlines.
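One caveat with the "\n" not in e test: it also drops any item that has real text alongside a newline. A minimal sketch of an alternative, assuming you would rather keep such items, is to filter on e.strip() instead. The sample list below is made up for illustration and is not from the question.

```python
# Hypothetical sample: the last item mixes real text with a newline.
items = ["\n\n", "\n ", "Home | Sam ModelInc", "Skip to\nmain content"]

# Test used above: drops every item containing a newline, text or not.
no_newline = [e for e in items if "\n" not in e]

# Alternative: drop only items that are empty once whitespace is stripped.
non_blank = [e for e in items if e.strip()]

print(no_newline)  # ['Home | Sam ModelInc']
print(non_blank)   # ['Home | Sam ModelInc', 'Skip to\nmain content']
```

For the data in the question both tests give the same result, since every item with a newline there is whitespace only.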

Now we are left with the following:

Home | Sam ModelInc
Skip to main content
Home | Going to start inc
Brands
About Us
Syndication
Direct Response

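Based on your expected output, you want to lowercase all the words and remove the non-alphabetic characters.

I did some research on how to do that.

In code, it looks like this: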

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # Example of what the substitution does:
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    ex_dict[x] = new_list

So now, for the first URL, our new_list looks like: ['Home  Sam ModelInc', 'Skip to main content'] (note the double space left behind where the | was removed).
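It is worth seeing exactly what that character class removes, because it strips digits and accented letters as well as punctuation. Here is a small sketch. The second sample string is hypothetical and only there to show the effect.

```python
import re

# Same pattern as above: anything that is not an ASCII letter or a space.
regex = re.compile('[^a-zA-Z ]')

# First string is from the question; the second is a made-up example.
samples = ["Home | Sam ModelInc", "Top 10 Cafés"]
cleaned = [regex.sub("", s) for s in samples]

print(cleaned)  # ['Home  Sam ModelInc', 'Top  Cafs']
```

If your pages contain digits or non-English text that you want to keep, the pattern would need to be widened. For the sample data in the question it is enough.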

Next we want to lowercase everything.

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # Example of what the substitution does:
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]
    ex_dict[x] = new_list

Finally, we want to join everything together with only a single space between each word.

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # Example of what the substitution does:
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]

    new_list = [" ".join((" ".join(new_list)).split())]
    ex_dict[x] = new_list
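To see all four steps work on something closer to your data, here is a sketch that wraps them in a function and runs it on a shortened version of the dictionary from the question. The names clean_texts, pages, and NON_LETTERS are my own, the whitespace-only items are abbreviated, and I have left the surrounding spaces off the URL keys.

```python
import re

# Anything that is not an ASCII letter or a space.
NON_LETTERS = re.compile(r"[^a-zA-Z ]")


def clean_texts(pages):
    """Return a new dict mapping each URL to a one-item list of cleaned text."""
    cleaned = {}
    for url, items in pages.items():
        # 1. Drop the items that contain newlines.
        kept = [e for e in items if "\n" not in e]
        # 2 and 3. Strip non-letters, then lowercase.
        letters_only = [NON_LETTERS.sub("", e).lower() for e in kept]
        # 4. Join the items; split() and join() collapse repeated spaces.
        cleaned[url] = [" ".join(" ".join(letters_only).split())]
    return cleaned


# Shortened sample based on the question's dictionary.
pages = {
    "https://firsturl.com": [
        "\n\n", "\n ", "Home | Sam ModelInc", "\n", "Skip to main content",
    ],
    "https://secondurl.com#main-content": [
        "\n", "Home | Going to start inc", "\n \n", "Skip to main content",
        "Brands", "\n", "About Us", "\n", "Syndication", "\n", "Direct Response",
    ],
}

result = clean_texts(pages)
print(result["https://firsturl.com"])
# ['home sam modelinc skip to main content']
print(result["https://secondurl.com#main-content"])
# ['home going to start inc skip to main content brands about us syndication direct response']
```

Note that this gives 'modelinc' in all lowercase, whereas your expected output shows 'modelInc'. I am assuming that was a typo, since everything else in it is lowercased.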
