
Need help joining dictionary items and removing newlines, multiple spaces, and special characters

I have a dictionary with 2 URLs and their text. I need to get rid of all multiple spaces, special characters, and newlines:

{' https://firsturl.com ': ['\n\n', '\n ', '\n \n \n ', '\n \n ', '\n \n ', '\n\n ', '\n', '\n', '\n ', '\n ', 'Home | Sam ModelInc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n\n\n\n \n \n', '\n', '\n', '\n', '\n', '\n ', '\n ', 'Skip to main content'],
 ' https://secondurl.com#main-content ': ['\n\n', '\n ', '\n \n \n ', '\n \n ', '\n \n ', '\n\n ', '\n', '\n', '\n ', '\n ', 'Home | Going to start inc', '\n \n\n\n', '\n\n\n\n', '\n\n', '\n \n\n\n\n\n\n\n\n\n \n \n', '\n', '\n', '\n', '\n', '\n ', '\n ', 'Skip to main content', '\n ', '\n \n', '\n\n ', '\n\n ', '\n \n \n \n \n ', '\n\n ', '\n ', '\n\n \n ', '\n ', '\n\n \n ', '\n ', 'Brands', '\n', 'About Us', '\n', 'Syndication', '\n', 'Direct Response']}

Expected Output: {' https://firsturl.com ': ['home sam modelInc skip to main content'], ' https://secondurl.com#main-content ': ['home going to start inc skip to main content brands about us syndication direct response']}

Help would be much appreciated.

So let's try to walk through this instead of just throwing some code at you.

The first element we want to get rid of is the newline. So we could start with something like:

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    ex_dict[x] = new_list

If you run that, you'll see that we now filter out every entry that contains a newline.
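One caveat with the `"\n" not in e` test: it also discards any entry that holds real text plus a newline (for example `'Home\n'`). That does not happen in the data from the question, but here is a minimal sketch of a more forgiving filter, assuming whitespace-only entries are the ones you want to drop. The sample data is made up for illustration.

```python
# Hypothetical sample data: one entry mixes real text with a trailing newline.
ex_dict = {"a": ["\n\n", "\n ", "Home | Sam ModelInc\n", "Skip to main content"]}

# Filter from the walkthrough: drops every entry that contains a newline.
strict = {k: [e for e in v if "\n" not in e] for k, v in ex_dict.items()}

# Alternative: strip each entry and keep it only if something is left.
lenient = {k: [e.strip() for e in v if e.strip()] for k, v in ex_dict.items()}

print(strict)   # {'a': ['Skip to main content']}
print(lenient)  # {'a': ['Home | Sam ModelInc', 'Skip to main content']}
```

Which one fits depends on whether your scraped text can contain embedded newlines.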

Now we have the following cases:

Home | Sam ModelInc
Skip to main content
Home | Going to start inc
Brands
About Us
Syndication
Direct Response

According to your expected output, you want to lowercase all words and remove non-alphabet characters.

I did a little research on how to do that.

In code, that looks like:

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # regex.sub("", "Home | Sam ModelInc") returns 'Home  Sam ModelInc'
    # (two spaces remain where the "|" was removed)
    new_list = [regex.sub("", e) for e in new_list]
    ex_dict[x] = new_list

So now, for the first URL in your data, our final new_list looks something like: ['Home  Sam ModelInc', 'Skip to main content'] (the double space is left over from the removed "|").
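To see what the pattern does on its own, here is a small check against a few strings taken from the question. It is only a sketch of the substitution step, separate from the dictionary loop.

```python
import re

# Same pattern as above: anything that is not an ASCII letter or a space is removed.
regex = re.compile('[^a-zA-Z ]')

samples = ["Home | Sam ModelInc", "Skip to main content", "About Us"]
cleaned = [regex.sub("", s) for s in samples]

print(cleaned)
# ['Home  Sam ModelInc', 'Skip to main content', 'About Us']
```

Note that digits and accented letters are removed too, since the character class only lists a-z, A-Z, and the space.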

Next we want to lowercase everything.

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # regex.sub("", "Home | Sam ModelInc") returns 'Home  Sam ModelInc'
    # (two spaces remain where the "|" was removed)
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]
    ex_dict[x] = new_list

And lastly, we want to combine everything with only one space between each word. Joining the entries with " ", splitting on whitespace, and joining again collapses any runs of spaces.

import re

regex = re.compile('[^a-zA-Z ]') # had to tweak the linked solution to include spaces

ex_dict = {"a": ["\n\n", "\n"]}

for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]

    # regex.sub("", "Home | Sam ModelInc") returns 'Home  Sam ModelInc'
    # (two spaces remain where the "|" was removed)
    new_list = [regex.sub("", e) for e in new_list]

    new_list = [e.lower() for e in new_list]

    new_list = [" ".join((" ".join(new_list)).split())]
    ex_dict[x] = new_list
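Putting the steps together, here is a sketch that wraps the same logic in a function and runs it on a shortened version of the data from the question. The names `clean_pages`, `NON_LETTERS`, and `pages` are made up for this example, and most of the newline-only entries are omitted to keep it readable.

```python
import re

NON_LETTERS = re.compile('[^a-zA-Z ]')  # same pattern as in the walkthrough

def clean_pages(pages):
    """Return a new dict mapping each URL to a one-element list of cleaned text."""
    result = {}
    for url, chunks in pages.items():
        # Step 1: drop entries that contain a newline.
        kept = [c for c in chunks if "\n" not in c]
        # Steps 2 and 3: keep only letters and spaces, then lowercase.
        kept = [NON_LETTERS.sub("", c).lower() for c in kept]
        # Step 4: join, then split/join again to collapse runs of spaces.
        result[url] = [" ".join(" ".join(kept).split())]
    return result

# Shortened version of the question's data (keys keep their original spacing).
pages = {
    ' https://firsturl.com ': ['\n\n', '\n ', 'Home | Sam ModelInc', '\n \n\n\n',
                               'Skip to main content'],
    ' https://secondurl.com#main-content ': ['\n', 'Home | Going to start inc', '\n\n',
                                             'Skip to main content', '\n ', 'Brands', '\n',
                                             'About Us', '\n', 'Syndication', '\n',
                                             'Direct Response'],
}

cleaned = clean_pages(pages)
print(cleaned)
```

Because everything is lowercased, `ModelInc` comes out as `modelinc`, which differs slightly from the `modelInc` spelled in the expected output. The function builds a new dictionary rather than modifying the input, so the original data stays available for comparison.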
