简体   繁体   English

Python:在列表中替换\\ n \\ r \\ t,但不包括以\\ n \\ n开头并以\\ n \\ r \\ n \\ t结尾的

[英]Python: replace \n \r \t in a list excluding those starting \n\n and ends with \n\r\n\t

My List: 我的列表:

['\\n\\r\\n\\tThis article is about sweet bananas. ['\\ n \\ r \\ n \\ t本文是关于甜香蕉的。 For the genus to which banana plants belong, see Musa (genus).\\n\\r\\n\\tFor starchier bananas used in cooking, see Cooking banana. 有关香蕉植物所属的属,请参阅Musa(属)。\\ n \\ r \\ n \\ t有关烹饪中使用的淀粉质香蕉,请参阅烹饪香蕉。 For other uses, see Banana (disambiguation)\\n\\r\\n\\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\\n\\r\\n\\tThey are grown in 135 countries.\\n\\n\\n\\r\\n\\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\\n\\nDescription\\n\\r\\n\\tThe banana plant is the largest herbaceous flowering plant.\\n\\r\\n\\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\\n\\nEtymology\\n\\r\\n\\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\\n'] 其他用途,请参见香蕉(消除歧义)\\ n \\ r \\ n \\ tMusa物种原产于热带Indomalaya和澳大利亚,很可能最早在巴布亚新几内亚驯化了。\\ n \\ r \\ n \\ t 135个国家/地区。\\ n \\ n \\ n \\ r \\ n \\ t全球范围内,“香蕉”和“大蕉”之间没有明显的区别。\\ n \\ n说明\\ n \\ r \\ n \\ t香蕉植物是最大的草本开花植物。\\ n \\ r \\ n \\ t香蕉植物的所有地上部分均来自通常称为“球茎”的结构。\\ n \\ n词源\\ n \\ r \\ n \\ t香蕉一词被认为是西非的起源,可能来自沃洛夫(Wolof)单词banaana,然后通过西班牙语或葡萄牙语传递给英语。\\ n']

Example code: 示例代码:

import requests
from bs4 import BeautifulSoup
import re
re=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
    list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)

After I scraping a web page, I need to do data cleansing. 抓取网页后,需要进行数据清理。 I want to replace all the "\\r" , "\\n" , "\\t" as "" , but I found that I have subtitle in this, if I do this, subtitles and sentences are going to mix together . 我想将所有的"\\r""\\n""\\t"都替换为"" ,但是我发现其中包含字幕,如果这样做, 字幕和句子将混合在一起

Every subtitle always starts with \\n\\n and ends with \\n\\r\\n\\t , is it possible that I can do something to distinguish them in this list like \\aEtymology\\a . 每个字幕总是以\\n\\n开头,以\\n\\r\\n\\t结尾,是否有可能我可以做一些区分它们的事情,例如\\aEtymology\\a It's not going to work if I replace \\n\\n and \\n\\r\\n\\t separately to \\a first cause other parts might have the same elements like this \\n\\n\\r and it will become like \\a\\r . 它不会,如果我更换工作\\n\\n\\n\\r\\n\\t分别以\\a第一因其他部分可能有这样相同的元素\\n\\n\\r ,它将变得像\\a\\r Thanks in advance! 提前致谢!

Approach 途径

  1. Replace the subtitles to a custom string, <subtitles> in the list 将字幕替换为自定义字符串,列表中的<subtitles>
  2. Replace the \\n , \\r , \\t etc. in the list 替换列表中的\\n\\r\\t
  3. Replace the custom string with the actual subtitle 用实际的字幕替换自定义字符串

Code

l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']

import re
regex=re.findall("\n\n.*.\n\r\n\t",l[0])
print(regex)

for x in regex:
    l = [r.replace(x,"<subtitles>") for r in l]

rep = ['\n','\t','\r']
for y in rep:
    l = [r.replace(y, '') for r in l]

for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)

Output 产量

['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']

['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
import re    

print([re.sub(r'[\n\r\t]', '', c) for c in list])

I think you may use regex 我认为您可以使用正则表达式

You can do this by using regular expressions: 您可以使用正则表达式来做到这一点:

import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]

\\g<1> is a backreference to the (\\w+) in the first regex. \\g<1>是对第一个正则表达式中(\\ w +)的反向引用。 It lets you reuse what ever is in there. 它使您可以重复使用其中的任何内容。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM