繁体   English   中英

Python:在列表中替换\\ n \\ r \\ t,但不包括以\\ n \\ n开头并以\\ n \\ r \\ n \\ t结尾的

[英]Python: replace \n \r \t in a list excluding those starting \n\n and ends with \n\r\n\t

我的列表:

['\\ n \\ r \\ n \\ t本文是关于甜香蕉的。 有关香蕉植物所属的属,请参阅Musa(属)。\\ n \\ r \\ n \\ t有关烹饪中使用的淀粉质香蕉,请参阅烹饪香蕉。 其他用途,请参见香蕉(消除歧义)\\ n \\ r \\ n \\ tMusa物种原产于热带Indomalaya和澳大利亚,很可能最早在巴布亚新几内亚驯化了。\\ n \\ r \\ n \\ t 135个国家/地区。\\ n \\ n \\ n \\ r \\ n \\ t全球范围内,“香蕉”和“大蕉”之间没有明显的区别。\\ n \\ n说明\\ n \\ r \\ n \\ t香蕉植物是最大的草本开花植物。\\ n \\ r \\ n \\ t香蕉植物的所有地上部分均来自通常称为“球茎”的结构。\\ n \\ n词源\\ n \\ r \\ n \\ t香蕉一词被认为是西非的起源,可能来自沃洛夫(Wolof)单词banaana,然后通过西班牙语或葡萄牙语传递给英语。\\ n']

示例代码:

import requests
from bs4 import BeautifulSoup
import re
re=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
    list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)

抓取网页后,需要进行数据清理。 我想将所有的"\\r""\\n""\\t"都替换为"" ,但是我发现其中包含字幕,如果这样做, 字幕和句子将混合在一起

每个字幕总是以\\n\\n开头,以\\n\\r\\n\\t结尾,是否有可能我可以做一些区分它们的事情,例如\\aEtymology\\a 它不会,如果我更换工作\\n\\n\\n\\r\\n\\t分别以\\a第一因其他部分可能有这样相同的元素\\n\\n\\r ,它将变得像\\a\\r 提前致谢!

途径

  1. 将字幕替换为自定义字符串,列表中的<subtitles>
  2. 替换列表中的\\n\\r\\t
  3. 用实际的字幕替换自定义字符串

l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']

import re
regex=re.findall("\n\n.*.\n\r\n\t",l[0])
print(regex)

for x in regex:
    l = [r.replace(x,"<subtitles>") for r in l]

rep = ['\n','\t','\r']
for y in rep:
    l = [r.replace(y, '') for r in l]

for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)

产量

['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']

['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
import re    

print([re.sub(r'[\n\r\t]', '', c) for c in list])

我认为您可以使用正则表达式

您可以使用正则表达式来做到这一点:

import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]

\\g<1>是对第一个正则表达式中(\\ w +)的反向引用。 它使您可以重复使用其中的任何内容。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM