簡體   English   中英

Python:在列表中替換\\ n \\ r \\ t,但不包括以\\ n \\ n開頭並以\\ n \\ r \\ n \\ t結尾的

[英]Python: replace \n \r \t in a list excluding those starting \n\n and ends with \n\r\n\t

我的列表:

['\\ n \\ r \\ n \\ t本文是關於甜香蕉的。 有關香蕉植物所屬的屬,請參閱Musa(屬)。\\ n \\ r \\ n \\ t有關烹飪中使用的淀粉質香蕉,請參閱烹飪香蕉。 其他用途,請參見香蕉(消除歧義)\\ n \\ r \\ n \\ tMusa物種原產於熱帶Indomalaya和澳大利亞,很可能最早在巴布亞新幾內亞馴化了。\\ n \\ r \\ n \\ t 135個國家/地區。\\ n \\ n \\ n \\ r \\ n \\ t全球范圍內,“香蕉”和“大蕉”之間沒有明顯的區別。\\ n \\ n說明\\ n \\ r \\ n \\ t香蕉植物是最大的草本開花植物。\\ n \\ r \\ n \\ t香蕉植物的所有地上部分均來自通常稱為“球莖”的結構。\\ n \\ n詞源\\ n \\ r \\ n \\ t香蕉一詞被認為是西非的起源,可能來自沃洛夫(Wolof)單詞banaana,然后通過西班牙語或葡萄牙語傳遞給英語。\\ n']

示例代碼:

import requests
from bs4 import BeautifulSoup
import re
re=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
    list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)

抓取網頁后,需要進行數據清理。 我想將所有的"\\r""\\n""\\t"都替換為"" ,但是我發現其中包含字幕,如果這樣做, 字幕和句子將混合在一起

每個字幕總是以\\n\\n開頭,以\\n\\r\\n\\t結尾,是否有可能我可以做一些區分它們的事情,例如\\aEtymology\\a 它不會,如果我更換工作\\n\\n\\n\\r\\n\\t分別以\\a第一因其他部分可能有這樣相同的元素\\n\\n\\r ,它將變得像\\a\\r 提前致謝!

途徑

  1. 將字幕替換為自定義字符串,列表中的<subtitles>
  2. 替換列表中的\\n\\r\\t
  3. 用實際的字幕替換自定義字符串

l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']

import re
regex=re.findall("\n\n.*.\n\r\n\t",l[0])
print(regex)

for x in regex:
    l = [r.replace(x,"<subtitles>") for r in l]

rep = ['\n','\t','\r']
for y in rep:
    l = [r.replace(y, '') for r in l]

for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)

產量

['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']

['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
import re    

print([re.sub(r'[\n\r\t]', '', c) for c in list])

我認為您可以使用正則表達式

您可以使用正則表達式來做到這一點:

import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]

\\g<1>是對第一個正則表達式中(\\ w +)的反向引用。 它使您可以重復使用其中的任何內容。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM