Format Text File in Python
Sample text file:
["abc","123","apple","red","<a href='link1'>zzz</a>"],
["abc","124","orange","blue","<a href='link1'>zzz</a>"],
["abc","125","almond","black","<a href='link1'>zzz</a>"],
["abc","126","mango","pink","<a href='link1'>zzz</a>"]
Expected output:
abc 123 apple red 'link1'>zzz
abc 124 orange blue 'link1'>zzz
abc 125 almond black 'link1'>zzz
abc 126 mango pink 'link1'>zzz
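I want the file without the square brackets, with the commas replaced by spaces, and with only the link kept from the last element of each line.
I tried using lists in Python.
I don't know how to proceed. I guess I'm going wrong somewhere. Any help would be appreciated. Thanks in advance :)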
import sys
import re
Lines = [Line.strip() for Line in open (sys.argv[1],'r').readlines()]
for EachLine in Lines:
    Parts = EachLine.split(",")
    for EachPart in Parts:
        EachPart = re.sub(r'[', '', EachPart)
        EachPart = re.sub(r']', '', EachPart)
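The attempt above fails for two reasons: an unescaped `[` opens a character class in a regular expression, so the pattern `r'['` raises an error when it is compiled, and the cleaned value of `EachPart` is never stored or printed. Below is a minimal corrected sketch that avoids `re` altogether. It assumes every field is wrapped in double quotes and contains no commas, as in the sample; the helper name `clean_line` is hypothetical.

```python
def clean_line(line):
    """Turn one '["abc","123",...],' line into a list of bare field strings."""
    # Drop surrounding whitespace, the trailing comma, and the square brackets.
    body = line.strip().rstrip(",").strip("[]")
    # Split on commas and strip the double quotes around each field.
    # Assumption: no field contains a comma of its own.
    return [part.strip().strip('"') for part in body.split(",")]


sample = '''["abc","123","apple","red","<a href='link1'>zzz</a>"],'''
print(" ".join(clean_line(sample)))
```

This only removes the brackets, quotes, and commas; trimming the last field down to the link is a separate step, which the answers handle with a regular expression or a slice.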
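This can be done with the following script:
import csv
import re
with open('input.txt', 'r') as f_input, open('output.txt', 'w') as f_output:
    # Use the double quote as a plain delimiter; quote handling is switched off
    csv_input = csv.reader(f_input, delimiter='"', quotechar=None, quoting=csv.QUOTE_NONE)
    for cols in csv_input:
        if cols:  # skip blank lines
            # Odd positions hold the field values; even positions hold '[', ',' and '],'
            cols = cols[1:-1:2]
            # Keep the text from the first single quote up to the next '<'
            link = re.search(r"('.*?)<", cols[-1])
            if link:
                cols[-1] = link.group(1)
            f_output.write('{}\n'.format(' '.join(cols)))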
This gives you an output.txt containing:
abc 123 apple red 'link1'>zzz
abc 124 orange blue 'link1'>zzz
abc 125 almond black 'link1'>zzz
abc 126 mango pink 'link1'>zzz
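Update - A simplified version of this code runs on repl.it to show the correct output. The input comes from a string and the output is displayed. Just click the Run button.
Update - Updated to skip blank lines.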
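The slice `cols[1:-1:2]` in the script above works because splitting a line on the double quote alternates between separator text and field values. A small sketch of that idea, using plain `str.split` as a stand-in for `csv.reader` (an assumption made only for illustration):

```python
line = '''["abc","123","apple","red","<a href='link1'>zzz</a>"],'''

# Splitting on '"' alternates between separators and field values:
# '[', 'abc', ',', '123', ',', 'apple', ',', 'red', ',', "<a href=...>", '],'
pieces = line.split('"')

# Start at index 1, stop before the last piece, and take every second item.
fields = pieces[1:-1:2]
print(fields)
```

The same slice also works for the final line of the file, where the last piece is `]` rather than `],`.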
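There is no need to use a regex to remove the [].
Code:
import ast
with open("check.txt") as inp:
    for line in inp:
        # Strip whitespace and the trailing comma so each line parses as a list
        check = ast.literal_eval(line.strip().strip(","))
        print(" ".join(check))
Output: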
abc 123 apple red <a href='link1'</a>
abc 124 orange blue <a href='link2'</a>
abc 125 almond black <a href='link3'</a>
abc 126 mango pink <a href='link4'</a>
But to get only the value of the href, I used a regex.
Code 1:
import re
import ast
with open("check.txt") as inp:
    for line in inp:
        check = ast.literal_eval(line.strip().strip(","))
        # Capture the text between the first pair of single quotes (the href value)
        match = re.search("'([^']*?)'", check[4])
        if match:
            check[4] = match.group(1)
        print(" ".join(check))
Output:
abc 123 apple red link1
abc 124 orange blue link2
abc 125 almond black link3
abc 126 mango pink link4
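As per your requirement:
import re
a = "<a href='link1'>zzz</a>"
# Capture everything after the first single quote, up to the next '<'
print(re.search("'([^<]*?)<", a).group(1))
Output: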
link1'>zzz
Code 2:
import re
import ast
with open("check.txt") as inp:
    for line in inp:
        check = ast.literal_eval(line.strip().strip(","))
        # Search the last field of the current line, not the standalone variable a
        match = re.search("'([^<]*?)<", check[4])
        if match:
            check[4] = match.group(1)
        print(" ".join(check))
Since your data is a valid Python data structure, you can parse it with ast.literal_eval:
>>> import ast
>>> ast.literal_eval('''["abc","123","apple","red","<a href='link1'</a>"]''')
['abc', '123', 'apple', 'red', "<a href='link1'</a>"]
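Note that every line of the sample file except the last ends with a comma. Python reads a trailing comma after an expression as a one-element tuple, so the comma has to be stripped before indexing into the result. A short sketch of the difference (the variable name `with_comma` is only illustrative):

```python
import ast

with_comma = '''["abc","123"],'''

# With the trailing comma the result is a tuple wrapping the list.
as_tuple = ast.literal_eval(with_comma)
print(type(as_tuple).__name__)   # tuple

# Without it the result is the list itself.
as_list = ast.literal_eval(with_comma.rstrip(","))
print(type(as_list).__name__)    # list
```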
You can also cut the link out of the string by taking everything after the 9th character and up to the 5th character from the end:
>>> s = "<a href='link1'</a>"
>>> s[9:-5]
'link1'
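Putting it together:
import ast
with open(outfile, 'w') as output:
    with open(filename) as lines:
        for line in lines:
            # Remove the newline and trailing comma; otherwise literal_eval returns a tuple
            values = ast.literal_eval(line.strip().rstrip(','))
            # The slice assumes the "<a href='link1'</a>" form shown above
            values[4] = values[4][9:-5]
            output.write(' '.join(values) + '\n')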
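The fixed offsets in the slice above depend on the exact length of the text around the link, so they break as soon as the tag changes shape. As an alternative sketch (not part of the original answer), the standard library's `html.parser` can read the `href` attribute directly. This assumes the last field is a well-formed anchor tag such as `<a href='link1'>zzz</a>`; the names `HrefExtractor` and `first_href` are hypothetical.

```python
from html.parser import HTMLParser


class HrefExtractor(HTMLParser):
    """Collect the href value of every <a> start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def first_href(fragment):
    """Return the first href found in an HTML fragment, or None."""
    parser = HrefExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.hrefs[0] if parser.hrefs else None


print(first_href("<a href='link1'>zzz</a>"))
```

This returns only the attribute value, so it fits the case where the bare link is wanted rather than the `'link1'>zzz` form from the question.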
Each line can be processed as follows:
>>> line = ["abc","123","apple","red","<a href='link1'>zzz</a>"]
>>> ' '.join([k if 'href=' not in k else k[9:-4] for k in line])
"abc 123 apple red link1'>zzz"
Add square brackets around the file contents and you have valid JSON (an array of arrays):
import json
with open(filename) as lines:
    output = json.loads("[" + lines.read() + "]")
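Now you can process the lines, for example by removing the anchor around the link:
import re
for line in output:
    # Keep only the text between the single quotes, i.e. the href value
    line[4] = re.search(r"'([^']*)'", line[4]).group(1)
    print(" ".join(line))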
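One caveat with the JSON approach: unlike Python literals, JSON does not allow a trailing comma, so a file whose last line also ends in `,` would fail to parse. A minimal sketch that guards against this, with a hypothetical helper `load_rows` that takes the file contents as a string:

```python
import json


def load_rows(text):
    """Parse comma-separated JSON arrays, one per line, into a list of lists."""
    # Remove surrounding whitespace and a possible trailing comma
    # before wrapping the text in brackets.
    return json.loads("[" + text.strip().rstrip(",") + "]")


rows = load_rows('["a","b"],\n["c","d"],\n')
print(rows)
```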
How about this code:
from __future__ import print_function, unicode_literals
import ast
import io
import re
import traceback
input_str = """["abc","123","apple","red","<a href='link1'</a>"],
["abc","124","orange","blue","<a href='link2'</a>"],
["abc","125","almond","black","<a href='link3'</a>"],
["abc","126","mango","pink","<a href='link4'</a>"]"""
filelikeobj = io.StringIO(input_str)
for line in filelikeobj:
    line = line.strip().rstrip(",")
    if line:  # skip blank lines
        try:
            line_list = ast.literal_eval(line)
        except SyntaxError:
            # Report the malformed line and move on to the next one
            traceback.print_exc()
            continue
        for li in line_list[:-1]:
            print(li, end=" ")
        # Raw string so that \s reaches the regex engine unchanged
        s = re.search(r"href\s*=\s*['\"](.*)['\"]", line_list[-1], re.I)
        if s:
            print(s.group(1), end="")
        print()