Formatting a text file in Python

Question

Sample Text File:

["abc","123","apple","red","<a href='link1'>zzz</a>"],

["abc","124","orange","blue","<a href='link1'>zzz</a>"],

["abc","125","almond","black","<a href='link1'>zzz</a>"],

["abc","126","mango","pink","<a href='link1'>zzz</a>"]

Expected Output:

abc 123 apple red 'link1'>zzz

abc 124 orange blue 'link1'>zzz

abc 125 almond black 'link1'>zzz

abc 126 mango pink 'link1'>zzz

I just want the file to be free of square brackets, with the commas replaced by spaces, and to keep only the link from the last element of each line.

I tried using lists in Python.

I don't know how to proceed; I guess I am going wrong somewhere. Help would be appreciated. Thanks in advance :)

import sys
import re

Lines = [Line.strip() for Line in open (sys.argv[1],'r').readlines()]



for EachLine in Lines:
    Parts = EachLine.split(",")
    for EachPart in Parts:

        EachPart = re.sub(r'[', '', EachPart)
        EachPart = re.sub(r']', '', EachPart)

Answer 1

If you plan to remove [ and ] with a regex, you need to escape the square brackets to match them as literal symbols. They are "special" regex characters that mark the boundaries of a character class and therefore need special treatment.

Here is a sample regex replacement:

EachPart = re.sub(r'[\[\]]', '', EachPart)


However, you can remove them with str.replace(old, new[, count]), which does not require a regex:

EachPart = EachPart.replace('[', '').replace(']', '')
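
To see the difference concretely, here is a minimal sketch. The helper names and the sample string are made up for illustration. It checks that a lone `[` is rejected as a malformed pattern, while the escaped character class and the chained `str.replace` calls produce the same result.

```python
import re

def strip_brackets_regex(text):
    # [\[\]] is a character class holding the two escaped bracket characters.
    return re.sub(r'[\[\]]', '', text)

def strip_brackets_replace(text):
    # Plain string replacement; no regex engine involved.
    return text.replace('[', '').replace(']', '')

def is_valid_pattern(pattern):
    # re.compile raises re.error for a malformed pattern such as a lone '['.
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

sample = '["abc"]'  # illustrative input only
print(strip_brackets_regex(sample))
print(strip_brackets_replace(sample))
print(is_valid_pattern(r'['))
```

For fixed single characters like these, the `str.replace` version is usually the simpler choice.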


Answer 2

This could be done using the following script:

import csv
import re

with open('input.txt', 'r') as f_input, open('output.txt', 'w') as f_output:
    csv_input = csv.reader(f_input, delimiter='"')
    for cols in csv_input:
        if cols:
            cols = [x for x in cols[1:-1:2]]
            link = re.search(r"('.*?)<", cols[-1])
            if link:
                cols[-1] = link.group(1)

            f_output.write('{}\n'.format(' '.join(cols)))

This will give you output.txt containing:

abc 123 apple red 'link1'>zzz
abc 124 orange blue 'link1'>zzz
abc 125 almond black 'link1'>zzz
abc 126 mango pink 'link1'>zzz

Update - There is a simplified version of this code running on repl.it to show the correct output. Input comes from a string, and output is displayed. Just click the Run button.

Update - Updated to skip blank lines.
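
The link-extraction step of the script can be tried in isolation. The sketch below wraps the same pattern, `('.*?)<`, in a hypothetical helper named `extract_link`. Returning the cell unchanged when nothing matches is an assumption of this sketch, mirroring the `if link:` guard above.

```python
import re

def extract_link(cell):
    # Capture from the first single quote up to (not including) the next '<'.
    match = re.search(r"('.*?)<", cell)
    return match.group(1) if match else cell

print(extract_link("<a href='link1'>zzz</a>"))  # 'link1'>zzz
print(extract_link("abc"))                      # abc
```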

Answer 3

There is no need to use a regex to remove the square brackets.

Code:

import ast

with open("check.txt") as inp:
    for line in inp:
        # Drop the newline and the trailing comma, then parse the list literal.
        check = ast.literal_eval(line.strip().strip(","))
        print(" ".join(check))

Output:

abc 123 apple red <a href='link1'</a>
abc 124 orange blue <a href='link2'</a>
abc 125 almond black <a href='link3'</a>
abc 126 mango pink <a href='link4'</a>
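
The comma stripping matters because a trailing comma changes what the literal means. A minimal sketch, using a hypothetical `parse_line` helper and one line from the question's sample (it uses `rstrip(",")`, which only touches the right-hand end):

```python
import ast

def parse_line(line):
    # Remove surrounding whitespace, then any trailing comma, before evaluating.
    return ast.literal_eval(line.strip().rstrip(","))

raw = '["abc","123","apple","red","<a href=\'link1\'>zzz</a>"],\n'
print(parse_line(raw))

# With the comma left in place, the text is a one-element tuple holding the list.
print(type(ast.literal_eval(raw.strip())).__name__)  # tuple
```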

But to get only the value of href, I used a regex.

Code1:

import re
import ast

with open("check.txt") as inp:
    for line in inp:
        check = ast.literal_eval(line.strip().strip(","))
        # Search once and reuse the match object.
        match = re.search("'([^']*?)'", check[4])
        if match:
            check[4] = match.group(1)
        print(" ".join(check))

Output:

abc 123 apple red link1
abc 124 orange blue link2
abc 125 almond black link3
abc 126 mango pink link4

As per your requirement:

a = "<a href='link1'>zzz</a>"
print(re.search("'([^<]*?)<", a).group(1))

Output:

link1'>zzz

Code2:

import re
import ast

with open("check.txt") as inp:
    for line in inp:
        check = ast.literal_eval(line.strip().strip(","))
        # Search the last field, which holds the anchor tag.
        match = re.search("'([^<]*?)<", check[4])
        if match:
            check[4] = match.group(1)
        print(" ".join(check))
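
The sample file in the question has blank lines between rows, and `ast.literal_eval("")` raises `SyntaxError`, so a loop like the one above may need to skip them. A minimal sketch under that assumption follows. `format_lines` and `LINK_PATTERN` are hypothetical names, and the input is an in-memory list rather than check.txt.

```python
import ast
import re

LINK_PATTERN = re.compile(r"'([^<]*?)<")  # same pattern as in Code2

def format_lines(lines):
    for line in lines:
        line = line.strip().rstrip(",")
        if not line:
            continue  # skip blank lines instead of passing "" to literal_eval
        fields = ast.literal_eval(line)
        match = LINK_PATTERN.search(fields[-1])
        if match:
            fields[-1] = match.group(1)
        yield " ".join(fields)

sample = [
    '["abc","123","apple","red","<a href=\'link1\'>zzz</a>"],\n',
    '\n',
    '["abc","124","orange","blue","<a href=\'link1\'>zzz</a>"]\n',
]
for formatted in format_lines(sample):
    print(formatted)
```

Note that this pattern drops the leading quote, so the last field comes out as link1'>zzz rather than 'link1'>zzz.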

Answer 4

Since your data consists of valid Python data structures, you can read it in using ast.literal_eval:

>>> import ast
>>> ast.literal_eval('''["abc","123","apple","red","<a href='link1'</a>"]''')
['abc', '123', 'apple', 'red', "<a href='link1'</a>"]

You can also slice the link out of the string by taking everything after the 9th character, up to the 5th-to-last character:

>>> s = "<a href='link1'</a>"
>>> s[9:-5]
'link1'
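
Fixed offsets depend on the exact shape of the tag. As a hedged alternative (a different technique from the slice above, with a hypothetical helper name), splitting on the single quote picks out the href value for both forms of the tag that appear on this page:

```python
def href_by_split(tag):
    # The href value is whatever sits between the first and second single quote.
    parts = tag.split("'")
    return parts[1] if len(parts) >= 3 else None

print(href_by_split("<a href='link1'</a>"))      # link1
print(href_by_split("<a href='link1'>zzz</a>"))  # link1

# The fixed slice, applied to the longer tag, keeps part of the tail.
print("<a href='link1'>zzz</a>"[9:-5])           # link1'>zz
```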

Putting it together:

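import ast

# outfile and filename are the output and input paths.
with open(outfile, 'w') as output, open(filename) as lines:
    for line in lines:
        line = line.strip().rstrip(',')  # a trailing comma would yield a tuple
        if not line:
            continue  # skip blank lines
        values = ast.literal_eval(line)
        values[4] = values[4][9:-5]
        output.write(' '.join(values) + '\n')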

Answer 5

Each line may be processed as follows:

>>> line = ["abc","123","apple","red","<a href='link1'>zzz</a>"]

>>> ' '.join([k if 'href=' not in k else k[9:-4] for k in line])
"abc 123 apple red link1'>zzz"

Answer 6

Add brackets around the file's contents and you have a valid JSON array:

import json
with open(filename) as lines:
    output = json.loads("[" + lines.read() + "]")

Now you can process the lines, for example removing the anchor around the link:

import re

for line in output:
    # Keep only the text between the first pair of single quotes.
    line[4] = re.search(r"'([^']*)'", line[4]).group(1)
    print(" ".join(line))
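
The wrapping step can be checked without a file. The sketch below uses two rows from the question as an in-memory string; the variable names are illustrative. It relies on the data using double-quoted strings and having no comma after the last row, since JSON does not allow a trailing comma.

```python
import json

text = '''["abc","123","apple","red","<a href='link1'>zzz</a>"],

["abc","124","orange","blue","<a href='link1'>zzz</a>"]'''

# Wrapping the rows in [ ] turns them into one JSON array of arrays.
rows = json.loads("[" + text + "]")
print(len(rows))   # 2
print(rows[1][2])  # orange
```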

Answer 7

What about this code?

from __future__ import print_function, unicode_literals
import ast
import io
import re
import traceback

input_str = """["abc","123","apple","red","<a href='link1'</a>"],

["abc","124","orange","blue","<a href='link2'</a>"],

["abc","125","almond","black","<a href='link3'</a>"],

["abc","126","mango","pink","<a href='link4'</a>"]"""

filelikeobj = io.StringIO(input_str)

for line in filelikeobj:
    line = line.strip().rstrip(",")
    if line:
        try:
            line_list = ast.literal_eval(line)
        except SyntaxError:
            traceback.print_exc()
            continue
        for li in line_list[:-1]:
            print(li, end=" ")

        s = re.search(r"href\s*=\s*['\"](.*)['\"]", line_list[-1], re.I)
        if s:
            print(s.group(1), end="")
        print()
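
One thing to watch with the `(.*)` group is that it is greedy. It works on the tags above because they contain exactly one quoted value. The sketch below uses a made-up tag with a second attribute to show the difference from a negated character class. The pattern names and the `class` attribute are assumptions for illustration only.

```python
import re

GREEDY = re.compile(r"href\s*=\s*['\"](.*)['\"]", re.I)
NEGATED = re.compile(r"href\s*=\s*['\"]([^'\"]*)['\"]", re.I)

tag = "<a href='link1' class='x'>zzz</a>"  # hypothetical tag with two attributes

print(GREEDY.search(tag).group(1))   # link1' class='x
print(NEGATED.search(tag).group(1))  # link1
```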

7 solutions

Solution 1: 2, 2015-09-09 07:40:24
Solution 2: 2, accepted, 2015-09-09 07:49:37
Solution 3: 1, 2015-09-09 07:44:02
Solution 4: 1, 2015-09-09 07:45:12
Solution 5: 1, 2015-09-09 07:48:51
Solution 6: 0, 2015-09-09 07:54:31
Solution 7: 0, 2015-09-09 08:07:08
