
Removing special characters in Python

I am trying to remove special characters from a log file. These are two example rows:

2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;
2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials

This is the output after removing the special characters:

2016.04.03  23  54  48.957  213.210.213.316  PDL3_SGW2  5F6DB03A    093E    0D414D9C    1   1   userId  1000    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   250                                                                     

2016.04.03  23  54  48.958  781.69.243.363  PDL3_SGW2   userId  1001    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   1   0xDC40001   Invalid credentials                                                                             

As you can see in the second row of the output, "userId" ends up in column[6] instead of column[11], because the data for column[6] to column[10] is missing in the log file. I want to handle this and write out all the columns even when there is no data in the log file.

The output should be as follows:

2016.04.03  23  54  48.957  213.210.213.316  PDL3_SGW2  5F6DB03A    093E    0D414D9C    1   1   userId  1000    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   250                                                                     

2016.04.03  23  54  48.958  781.69.243.363  PDL3_SGW2                                           userId  1001    http    live.skysat.tv  cmdc    services    region  25351   lang    swe count   250 sort    2blogicalChannelNumber  101 0   1   0xDC40001   Invalid credentials                                                                             

This is the relevant part of my code:

new_str = re.sub(r'[- - [ " / : ; & ? = % ~ + \n \]]', ' ', line)
text = new_str.rstrip().split()
writer.writerow(text)

This works for the two lines that you posted:

import re

lines = ["2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;",
         "2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials"]

def adjust_columns(list_of_lines):
    # Pad every cell to the width of the widest entry in its column.
    widest = [max(len(el) for el in column) for column in zip(*list_of_lines)]
    return [" ".join("{:<{}}".format(e, widest[i])
                     for i, e in enumerate(line)) for line in list_of_lines]

r = re.compile('[ /:;&?=%~+-]')
# Split each line into fields at ';:;', then each field into subfields.
list_of_lines = [[r.split(el) for el in line.split(';:;')] for line in lines]
# If every field in a column has the same number of subfields, align them;
# otherwise just join the subfields with a single space.
list_of_columns = [adjust_columns(col)
                   if all(len(el) == len(col[0]) for el in col)
                   else [" ".join(el) for el in col]
                   for col in zip(*list_of_lines)]
text = "\n".join(adjust_columns(list(zip(*list_of_columns))))
print(text)

This assumes that ;:; is always the delimiter for the fields. The code splits each line into fields, and each field is then split again at the special characters. If every field in a column contains the same number of special characters, the subfields in that column are padded to a common width and joined by whitespace. The last step is to adjust the width of each column.

One drawback is that you can no longer process the input line by line, because you have to find the longest entry in each column first.
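If aligned widths are not actually required, for example when the rows go through a csv writer as in the question, the input can still be handled one line at a time. The following is only a sketch under that assumption: the helper name clean_line, the tab delimiter, and the io.StringIO buffer standing in for the output file are illustrative and not part of the original code.

```python
import csv
import io
import re

# Runs of special characters inside a field are replaced by one space.
SPECIAL = re.compile(r'[ /:;&?=%~+-]+')

def clean_line(line):
    """Split on the ';:;' field delimiter first, then clean each field.

    Empty fields stay in the result as empty strings, so later
    fields keep their column position.
    """
    fields = line.rstrip('\n').split(';:;')
    return [SPECIAL.sub(' ', field).strip() for field in fields]

# Shortened version of the second example row, for illustration only.
lines = [
    "2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001",
]

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer, delimiter='\t')
for line in lines:
    writer.writerow(clean_line(line))

print(buffer.getvalue())
```

Because each row is written as soon as it is read, this variant does not need the whole file in memory, at the cost of not padding the columns to equal width.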

If you do not need the subfields to be adjusted (as in your example), you can use this simpler code:

r = re.compile('[ /:;&?=%~+-]')
list_of_lines = [[" ".join(r.split(el)) for el in line.split(';:;')] for line in lines]
text = "\n".join(adjust_columns(list_of_lines))
>>> from pprint import pprint

Let's simulate the data file using a list of strings...

>>> lines = [
    '2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;',
    '2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials']

From the official docs, you can use the string method S.split(sep), which returns a list of the words in S, using sep as the delimiter string (emphasis is mine).

In your case the delimiter is the string ';:;', so you can do

>>> data = [line.split(';:;') for line in lines]

data is now a list of lists; each sublist contains empty strings for the fields that are missing in your file.

>>> pprint(data)
[['2016.04.03 23:54:28.257',
  '213.210.213.316',
  'PDL3_SGW2',
  '5F6DBA-093E-0D4D9C-00000001-01',
  'userId',
  '',
  '1000',
  'http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber',
  '101',
  '0',
  '250',
  '',
  ''],
 ['2016.04.03 23:54:28.258',
  '781.69.243.363',
  'PDL3_SGW2',
  '',
  'userId',
  '',
  '1001',
  'http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber',
  '101',
  '0',
  '1',
  '0x40001',
  'Invalid credentials']]
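To see why the empty fields survive here but were lost in the question's code, compare split() without arguments against split(sep). This is a small illustration on a shortened fragment of the second row; the variable names are made up for the example.

```python
line = "PDL3_SGW2;:;;:;userId;:;;:;1001"

# Turning the delimiter into spaces and calling split() without arguments
# collapses runs of whitespace, so the empty fields disappear.
collapsed = line.replace(';:;', ' ').split()
print(collapsed)   # ['PDL3_SGW2', 'userId', '1001']

# Splitting on the explicit delimiter keeps one entry per field,
# with '' marking each missing value.
preserved = line.split(';:;')
print(preserved)   # ['PDL3_SGW2', '', 'userId', '', '1001']
```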

You can then loop over data and output each set of fields in whatever way you like best, e.g.,

>>> for record in data: output(record)
>>>

and that's all.

P.S. output() is a function that you have to define according to your needs.
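As one possible shape for output(), here is a minimal sketch that writes each record as a tab-separated line. The tab separator and the optional stream parameter are assumptions made for illustration; the original answer leaves the function entirely up to you.

```python
import sys

def output(record, stream=sys.stdout):
    """Write one record as a tab-separated line.

    Empty strings are written as empty cells, so every field
    keeps its column position.
    """
    stream.write('\t'.join(record) + '\n')

# Shortened record, for illustration only.
data = [['PDL3_SGW2', '', 'userId', '', '1001']]
for record in data:
    output(record)
```

Passing an open file object as stream would send the same lines to a file instead of the console.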
