Cleaning up a messy data file to a more readable format in Python?

I have a text file (heavily modified for this example) containing data that I want to extract and run some calculations on. However, the file is extremely messy, so I'm trying to clean it up and write the data out to new files first.

Here is the .txt file I'm working with: http://textuploader.com/5elql

I am trying to extract the data under the titles (called “Important title”). The only reliable way to do that seems to be to first locate a string that always occurs in the file, “DATASET”, because the mess above and below the important data spans an arbitrary number of lines and is difficult to remove manually. Once that's done, I want to store the data in separate files so that it is easier to analyse, like this:

http://textuploader.com/5elqw

Each file name will be the title concatenated with the date.

Here is what I have tried so far:

with open("example.txt") as file:
    for line in file:
        if line.startswith('DATASET:'):
            fileTitle = line[9:]
        if line.startswith("DATE:"):
            fileDate = line[:]
            print(fileTitle+fileDate)

OUTPUT

IMPORTANT TITLE 1
DATE: 12/30/2015

IMPORTANT TITLE 2
DATE: 01/03/2016

So my loop manages to locate the title lines in the file and print them out. But this is where I run out of steam: I have no idea how to extract the data under those titles from there onwards. I have tried using file.readlines(), but it outputs all the mess in between Important Title 1 and Important Title 2.

Any advice on how I can read all the data under the titles and output it into separate files? Thanks for your time.

You could use regex.

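import re

# A data line is either the "X Y" header or a row of two integers.
pattern = r"(\s+X\s+Y\s*)|(\s*\d+\s+\d+\s*)"
prog = re.compile(pattern)

with open("example.txt") as file:
    cur_filename = ''
    content_title = ''
    content = ''
    for line in file:
        if line.startswith('DATASET:'):
            fileTitle = line[9:]
        elif line.startswith("DATE:"):
            fileDate = line[6:]
            # '/' cannot appear in a file name, so replace it
            cur_filename = (fileTitle.strip() + fileDate.strip()).replace('/', '-')
            print(cur_filename)
            content_title = fileTitle + line
        elif cur_filename and prog.match(line):
            # Only collect data lines once a title and date have been seen
            content += line
        elif cur_filename and content:
            # First non-data line after a data block: write the block out
            with open(cur_filename, 'w') as fp:
                fp.write(content_title)
                fp.write(content)
            cur_filename = ''
            content = ''

# Write the last block if the file ends right after its data rows
if cur_filename and content:
    with open(cur_filename, 'w') as fp:
        fp.write(content_title)
        fp.write(content)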
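
To get a feel for which lines the pattern above treats as data, it can be tried on a handful of lines. This is only a sketch: the sample lines below are invented for illustration, since the linked file is not reproduced here and its real layout may differ.

```python
import re

prog = re.compile(r"(\s+X\s+Y\s*)|(\s*\d+\s+\d+\s*)")

# Hypothetical sample lines; the real file layout may differ.
samples = [
    "  X    Y\n",
    "1    10\n",
    "  23  456\n",
    "some unrelated text\n",
    "DATE: 12/30/2015\n",
]

# re.match only anchors at the start of the line
matches = [bool(prog.match(line)) for line in samples]
for line, ok in zip(samples, matches):
    print(repr(line), "->", "data" if ok else "skipped")
```

Two things are worth knowing about this pattern: the header alternative requires leading whitespace before the X, and because match() only anchors at the start, a line that begins with two numbers followed by other text would also count as data.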

I don't know exactly how you want to store your data, but assuming you want a dictionary, you could use a regex to check whether each incoming line matches the pattern, then use fileTitle (which still holds the most recent title at that point) as the key and append the values. I also added rstrip('\r\n') to remove the newline characters after fileTitle.

import re

# If you don't want to store the X and Y header, just use re.compile(r'\d+\s+\d+')
p = re.compile(r'(\d+\s+\d+)|(X\s+Y)')
data = {}
with open("input.txt") as file:
    for line in file:
        if line.startswith('DATASET:'):
            fileTitle = line[9:].rstrip('\r\n')
        if line.startswith("DATE:"):
            fileDate = line[:]
            print(fileTitle + fileDate)
        if p.match(line):
            if fileTitle not in data:
                data[fileTitle] = []
            row = line.rstrip('\r\n').split('\t')
            # Drop the third field when a row splits into three
            if len(row) == 3:
                row.pop()
            data[fileTitle].append(row)

print(data)
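
Since the stated goal is to run calculations on the extracted values, the string rows in such a dictionary would still need converting to numbers. A minimal sketch, assuming integer values and the title-to-rows shape built above; the sample dictionary and the to_numeric helper are hypothetical, not taken from the linked file:

```python
# Hypothetical result in the shape built above: title -> list of [x, y] string rows.
data = {
    "IMPORTANT TITLE 1": [["X", "Y"], ["1", "10"], ["2", "20"]],
    "IMPORTANT TITLE 2": [["X", "Y"], ["3", "30"]],
}

def to_numeric(rows):
    """Skip non-numeric rows (such as the X/Y header) and convert cells to int."""
    return [[int(cell) for cell in row] for row in rows if row[0].strip().isdigit()]

numeric = {title: to_numeric(rows) for title, rows in data.items()}

# Example calculation: sum of the Y column per dataset
y_sums = {title: sum(y for _, y in rows) for title, rows in numeric.items()}
print(y_sums)
```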

Yet another regex solution:

import re

sep = '*************************\n'

# Everything from "DATASET" up to (not including) the next '%' character
pattern = r'DATASET[^%]*'
good_stuff = re.compile(pattern)
pattern = r'^DATASET: (.*?)$'
title = re.compile(pattern, flags = re.MULTILINE)
pattern = r'^DATE: (.*?)$'
date = re.compile(pattern, flags = re.MULTILINE)

with open(r'foo.txt') as f:
    text = f.read()
for match in good_stuff.finditer(text):
    data = match.group()
    important_title = title.search(data).group(1)
    important_date = date.search(data).group(1)
    # '/' cannot appear in a file name, so replace it
    important_date = important_date.replace(r'/', '-')
    fname = important_title + important_date + '.txt'
    print(sep, fname)
    print(data)
    ## Uncomment to write each block to its own file:
    ##with open(fname, 'w') as out:
    ##    out.write(data)
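
The pattern DATASET[^%]* only works if a '%' character shows up somewhere after each data block, which is what stops the match. A small sketch of that behaviour on an in-memory string; the sample text is invented for illustration, and the underscore between title and date in the file name is an addition here for readability:

```python
import re

# Invented sample; it assumes, as the pattern does, that a '%'
# character appears somewhere after each data block.
text = (
    "%% noise line\n"
    "DATASET: IMPORTANT TITLE 1\n"
    "DATE: 12/30/2015\n"
    "X\tY\n"
    "1\t10\n"
    "%% more noise\n"
    "DATASET: IMPORTANT TITLE 2\n"
    "DATE: 01/03/2016\n"
    "X\tY\n"
    "3\t30\n"
)

good_stuff = re.compile(r"DATASET[^%]*")
title = re.compile(r"^DATASET: (.*?)$", flags=re.MULTILINE)
date = re.compile(r"^DATE: (.*?)$", flags=re.MULTILINE)

files = {}  # file name -> block of text that would be written to it
for match in good_stuff.finditer(text):
    block = match.group()
    day = date.search(block).group(1).replace("/", "-")
    name = title.search(block).group(1) + "_" + day + ".txt"
    files[name] = block

for name, block in files.items():
    print(name)
    print(block)
```

If the real file has no '%' after a block, the match runs on to the end of the file, so the delimiter would need adjusting to whatever actually follows the data.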
