Pulling data from a really large text file and exporting it into organized tabular form using Python

I'm taking my first programming class and I'm a little stuck on this. I know how to read lines and export data at a simpler level, but I haven't had to do it in a less straightforward way like this task requires.

I need a bit of help getting this going.

I'm supposed to read data from this file:

https://pastebin.com/ZM8EPu0p

and export it into a more readable format. An example of the expected output is here:

https://imgur.com/F0rlK2c

So far, this is the code I have written to read the text and split it into more usable chunks.

def readFile(filename):
        f = open(filename, "r") #opens the file in read mode
        mylist = f.read().splitlines() #puts the file into an array
        newlist= [word for line in mylist for word in line.split()] #line comprehension
        print(newlist) #print list
        
        f.close()
        return mylist
       
readFile("exactfilepath")

However, I am unsure how to extract the data I need (defendant name, file number, courtroom, attorney, bond, charge, etc.) and organize it into a tabular format as shown in the example output.

Sorry for the newbie question, I'm in the very early stages of learning Python.

From what I can tell in the screenshot you posted, the entries start with a 4-digit number. While you are inside such an entry, write its fields into a Python dict. When you reach an empty line, stop writing into the dict and put the dict's values into a CSV row.
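The grouping step can be sketched on its own. This is only an illustration: `group_entries` is a hypothetical helper, it collects each entry as a list of lines rather than a dict, and the sample lines are made up and do not reproduce the real file's layout.

```python
def group_entries(lines):
    """Group lines into entries. An entry starts at a line whose first
    four characters are digits and ends at the next blank line."""
    entries = []
    current = None  # None means we are not inside an entry
    for line in lines:
        starts_entry = len(line) >= 4 and line[:4].isdigit()
        if starts_entry:
            if current is not None:  # previous entry had no blank line after it
                entries.append(current)
            current = [line.strip()]
        elif not line.strip():  # blank line closes the current entry
            if current is not None:
                entries.append(current)
                current = None
        elif current is not None:  # continuation line of the current entry
            current.append(line.strip())
    if current is not None:  # file ended without a trailing blank line
        entries.append(current)
    return entries


# Made-up sample data, only to show the grouping behaviour.
sample = [
    "1234 DOE,JOHN\n",
    "  BOND: 500\n",
    "\n",
    "header text outside any entry\n",
    "5678 ROE,JANE\n",
    "  BOND: 1000\n",
]
for entry in group_entries(sample):
    print(entry)
```

Lines that fall outside any entry (page headers, for example) are skipped, and the last entry is kept even if the file does not end with a blank line.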

So I think you should take a different approach. I've written some code which may help, but you'll still need to fill in one part yourself (the TODO).

It's also good practice to use the with keyword when reading from a file. The file is closed automatically when the indented block ends, so you don't need file.close(), as long as all code that uses the file sits inside that block. You can see it in action below. Also, you should use 4 spaces, not 8, for each indentation level in Python.
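A minimal, self-contained sketch of the automatic closing. The temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

# Create a small throwaway file so the example can run anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

# Everything that needs the open file goes inside the with block.
with open(path, "r") as f:
    lines = f.read().splitlines()

# Outside the block the file is already closed; no f.close() was needed.
print(f.closed)
print(lines)

os.remove(path)  # clean up the throwaway file
```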

def readFile(filename):
    csv_rows = []  # each finished entry is appended here as one comma-separated string
    entry = {}  # attributes of the entry currently being read
    with open(filename, 'r') as file:
        for original_line in file:
            if original_line[:4].isnumeric():  # first 4 characters are digits: a new entry starts
                entry = {}  # we initialize an empty dictionary
            elif original_line.isspace() and entry:  # we're on an empty line and the entry dict is not empty
                csv_rows.append(",".join(entry.values()))  # store the entry as comma-separated values
                entry = {}
            else:
                # read attributes into dict
                pairs = []  # TODO: split the data on this line into strings of the form "key: value"
                for pair in pairs:
                    key, _, value = pair.partition(":")  # split on the first colon only
                    entry[key.strip()] = value.strip()
    if entry:  # the file may not end with an empty line, so keep the last entry too
        csv_rows.append(",".join(entry.values()))
    return csv_rows

readFile("exactfilepath")
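For the export step, one option is the standard library's csv module instead of joining values by hand, because it quotes values that themselves contain commas (names written as "LAST,FIRST", for instance). The sketch below assumes each entry is available as a dict. `write_table`, the column names, and the sample values are all hypothetical and chosen only for illustration.

```python
import csv
import io


def write_table(entries, fieldnames, out):
    """Write a header row, then one CSV row per entry dict."""
    # restval fills in columns that a particular entry does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(entries)


# Made-up entries; the real column names depend on the input file.
entries = [
    {"name": "DOE,JOHN", "bond": "500"},
    {"name": "ROE,JANE", "bond": "1000", "attorney": "SMITH"},
]

buffer = io.StringIO()  # stands in for a real output file
write_table(entries, ["name", "attorney", "bond"], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer; the csv documentation recommends newline="" so that line endings are not translated twice.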
