
How to parse a complex text file using Python string methods or regex and export into tabular form

As the title says, I don't quite understand how to extract the data I need for my table. The columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc. I think regex is what I need, but my class did not cover it, so I am confused about how to parse the file and output the correct data into an organized table.

I am supposed to turn my text file from this

https://pastebin.com/ZM8EPu0p

and export it into a more readable format like this (example output below):

https://imgur.com/F0rlK2c

Here is what I have so far.

```python
def readFile(court):
    csv_rows = []
    # Read the text file and split it into chunks of data by paragraph (blank line)
    with open(court, 'r') as file:
        data_chunks = file.read().split("\n\n")

    for chunk in data_chunks:
        chunk = chunk.strip()  # .strip() removes surrounding whitespace
        if not chunk:
            continue  # skip chunks that are empty after stripping
        if chunk[:4].isnumeric():  # if the first 4 characters are digits
            entry = {}  # initialize an empty dictionary for this entry

            # parse here?

            csv_rows.append(entry)
        else:
            print(chunk)  # not an entry; print it to inspect what it contains

    return csv_rows


readFile(exactfilepath)  # exactfilepath is the path to the text file
```

It is quite a lot of work, but it is possible if you split it into a couple of sub-tasks. First, your input looks like a text file, so you could parse it line by line using readlines(): https://www.w3schools.com/python/ref_file_readlines.asp
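A minimal sketch of reading the file line by line. The function name `read_lines` and the UTF-8 encoding are assumptions; adjust the encoding if your file uses a different one.

```python
def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    # encoding="utf-8" is an assumption about the input file
    with open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle.readlines()]
```

Blank lines are kept as empty strings, so they can still be used as separators later on.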

Then, I noticed that your data can be split into pages. You will need to prepare a lot of regular expressions, but you can start with one that identifies where each page starts. You may want to read this first, as your expressions might get quite complicated: https://www.w3schools.com/python/python_regex.asp The goal of this step is to collect all the lines from a page in some container (a list, a dict, whatever you find suitable).
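Grouping lines into pages could look roughly like this. The pattern `PAGE_START` is hypothetical: it assumes each page begins with a line such as "PAGE 1", which may not match your file, so replace it with whatever actually marks a new page in your data.

```python
import re

# Hypothetical pattern: assumes a page starts with a line like "PAGE 1".
PAGE_START = re.compile(r"^\s*PAGE\s+\d+", re.IGNORECASE)


def split_pages(lines):
    """Group a list of lines into a list of pages (each page is a list of lines)."""
    pages = []
    current = []
    for line in lines:
        # A page-start line closes the page collected so far, if there is one
        if PAGE_START.match(line) and current:
            pages.append(current)
            current = []
        current.append(line)
    if current:
        pages.append(current)  # keep the last page
    return pages
```

Any lines that appear before the first page marker end up in their own leading group, which makes it easy to inspect or discard them.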

Afterwards, write some code that parses the information page by page. For simplicity, I suggest starting with something easy, like the columns for "no, file number and defendant".
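As a starting point for those three columns, something like the sketch below might work. The line layout is an assumption (a running number, then a file number without spaces, then the defendant name), and `ENTRY_LINE` and `parse_entries` are hypothetical names; check the pattern against real lines from your file before relying on it.

```python
import re

# Hypothetical layout: "<no>  <file number>  <defendant name>"
ENTRY_LINE = re.compile(
    r"^\s*(?P<no>\d+)\s+(?P<file_number>\S+)\s+(?P<defendant>.+?)\s*$"
)


def parse_entries(page_lines):
    """Return one dict per line that looks like an entry; skip all other lines."""
    rows = []
    for line in page_lines:
        match = ENTRY_LINE.match(line)
        if match:
            rows.append(match.groupdict())
    return rows
```

Lines that do not start with a number are simply skipped, so headers and other text do not need to be removed first.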

Once you are getting data reliably, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
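If pandas is not installed, the standard-library `csv` module is a lighter alternative that produces a file Excel can open. This is a sketch of that swapped-in approach, not the pandas route itself; `write_table` and the column names are hypothetical, and it assumes the parsed data is a list of dictionaries.

```python
import csv
import io


def write_table(rows, handle, fieldnames):
    """Write a list of dicts as CSV (Excel dialect) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, dialect="excel")
    writer.writeheader()
    writer.writerows(rows)


# Example with made-up data; to write a real file, open it with
# open(path, "w", newline="", encoding="utf-8") and pass that handle instead.
example_rows = [{"no": "1", "file_number": "X1", "defendant": "DOE,JOHN"}]
buffer = io.StringIO()
write_table(example_rows, buffer, ["no", "file_number", "defendant"])
print(buffer.getvalue())
```

With pandas, the same list of dictionaries can be exported with `pandas.DataFrame(example_rows).to_excel("output.xlsx", index=False)`, which needs an Excel engine such as openpyxl to be installed.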
