
I need to extract data from multiple .txt files and move them to an Excel file, using Python

Each .txt file contains 68 lines. Line 68 has 5 pieces of data that I need to extract, but I have no idea how. I have about 20 .txt files, and line 68 needs to be read from each of them. All of the extracted data should end up in a single Excel file.

Here is what line 68 looks like:

Final graph has 1496 nodes and n50 of 53706, max 306216, total 5252643, using 384548/389191 reads

I basically need all those numbers.

Use the following to open the text file:

with open('filepath.txt', 'r') as f:
    for line in f:
        pass  # do operations for each line in the text file

Repeat for each text file you want to read.
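Since only line 68 matters here, the per-file step can be sketched as a small helper that skips straight to that line. The function name `read_line` and its default line number are placeholders chosen for illustration, not something from the question.

```python
from itertools import islice


def read_line(path, line_number=68):
    """Return line `line_number` (1-based) of the file at `path`.

    Returns None if the file has fewer lines than that.
    """
    with open(path, "r") as f:
        # islice skips the first line_number - 1 lines without
        # reading the whole file into memory
        return next(islice(f, line_number - 1, line_number), None)


def read_lines_from_files(paths, line_number=68):
    """Map each path to its requested line (or None if the file is too short)."""
    return {path: read_line(path, line_number) for path in paths}
```

Calling `read_lines_from_files` with a list of your roughly 20 file paths would give one entry per file, ready for parsing.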

There are Python libraries for reading from and writing to Excel files. It sounds like xlwt is the one you want.

I like to use openpyxl for tasks like this. Below is an example for one file. You should be able to extend this to multiple files. You didn't say exactly how you wanted to format the data in the spreadsheet, so I just created one row of headers, followed by one row of data (5 fields) for the file. This could be refined if I had more information about your project.

from openpyxl import Workbook
import re

wb = Workbook()
ws = wb.active  # replaces the older get_active_sheet()

# write column headers (openpyxl rows and columns are 1-based)
ws.cell(row=1, column=1).value = 'nodes'
ws.cell(row=1, column=2).value = 'n50'
ws.cell(row=1, column=3).value = 'max'
ws.cell(row=1, column=4).value = 'total'
ws.cell(row=1, column=5).value = 'reads'

# open file and extract lines into list
with open("somedata.txt", "r") as f:
    lines = f.readlines()

# compile regex using named groups and apply regex to line 68
p = re.compile(r"^Final\sgraph\shas\s(?P<nodes>\d+)\snodes\sand\sn50\sof\s(?P<n50>\d+),\smax\s(?P<max>\d+),\stotal\s(?P<total>\d+),\susing\s(?P<reads>\d+/\d+)\sreads$")
m = p.match(lines[67])

# if we have a match, then write the data to the spreadsheet
# (int() makes Excel store the values as numbers; 'reads' stays text)
if m:
    ws.cell(row=2, column=1).value = int(m.group('nodes'))
    ws.cell(row=2, column=2).value = int(m.group('n50'))
    ws.cell(row=2, column=3).value = int(m.group('max'))
    ws.cell(row=2, column=4).value = int(m.group('total'))
    ws.cell(row=2, column=5).value = m.group('reads')

wb.save('mydata.xlsx')
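One way to extend this to multiple files is to move the matching step into a function and build one row per file. The sketch below uses only the standard library; the names `parse_summary` and `collect_rows` are made up for illustration, and putting the file path in the first column is an assumption about the layout you want.

```python
import re

# Same named groups as the single-file example, written as a raw string
PATTERN = re.compile(
    r"^Final graph has (?P<nodes>\d+) nodes and n50 of (?P<n50>\d+), "
    r"max (?P<max>\d+), total (?P<total>\d+), "
    r"using (?P<reads>\d+/\d+) reads$"
)
FIELDS = ["nodes", "n50", "max", "total", "reads"]


def parse_summary(line):
    """Return the five fields as a list of strings, or None if the line does not match."""
    m = PATTERN.match(line.strip())
    if m is None:
        return None
    return [m.group(name) for name in FIELDS]


def collect_rows(paths, line_number=68):
    """Build one [path, nodes, n50, max, total, reads] row per matching file."""
    rows = []
    for path in paths:
        with open(path, "r") as f:
            lines = f.readlines()
        if len(lines) < line_number:
            continue  # file too short, skip it
        fields = parse_summary(lines[line_number - 1])
        if fields is not None:
            rows.append([path] + fields)
    return rows
```

Each row returned by `collect_rows` could then be written with `ws.append(row)`, which adds it as the next row of the worksheet.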

The following is somewhat less elegant but more transparent than David's, which relies on regex. It relies strongly on the particular formatting you've described. Also, it seems to me that there are actually 6 (not 5) variables you care about -- unless you can convert the ratio in reads into a decimal fraction.
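If a single value is preferred for the reads field, the ratio can be turned into a decimal fraction along these lines. This is a minimal sketch; `reads_fraction` is a hypothetical helper name, and it assumes the field always has the form used/total.

```python
def reads_fraction(reads):
    """Convert a 'used/total' string such as '384548/389191' into a float."""
    used, total = (int(part) for part in reads.split("/"))
    return used / total
```

With that helper the output would have 5 columns rather than 6, at the cost of losing the raw read counts.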

You'll need to provide the correct list of file names in nameList (manually, if they aren't named in a convenient way).
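If the files do happen to share a name and sit under one parent directory, the list could be built automatically instead. The sketch below assumes every file is called Log.txt, as in the example paths; `find_logs` and its `root` argument are hypothetical.

```python
from pathlib import Path


def find_logs(root, filename="Log.txt"):
    """Return the sorted paths (as strings) of every `filename` found under `root`."""
    # rglob searches root and all of its subdirectories
    return sorted(str(p) for p in Path(root).rglob(filename))
```

The result of `find_logs` can be assigned directly to nameList.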

Also, I output to CSV rather than to an Excel file. It's straightforward to open a CSV file in Excel and save it as .xls from there.

Edit in response to comment (05/19/13): including the full path is straightforward.

import csv
import string

# Make list of all 20 files like so:
nameList = ['/full/path/to/Log.txt', '/different/path/to/Log.txt', '/yet/another/path/to/Log.txt']

lineNum = 68

myCols = ['nodes','n50','max','total','reads1','reads2']
myData = []

for name in nameList:
    with open(name, "r") as fi:
        # split line lineNum into list of strings
        strings = fi.readlines()[lineNum-1].split()

    # remove punctuation appropriately
    # (str.strip replaces the Python 2-only string.maketrans/translate idiom)
    nodes = int(strings[3])
    n50 = int(strings[8].strip(string.punctuation))
    myMax = int(strings[10].strip(string.punctuation))
    total = int(strings[12].strip(string.punctuation))
    reads1 = int(strings[14].split('/')[0])
    reads2 = int(strings[14].split('/')[1])

    myData.append([nodes, n50, myMax, total, reads1, reads2])

# Write the data out to a new csv file
# (newline="" stops the csv module from adding blank rows on Windows)
fileOut = "out.csv"
with open(fileOut, "w", newline="") as csvFileOut:
    myWriter = csv.writer(csvFileOut)
    myWriter.writerow(myCols)
    for line in myData:
        myWriter.writerow(line)
