
How to read multiple text files so that files in the same group are read together?

I have several text files in my directory like this,

id-2020-01-21-22.txt
id-2020-01-21-23.txt
id-2020-01-22-00.txt
id-2020-01-22-01.txt
id-2020-01-22-02.txt
id-2020-01-23-00.txt
id-2020-01-24-00.txt

How can I read id-2020-01-21-22.txt and id-2020-01-21-23.txt together first, make them into a data frame and write it to a combined text file, then do the same for id-2020-01-22-00.txt, id-2020-01-22-01.txt and id-2020-01-22-02.txt together, and so on until the last file in the directory?

The inner structure of every text file looks like this:

100232323\n
903812398\n
284934289\n
{empty line placeholder}

There is no header, but each text file has an empty line at the end. I am new to Python and would appreciate any help.

This is how far I have gotten:

import os

new_list = []
for root, dirs, files in os.walk('./textFilesFolder'):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r') as f:
                text = f.read()
                new_list.append(text)


print(new_list)

You want daily summaries where you concatenate the hourly files together. OK, good.

Create a Ymd date regex:

import re

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')
prev_date = None

Now in your loop you can replace the existing if with:

        m = date_re.search(file)
        if m:
            date = m.group(1)
            print(f'Working on day {date} ...')
            ...
            prev_date = date

Having parsed out the date, you can now notice when it changes, perhaps by checking whether prev_date != date, and take appropriate action, like writing out to a new file.
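As a rough sketch of that change-detection idea, the snippet below walks a sorted list of filenames and starts a new group whenever the parsed date differs from prev_date. The function name split_into_days and the sample list are made up for illustration; only date_re and prev_date come from the answer above.

```python
import re

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')

def split_into_days(filenames):
    """Return a list of (date, [filenames]) pairs, one pair per day.

    Hypothetical helper, not part of the original answer.
    """
    days = []
    prev_date = None
    for name in sorted(filenames):   # sorting keeps each day's hours adjacent
        m = date_re.search(name)
        if not m:
            continue                 # skip anything that is not an hourly file
        date = m.group(1)
        if date != prev_date:
            days.append((date, []))  # the date changed, so start a new group
        days[-1][1].append(name)
        prev_date = date
    return days

# Sample names taken from the question
names = ['id-2020-01-21-22.txt', 'id-2020-01-21-23.txt', 'id-2020-01-22-00.txt']
for date, group in split_into_days(names):
    print(date, group)
```

Sorting first matters here: neither os.walk() nor os.listdir() promises any particular order, and the comparison with prev_date only works when all files of one day arrive next to each other.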

Or consider using with open(f'output-{date}.txt', 'a') as fout: to let you append to a (potentially already existing) file. That way the filesystem is remembering things for you, rather than needing to keep track of more variables in your program.

BTW, your use of walk() is perfectly nice, kudos on that. But for this directory of files, the structure is simple enough that you could use glob:

import glob

new_list = []
for file in glob.glob('id-*.txt'):
    ...
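A slightly fuller sketch of the glob approach, in case it helps. It assumes the files live in a folder such as the question's textFilesFolder; the helper name hourly_files is invented here. Two details are worth noting: glob.glob() returns paths in arbitrary order, so the result is sorted, and when the pattern includes a folder the returned paths include it too, so os.path.basename() is needed before matching a regex anchored with ^.

```python
import glob
import os

def hourly_files(folder):
    """Return the id-*.txt paths inside `folder`, sorted by name.

    Hypothetical helper; `folder` is whatever directory holds the files.
    """
    pattern = os.path.join(folder, 'id-*.txt')
    return sorted(glob.glob(pattern))

# An empty result (for example, a missing folder) simply skips the loop.
for path in hourly_files('textFilesFolder'):
    name = os.path.basename(path)  # e.g. 'id-2020-01-21-22.txt'
    print(name)
```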

EDIT

Suppose we start with a clear slate, no output files:

$ rm output-*.txt

Then we could just append in a loop, similar to $ cat hour01 hour02 > day31. Or, same thing, similar to $ rm day31; cat hour01 >> day31; cat hour02 >> day31.

        m = date_re.search(file)
        if m:
            date = m.group(1)
            print(f'Working on day {date} ...')
            # walk() yields bare filenames, so join with root to get a usable path
            with open(os.path.join(root, file)) as fin:
                with open(f'output-{date}.txt', 'a') as fout:
                    fout.write(fin.read())

And that's it, you're done. We read the hourly text and write it to the end of the daily file.

I mentioned the rm above because, if you're debugging and you run this twice or N times, you'll wind up with an output file N times bigger than you were hoping for.
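If the manual rm feels fragile, one possible alternative is to collect the paths for each day first and then open each daily file once in 'w' mode, which truncates any earlier contents. This is only a sketch under assumptions: write_daily_files, src_dir and out_dir are names invented for illustration, and the output-{date}.txt naming is reused from above.

```python
import os
import re
from collections import defaultdict

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')

def write_daily_files(src_dir, out_dir):
    """Concatenate the hourly files in src_dir into one file per day in out_dir.

    Hypothetical helper. Returns the sorted list of dates that were written.
    """
    by_day = defaultdict(list)
    for name in sorted(os.listdir(src_dir)):   # sorted so hours stay in order
        m = date_re.search(name)
        if m:
            by_day[m.group(1)].append(os.path.join(src_dir, name))

    os.makedirs(out_dir, exist_ok=True)
    for date, paths in by_day.items():
        # 'w' truncates, so every run rebuilds the daily file from scratch
        with open(os.path.join(out_dir, f'output-{date}.txt'), 'w') as fout:
            for path in paths:
                with open(path) as fin:
                    fout.write(fin.read())
    return sorted(by_day)
```

Calling write_daily_files('textFilesFolder', 'daily') twice in a row should leave the same output as calling it once, at the cost of holding the list of paths in memory.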

You can also try to do it like this for readability.

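from collections import defaultdict
import os
import pandas as pd

data = defaultdict(list)
for i in os.listdir('files/'):   # 'files' is a folder in the current directory
    print(i)                     # that holds your text files
    column = i.split('-')[3]     # day of month, e.g. '21' from id-2020-01-21-22.txt
    with open(os.path.join('files', i), 'r') as f:
        # split() with no argument splits on any whitespace and ignores the trailing empty line
        data[column].extend(f.read().split())
# Wrap each list in a Series so days with different numbers of ids still fit in one frame
df = pd.DataFrame({day: pd.Series(ids) for day, ids in data.items()})
print('---')
print(df)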

Output:

id-2020-01-22-01.txt
id-2020-01-22-00.txt
id-2020-01-21-23.txt
id-2020-01-21-22.txt
---
          22          21
0    1006523  1002323212
1   90381122  9038123912
2   28493423   284934212
3  100232323   100232323
4  903812332   903812392
5  284934212   284934289
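The question asks for one data frame per day that is then written to a combined text file, so here is a minimal sketch of that variant. It assumes pandas is installed; the function names day_frame and save_day_frame and the column name 'id' are invented for illustration, and the ids are kept as strings so nothing is reformatted on the way out.

```python
import pandas as pd

def day_frame(paths):
    """Read the given hourly files and return one single-column DataFrame.

    Hypothetical helper; 'id' is an assumed column name.
    """
    ids = []
    for path in paths:
        with open(path) as f:
            ids.extend(f.read().split())  # split() ignores the trailing empty line
    return pd.DataFrame({'id': ids})

def save_day_frame(df, out_path):
    """Write the ids back out, one per line, with no header and no index."""
    df.to_csv(out_path, index=False, header=False)
```

For example, day_frame(['files/id-2020-01-21-22.txt', 'files/id-2020-01-21-23.txt']) would give the frame for 2020-01-21, which save_day_frame(df, 'output-2020-01-21.txt') could then write out.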
