I have several text files in my directory, like this:
id-2020-01-21-22.txt
id-2020-01-21-23.txt
id-2020-01-22-00.txt
id-2020-01-22-01.txt
id-2020-01-22-02.txt
id-2020-01-23-00.txt
id-2020-01-24-00.txt
How can I read them in groups by day? That is, read id-2020-01-21-22.txt and id-2020-01-21-23.txt together first, make them into a data frame, and write them to a combined text file; then read id-2020-01-22-00.txt, id-2020-01-22-01.txt and id-2020-01-22-02.txt together, make them into a data frame, and so on until the last file in the directory.
The inner structure of every text file looks like this:
100232323\n
903812398\n
284934289\n
{empty line placeholder}
There is no header, but each text file has an empty line at the end. I am new to Python and would appreciate any help.
This is how far I have gotten:
import os
new_list = []
for root, dirs, files in os.walk('./textFilesFolder'):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r') as f:
                text = f.read()
                new_list.append(text)
print(new_list)
You want daily summaries where you concatenate the hourly files together. OK, good.
Create a year-month-day date regex:
import re
date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')
prev_date = None
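As a quick check of what that pattern accepts, here is a minimal sketch. The helper name day_of is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as above: capture the YYYY-MM-DD part, ignore the hour.
date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def day_of(filename):
    """Return the 'YYYY-MM-DD' part of an hourly file name, or None."""
    m = date_re.search(filename)
    return m.group(1) if m else None


print(day_of('id-2020-01-21-22.txt'))  # 2020-01-21
print(day_of('notes.txt'))             # None
```

Files that do not follow the id-YYYY-MM-DD-HH.txt naming simply fail to match, so they can be skipped without a separate endswith() test.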
Now in your loop you can replace the existing if with:
m = date_re.search(file)
if m:
    date = m.group(1)
    print(f'Working on day {date} ...')
    ...
    prev_date = date
Having parsed out the date, you can now notice when it changes, for example by checking whether prev_date != date, and take appropriate action, like writing out to a new file.
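The change-detection idea could be sketched roughly as follows. The function name split_by_day is hypothetical, and the sketch assumes the names are sorted first so that all hours of one day sit next to each other.

```python
import re

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def split_by_day(filenames):
    """Group hourly file names into (date, [names]) batches, in date order."""
    batches = []
    prev_date = None
    for name in sorted(filenames):  # zero-padded names sort chronologically
        m = date_re.search(name)
        if not m:
            continue  # skip anything that is not an hourly id file
        date = m.group(1)
        if date != prev_date:
            batches.append((date, []))  # the day changed: start a new batch
        batches[-1][1].append(name)
        prev_date = date
    return batches


names = ['id-2020-01-22-00.txt', 'id-2020-01-21-23.txt', 'id-2020-01-21-22.txt']
for date, batch in split_by_day(names):
    print(date, batch)
```

Each batch can then be read and written out as one unit, which is the "take appropriate action" step.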
Or consider using with open(f'output-{date}.txt', 'a') as fout: to let you append to a (potentially already existing) file. That way the filesystem remembers things for you, so you don't need to keep track of more variables in your program.
BTW, your use of walk() is perfectly nice, kudos on that. But for this directory of files, the structure is simple enough that you could use glob:
import glob
new_list = []
for file in glob.glob('id-*.txt'):
    ...
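Two details are worth keeping in mind with glob: the pattern is resolved relative to the current directory unless you include the folder in it, and the results come back in arbitrary order. A small sketch, where the helper name hourly_files is an assumption and the demo runs in a throwaway temporary directory:

```python
import glob
import os
import tempfile


def hourly_files(folder):
    """Return the id-*.txt paths inside `folder`, sorted by name."""
    # glob.glob() makes no ordering promise, so sort explicitly.
    return sorted(glob.glob(os.path.join(folder, 'id-*.txt')))


# Demo: create a few empty files and list them back.
with tempfile.TemporaryDirectory() as folder:
    for name in ('id-2020-01-22-00.txt', 'id-2020-01-21-23.txt', 'readme.md'):
        open(os.path.join(folder, name), 'w').close()
    found = [os.path.basename(p) for p in hourly_files(folder)]

print(found)  # ['id-2020-01-21-23.txt', 'id-2020-01-22-00.txt']
```

Unlike the bare names that os.walk() yields, the paths returned here already include the folder, so they can be passed straight to open().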
EDIT
Suppose we start with a clear slate, no output files:
$ rm output-*.txt
Then we could just append in a loop, similar to $ cat hour01 hour02 > day31. Or, same thing, similar to $ rm day31; cat hour01 >> day31; cat hour02 >> day31.
m = date_re.search(file)
if m:
    date = m.group(1)
    print(f'Working on day {date} ...')
    # os.walk() yields bare file names, so join them with root before opening
    with open(os.path.join(root, file)) as fin:
        with open(f'output-{date}.txt', 'a') as fout:
            fout.write(fin.read())
And that's it, you're done. We read the hourly text and write it to the end of the daily file.
I mentioned the rm above because, if you're debugging and you run this twice or N times, you'll wind up with an output file N times bigger than you were hoping for.
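If you would rather not depend on the rm step, one alternative is to collect the hourly paths per day first and then write each daily file once in 'w' mode, which truncates whatever was there. This is only a sketch under that assumption: combine_by_day and its src_dir / out_dir parameters are hypothetical names, and the demo uses temporary directories with the sample numbers from the question.

```python
import glob
import os
import re
import tempfile
from collections import defaultdict

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def combine_by_day(src_dir, out_dir):
    """Write one output-<date>.txt per day; return the paths written."""
    by_day = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(src_dir, 'id-*.txt'))):
        m = date_re.search(os.path.basename(path))
        if m:
            by_day[m.group(1)].append(path)

    written = []
    for date, paths in by_day.items():
        out_path = os.path.join(out_dir, f'output-{date}.txt')
        # 'w' truncates an existing file, so reruns do not double the data.
        with open(out_path, 'w') as fout:
            for path in paths:
                with open(path) as fin:
                    fout.write(fin.read())
        written.append(out_path)
    return written


# Demo: run twice and confirm the daily file does not grow.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as out:
    samples = {'id-2020-01-21-22.txt': '100232323\n',
               'id-2020-01-21-23.txt': '903812398\n',
               'id-2020-01-22-00.txt': '284934289\n'}
    for name, text in samples.items():
        with open(os.path.join(src, name), 'w') as f:
            f.write(text)
    combine_by_day(src, out)
    combine_by_day(src, out)  # second run overwrites, it does not append
    with open(os.path.join(out, 'output-2020-01-21.txt')) as f:
        day_21 = f.read()

print(repr(day_21))  # '100232323\n903812398\n'
```

The trade-off is one extra dictionary in the program in exchange for output that is the same no matter how many times the script runs.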
You can also try it like this, for readability.
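from collections import defaultdict
import os
import pandas as pd
data = defaultdict(list)
for i in os.listdir('files/'):  # 'files' is a folder in the current directory
    print(i)                    # that holds your text files
    column = i.split('-')[3]    # day of month, e.g. '21'
    with open('files/' + i, 'r') as f:
        file_data = f.read().replace('\n', ' ').split(' ')
    data[column].extend(file_data[:-1])  # drop the empty string left by the final newline
# Wrap each list in a Series so days with different numbers of lines still fit (padded with NaN)
df = pd.DataFrame({day: pd.Series(values) for day, values in data.items()})
print('---')
print(df)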
Output:
id-2020-01-22-01.txt
id-2020-01-22-00.txt
id-2020-01-21-23.txt
id-2020-01-21-22.txt
---
          22          21
0    1006523  1002323212
1   90381122  9038123912
2   28493423   284934212
3  100232323   100232323
4  903812332   903812392
5  284934212   284934289