
How to read multiple text files, reading together only the files of the same group?

I have several text files in my directory, like this:

id-2020-01-21-22.txt
id-2020-01-21-23.txt
id-2020-01-22-00.txt
id-2020-01-22-01.txt
id-2020-01-22-02.txt
id-2020-01-23-00.txt
id-2020-01-24-00.txt

How can I read them in groups? That is, read id-2020-01-21-22.txt and id-2020-01-21-23.txt together first, load them into a data frame, and write them to a combined text file; then do the same for id-2020-01-22-00.txt, id-2020-01-22-01.txt and id-2020-01-22-02.txt; and so on until the last file in the directory.

The inner structure of every text file looks like this:

100232323\n
903812398\n
284934289\n
{empty line placeholder}

There is no header, but each text file has an empty line at the end. I am new to Python and would appreciate any help.

This is how far I have gotten:

import os

new_list = []
for root, dirs, files in os.walk('./textFilesFolder'):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r') as f:
                text = f.read()
                new_list.append(text)


print(new_list)

You want daily summaries that concatenate the hourly files. OK, good.

Create a regex that captures the year-month-day date:

import re

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')
prev_date = None

Now, in your loop, you can replace the existing if with:

        m = date_re.search(file)
        if m:
            date = m.group(1)
            print(f'Working on day {date} ...')
            ...
            prev_date = date

Having parsed out the date, you can notice when it changes, for example by checking whether prev_date != date, and take appropriate action, such as starting a new output file.
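
To make the "notice when the date changes" idea concrete, here is a minimal sketch that uses itertools.groupby instead of a hand-maintained prev_date variable. The helper name group_by_day and the sample names list are made up for illustration; the only assumption is that file names follow the id-YYYY-MM-DD-HH.txt pattern from the question.

```python
import re
from itertools import groupby

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def group_by_day(names):
    """Return {date: [file names]} for names shaped like id-YYYY-MM-DD-HH.txt."""
    # Drop anything that does not match, then sort so that files from
    # the same day sit next to each other.
    matched = sorted(n for n in names if date_re.search(n))
    # groupby starts a new group exactly when the date part changes,
    # which is the prev_date comparison done for us.
    return {
        date: list(group)
        for date, group in groupby(matched, key=lambda n: date_re.search(n).group(1))
    }


# Hypothetical sample input, copied from the listing in the question.
names = [
    'id-2020-01-22-01.txt',
    'id-2020-01-21-23.txt',
    'id-2020-01-21-22.txt',
    'id-2020-01-22-00.txt',
    'id-2020-01-22-02.txt',
    'id-2020-01-23-00.txt',
    'id-2020-01-24-00.txt',
    'notes.txt',  # ignored: does not match the pattern
]
groups = group_by_day(names)
for date, files in groups.items():
    print(date, files)
```

Sorting first matters: groupby only merges adjacent items, and neither os.walk() nor os.listdir() promises any particular order.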

Or consider using with open(f'output-{date}.txt', 'a') as fout: so that you append to a (potentially already existing) file. That way the filesystem remembers things for you, and you don't need to keep track of more variables in your program.

BTW, your use of walk() is perfectly fine, kudos on that. But for this directory of files, the structure is simple enough that you could use glob:

import glob

new_list = []
for file in glob.glob('id-*.txt'):
    ...
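
One thing to watch for when combining glob with the regex above: if the pattern includes a folder, glob returns paths that include that folder, so the ^id- anchor no longer matches the full string. A small sketch of one way around that, matching on the base name only; the helper name dated_files and the folder argument are assumptions for illustration, not part of the original code.

```python
import glob
import os
import re

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def dated_files(folder):
    """Yield (date, path) pairs for id-YYYY-MM-DD-HH.txt files in folder, in name order."""
    # glob makes no ordering promise, so sort to get the hours in sequence.
    for path in sorted(glob.glob(os.path.join(folder, 'id-*.txt'))):
        # Match against the base name: the regex is anchored at '^id-',
        # which would fail on a path such as 'textFilesFolder/id-...'.
        m = date_re.search(os.path.basename(path))
        if m:
            yield m.group(1), path
```

Usage would look like for date, path in dated_files('./textFilesFolder'): ..., with the folder name taken from the question.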

EDIT

Suppose we start with a clean slate, no output files:

$ rm output-*.txt

Then we can simply append in a loop, similar to $ cat hour01 hour02 > day31. Or, equivalently, $ rm day31; cat hour01 >> day31; cat hour02 >> day31.

        m = date_re.search(file)
        if m:
            date = m.group(1)
            print(f'Working on day {date} ...')
            with open(file) as fin:
                with open(f'output-{date}.txt', 'a') as fout:
                    fout.write(fin.read())

And that's it, you're done. We read the hourly text and write it to the end of the daily file.

I mentioned the rm above because, if you're debugging and you run this twice, or N times, you'll wind up with an output file N times bigger than you were hoping for.
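
If remembering the rm is a nuisance, an alternative is to collect the file names per day first and then open each daily file once in 'w' mode, which truncates whatever a previous run left behind. The following is only a sketch under that assumption; the function name combine_by_day and the src_dir / out_dir parameters are invented for this example.

```python
import os
import re
from collections import defaultdict

date_re = re.compile(r'^id-(\d{4}-\d{2}-\d{2})-\d{2}\.txt$')


def combine_by_day(src_dir, out_dir):
    """Write one output-<date>.txt per day and return the paths written."""
    # First pass: bucket the hourly file names by their date part.
    by_day = defaultdict(list)
    for name in sorted(os.listdir(src_dir)):
        m = date_re.search(name)
        if m:
            by_day[m.group(1)].append(name)

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for date, names in by_day.items():
        out_path = os.path.join(out_dir, f'output-{date}.txt')
        # 'w' truncates an existing file, so running this again does not
        # duplicate lines the way repeated appends would.
        with open(out_path, 'w') as fout:
            for name in names:
                with open(os.path.join(src_dir, name)) as fin:
                    fout.write(fin.read())
        written.append(out_path)
    return written
```

The trade-off is one extra dictionary in memory in exchange for a script that can be re-run safely.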

You can also try it like this, for readability.

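from collections import defaultdict
import os
import pandas as pd

data = defaultdict(list)
for i in os.listdir('files/'):   # 'files' is a folder in the current directory
    print(i)                     # that holds your text files
    # Day of month, e.g. '21' from id-2020-01-21-22.txt.
    # Note: this ignores year and month, so it assumes one month of files.
    column = i.split('-')[3]
    with open(os.path.join('files', i), 'r') as f:
        # split() drops the newlines and the trailing empty line
        data[column].extend(f.read().split())
# Wrapping each list in a Series lets days with different numbers of
# lines share one frame; shorter columns are padded with NaN.
df = pd.DataFrame({day: pd.Series(values) for day, values in data.items()})
print('---')
print(df)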

Output:

id-2020-01-22-01.txt
id-2020-01-22-00.txt
id-2020-01-21-23.txt
id-2020-01-21-22.txt
---
          22          21
0    1006523  1002323212
1   90381122  9038123912
2   28493423   284934212
3  100232323   100232323
4  903812332   903812392
5  284934212   284934289
