Using pysftp to split text file in SFTP directory

I'm trying to split a 100 MB text file (with unique rows) into 10 files of equal size using Python's pysftp, but I can't find a proper approach.

Please let me know how I can read and split a file from an SFTP directory and place all the resulting files back in the same SFTP directory.

import pysftp

with pysftp.Connection(host=sftphostname, username=sftpusername, port=sftpport, private_key=sftpkeypath) as sftp:
    with sftp.open(source_filedir+source_filename) as file:
        for line in file:
            pass  # <-- unable to decide on the splitting logic here

The logic you probably need is as follows:

  1. As you are in a read-only environment, you will need to download the whole file into memory.

  2. Use Python's io.StringIO() to handle the data in memory as if it were a file.

  3. As you are talking about rows, I assume the file is in CSV format? If so, you can use Python's csv library to parse it.

  4. First do a quick scan of the file with a csv.reader() to count the number of rows. The count lets you split the file into parts with an equal number of rows, rather than cutting it at fixed byte offsets.

  5. Once you know the number of rows, reopen the data (as a file again) and read in just the header row. It can then be written as the first row of each split file you create.

  6. Now read in n rows (based on your total row count). Use a csv.writer() and another io.StringIO() to first write the header row and then write the split rows into memory. This buffer can then be uploaded with pysftp to a new file on the server, all without requiring access to an actual file system.

The result is that each file also has a valid header row.
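The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the file is a CSV with a header row, it fits in memory, and it is UTF-8 encoded. The helper names split_csv_text and split_remote_csv and the ".partNN" output naming are made up for this example. The sftp argument is expected to be an open pysftp.Connection; only its getfo() and putfo() methods are used.

```python
import csv
import io


def split_csv_text(text, parts):
    """Split CSV text into at most `parts` CSV strings, each starting with the header."""
    # First pass: count the data rows (everything after the header row).
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    total_rows = sum(1 for _ in reader)
    rows_per_part = max(1, -(-total_rows // parts))  # ceiling division

    # Second pass: re-read the data and write header + rows into new buffers.
    reader = csv.reader(io.StringIO(text, newline=""))
    next(reader)  # skip the header row
    chunks = []
    while True:
        out = io.StringIO(newline="")
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        written = 0
        for row in reader:  # resumes where the previous chunk stopped
            writer.writerow(row)
            written += 1
            if written == rows_per_part:
                break
        if written == 0:  # no rows left, discard the header-only buffer
            break
        chunks.append(out.getvalue())
    return chunks


def split_remote_csv(sftp, remote_dir, filename, parts=10, encoding="utf-8"):
    """Download a remote CSV into memory, split it, and upload the parts."""
    raw = io.BytesIO()
    sftp.getfo(remote_dir + filename, raw)  # download into memory
    chunks = split_csv_text(raw.getvalue().decode(encoding), parts)
    for i, chunk in enumerate(chunks, start=1):
        # Hypothetical naming scheme: data.csv.part01, data.csv.part02, ...
        target = f"{remote_dir}{filename}.part{i:02d}"
        sftp.putfo(io.BytesIO(chunk.encode(encoding)), target)
    return len(chunks)
```

Inside the question's with pysftp.Connection(...) as sftp: block, this would be called as split_remote_csv(sftp, source_filedir, source_filename, parts=10), assuming source_filedir ends with a slash as the question's path concatenation suggests.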

I don't think FTP / SFTP allow for anything more clever than simply downloading the file. That means you'd have to get the whole file, split it locally, then put the new files back.

For the text file splitting logic, I believe this thread may be of use: Split large files using python
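As a rough illustration of the local splitting step, here is a minimal sketch that cuts downloaded bytes into parts of roughly equal size while only breaking at line ends. It is not taken from the linked thread; the function name split_lines is made up for this example, and it assumes the whole file fits in memory.

```python
def split_lines(data, parts):
    """Split bytes into at most `parts` chunks of similar size, cutting only at line ends."""
    target = -(-len(data) // parts)  # ceiling division: bytes per chunk
    chunks, current, size = [], [], 0
    for line in data.splitlines(keepends=True):
        current.append(line)
        size += len(line)
        # Close a chunk once it reaches the target size, but keep the
        # remainder together so that no more than `parts` chunks are produced.
        if size >= target and len(chunks) < parts - 1:
            chunks.append(b"".join(current))
            current, size = [], 0
    if current:
        chunks.append(b"".join(current))
    return chunks
```

Because lines are kept whole, the parts are only approximately equal in size unless all lines have the same length. Each returned chunk can then be written to a local file or wrapped in io.BytesIO and uploaded.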

There is a library called filesplit that you can use to split files. It offers functionality similar to the Linux commands split and csplit.

For your case

split text file of size 100 MB into 10 files of equal size

you can use the bysize method:

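import os
from filesplit.split import Split

# Assumption: filesplit works on local paths, so the file has to be
# downloaded from the SFTP server before this runs.
infile = source_filedir + source_filename
outdir = source_filedir
split = Split(infile, outdir)  # construct the splitter

file_size = os.path.getsize(infile)
desired_parts = 10
# Round up so the size is an integer and no small extra file is left over.
bytes_per_split = -(-file_size // desired_parts)

split.bysize(bytes_per_split)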

For a line-partitioned split, use bylinecount:

from filesplit.split import Split

split = Split(infile, outdir)
split.bylinecount(1_000_000)  # for a million lines each file 
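To end up with about 10 files rather than files of a fixed line count, the number of lines per file has to be derived from the total line count first. A minimal sketch using only the standard library follows; the helper name lines_per_part is made up for this example, and it assumes the file is already available locally.

```python
import math


def lines_per_part(path, parts):
    """Return the line count per output file so `path` splits into at most `parts` files."""
    with open(path, "rb") as handle:
        total_lines = sum(1 for _ in handle)  # streams the file, no full load
    return max(1, math.ceil(total_lines / parts))
```

The result can then be passed on, for example split.bylinecount(lines_per_part(infile, 10)).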

Bonus

Since Python 3.6 you can use underscores in numeric literals (see PEP 515), for example million = 1_000_000, to improve readability.
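A small illustration of the underscore syntax (the variable names are just for this example):

```python
million = 1_000_000
assert million == 1000000      # underscores do not change the value

parsed = int("1_000_000")      # int() accepts the same grouping in strings
formatted = f"{million:_}"     # the "_" format option groups digits on output
```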
