Hadoop file system: open a file and skip the first line

Question

I'm reading files in HDFS using Python.

Each file has a header, and I'm trying to merge the files. However, the header of each file also gets merged into the output.

Is there a way to skip the header from the second file onward?

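# `sc` is the active SparkContext; `dst_file` and `overwrite` are defined elsewhere.
hadoop = sc._jvm.org.apache.hadoop
conf = hadoop.conf.Configuration()
fs = hadoop.fs.FileSystem.get(conf)

src_dir = "/mnt/test/"
out_stream = fs.create(hadoop.fs.Path(dst_file), overwrite)

# Collect the paths of all regular files in the source directory
files = []
for f in fs.listStatus(hadoop.fs.Path(src_dir)):
  if f.isFile():
    files.append(f.getPath())

# Append each file to the output stream (headers included)
for file in files:
  in_stream = fs.open(file)
  hadoop.io.IOUtils.copyBytes(in_stream, out_stream, conf, False)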

I have currently solved the problem with the logic below, but I would like to know whether there is a better, more efficient solution. I appreciate your help.

for idx, file in enumerate(files):
    if debug:
        print("Appending file {} into {}".format(file, dst_file))

    # Strip the header from every file after the first by rewriting it in place
    if idx > 0:
        local_path = '/' + str(file).replace(':', '')
        file_str = ""
        with open(local_path, 'r+') as f:
            for line_idx, line in enumerate(f):
                if line_idx > 0:
                    file_str = file_str + line

        with open(local_path, "w+") as f:
            f.write(file_str)

    in_stream = fs.open(file)   # open an input stream on the source file
    try:
        # close=False: copyBytes leaves both streams open, so out_stream can be reused
        hadoop.io.IOUtils.copyBytes(in_stream, out_stream, conf, False)
    finally:
        in_stream.close()

Answer 1

What you are doing now is repeatedly appending to a string, which is a fairly slow process. Why not write directly to the output file as you read?

for file_idx, file in enumerate(files):
  with open(...) as f_out, open(...) as f_in:
    for line_num, line in enumerate(f_in):
      # keep every line of the first file; skip line 0 (the header) of the others
      if file_idx == 0 or line_num > 0:
        f_out.write(line)
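
For reference, here is a minimal self-contained sketch of the same line-by-line idea. It assumes the files are reachable through the ordinary local file API (as the question's open() calls suggest); the function name merge_keep_first_header and its parameters are hypothetical, chosen only for illustration. The output file is opened once, outside the loop, so each source is appended instead of overwriting the previous one.

```python
def merge_keep_first_header(src_paths, dst_path):
    """Concatenate text files, keeping the header line of the first file only."""
    # newline="" leaves line endings untouched on both read and write
    with open(dst_path, "w", newline="") as f_out:
        for file_idx, src in enumerate(src_paths):
            with open(src, "r", newline="") as f_in:
                for line_num, line in enumerate(f_in):
                    # first file: copy everything; later files: drop line 0
                    if file_idx == 0 or line_num > 0:
                        f_out.write(line)
```

Only one line is held in memory at a time, so this should work for files of any size.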

If each file fits in memory, you can also skip the first line by calling readline and then readlines:

for file_idx, file in enumerate(files):
  with open(...) as f_out, open(...) as f_in:
    if file_idx != 0:
      f_in.readline()  # consume and discard the header line
    f_out.writelines(f_in.readlines())
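
A possible variation on the readline approach, sketched here under the same assumption that the files are accessible through the local file API: after discarding the header with readline, the rest of the file can be copied in chunks with shutil.copyfileobj, which avoids loading the whole file into memory. The function name merge_binary_skip_headers is hypothetical.

```python
import shutil


def merge_binary_skip_headers(src_paths, dst_path):
    """Concatenate files byte-for-byte, dropping the first line of all but the first."""
    with open(dst_path, "wb") as f_out:
        for file_idx, src in enumerate(src_paths):
            with open(src, "rb") as f_in:
                if file_idx != 0:
                    f_in.readline()  # read up to the first b"\n" and discard it
                # copy the remainder in fixed-size chunks
                shutil.copyfileobj(f_in, f_out)
```

Note that if a source file does not end with a newline, its last row will run into the first row of the next file; this sketch does not handle that case.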
