Go over a .gz file and copy the sentence that starts with “trex” using Python

I need to go over a .gz file with Python and copy the sentence that starts with "trex". I'm quite new to Python and Linux, so I am not sure whether "trex" has any special meaning; hence, I treated it as a plain string.

My first step was to copy the contents of the file into a variable:

file1=gzip.open(filepath, 'r')
result=file1.read() #this works
file1.close()

Then I tried a few things: converting it into a string, then splitting on "\n":

result=str(result)
r=result.split("\n")
print(r[0]) # but this prints the entire file! Not just the first line as I expected

I tried doing the same thing without converting result into a string, but to no avail.

I also tried copying it into a different file and then looking for "trex" in a few ways:

output= open('test_command',"w") #also tried with 'wb' 
output.write(result) #it writes only the first line into output
print(output) #only first line...
output.close()

I also tried:

result=output.readlines() #.readlines() yields an error because it isn't recognized (same for .readline)

It seems the problem lies in copying the content of the .gz file with this method.

I tried copying one line at a time:

output= open('test_command','w') #also tried with 'wb' and without converting 'result' to a string
for line in result:
    output.write(line)
output.close()

I also tried this (without writing result to a new file):

for line in result:
    if line[0:4] == 'trex':
        print(line)

I tried a few combinations of these methods as well, but to keep this question from getting any longer, I believe these will suffice.

I think you forgot to decode the binary data read from the file, using the bytes.decode('ascii') method. Gzipping and gunzipping a file works for me:

import gzip

content = b"""
Lots of content with the word trex in here
trex at start
another trex inlined
trex the second time infront
...and the last line with trex in it
"""

# Write the bytes to a gzip-compressed file; "with" closes the file reliably.
with gzip.open('gziptest.gz', 'wb') as f:
    f.write(content)

# Read the file back (gzip decompresses on read) and decode the bytes to str.
with gzip.open('gziptest.gz', 'rb') as f:
    file_content = f.read().decode('ascii')

print("compressed file content:\n%s\n" % file_content)

# Keep only the lines that begin with 'trex'.
trex = [line for line in file_content.split("\n") if line.startswith('trex')]
print("trex: %s" % trex)

The above example writes a multi-line byte string, compressed, to a file called gziptest.gz in the current directory. Then it reads the file back, decompresses it, and decodes the bytes to text. Finally, every line beginning with the word 'trex' is selected and printed as a list.
Output:

compressed file content:

Lots of content with the word trex in here
trex at start
another trex inlined
trex the second time infront
...and the last line with trex in it

trex: ['trex at start', 'trex the second time infront']
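
For larger files, a variation on the same idea is to let gzip do the decoding and read line by line instead of loading everything at once. The following is only a minimal sketch: the helper name lines_starting_with, the demo file name demo.gz, and the utf-8 encoding are my assumptions, not anything from the question. The first part also shows why the str() attempt printed the whole file: str() on a bytes object returns its repr, where each newline is the two characters backslash and n, so splitting on "\n" finds nothing to split on.

```python
import gzip
import os
import tempfile


def lines_starting_with(path, prefix, encoding="utf-8"):
    """Return the lines of a gzip-compressed text file that begin with prefix.

    Hypothetical helper for illustration; the encoding default is an assumption.
    """
    matches = []
    # Mode "rt" makes gzip decode the bytes to str, so iteration yields text lines.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            if line.startswith(prefix):
                matches.append(line.rstrip("\n"))
    return matches


# Why the str() attempt fails: the repr of bytes contains no real newlines.
raw = b"first line\ntrex at start\n"
split_repr = str(raw).split("\n")              # one element: the whole repr
split_decoded = raw.decode("utf-8").split("\n")  # real lines (plus a trailing '')

# Hypothetical demo file, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wb") as handle:
    handle.write(b"intro\ntrex at start\nanother trex inlined\ntrex again\n")

found = lines_starting_with(demo_path, "trex")
print(len(split_repr), len(split_decoded))
print(found)
```

If the real file is not plain ASCII or UTF-8, the encoding argument would need to match whatever the file actually uses.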
