
Having trouble reading a tab-separated value txt file in Python

I am trying to read a tab-separated value txt file in Python that I extracted from AWS storage. (AWS credentials are censored with XXX.)

import io
import pandas as pd
import boto3
import csv
from bioservices import UniProt
from sqlalchemy import create_engine
s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-2',
    aws_access_key_id='XXX',
    aws_secret_access_key='XXX'
)

That part is simply for connecting to AWS. Next, I run this code to read a tab-separated txt file that is stored in AWS:

txt = s3.Bucket('compound-bioactivity-original-files').Object('helper-files/kinhub_human_kinase_list_30092021.txt').get()
txt_reader = csv.reader(txt, delimiter='\t')
for line in txt_reader:
    print(line)

I get the output below, which is not what I am looking for. Using dialect='excel-tab' instead of delimiter='\t' gives me the same output:

['ResponseMetadata']
['AcceptRanges']
['LastModified']
['ContentLength']
['ETag']
['VersionId']
['ContentType']
['Metadata']
['Body']

There are several issues with your code.

First, Object.get() does not return the contents of the Amazon S3 object. Instead, as per the Object.get() documentation, it returns:

{
    'Body': StreamingBody(),
    'AcceptRanges': 'string',
    'LastModified': datetime(2015, 1, 1),
    'ContentLength': 123,
    'ETag': 'string',
    'VersionId': 'string',
    'CacheControl': 'string',
    'ContentDisposition': 'string',
    ...
    'BucketKeyEnabled': True|False,
    'TagCount': 123,
}

You can see this happening by inserting print(txt) as a debugging line. Because txt is a dictionary, csv.reader iterates over its keys, which is why your loop printed the key names.
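As a small sketch of why the loop printed key names: csv.reader iterates over whatever object it is given, and iterating over a dictionary yields its keys. The dictionary below is a hypothetical stand-in for the get() response, not real S3 output.

```python
import csv

# Hypothetical stand-in for the dict returned by Object.get();
# the real response has more keys and real values.
fake_response = {
    "ResponseMetadata": {},
    "ContentLength": 123,
    "Body": None,
}

# csv.reader iterates over the dict, which yields its keys one at a time,
# so every "row" is just a single key name.
rows = list(csv.reader(fake_response, delimiter="\t"))
print(rows)  # [['ResponseMetadata'], ['ContentLength'], ['Body']]
```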

If you want to access the contents of the object, you would use the Body element. To retrieve the contents of the streaming body, you can use .read().

However, this comes back as a bytes object ("binary string"), since the object is treated as a binary file. In Python, you can convert it to a normal string by using .decode('ascii'), or .decode('utf-8') if the file may contain non-ASCII characters. See: How to convert 'binary string' to normal string in Python3?

Therefore, you would actually need to use:

txt = s3.Bucket('bucketname').Object('object.txt').get()['Body'].read().decode('ascii')
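The get/read/decode chain can be tried out without touching AWS. In this minimal sketch, read_body_text is a hypothetical helper (not part of boto3), the io.BytesIO object stands in for the StreamingBody, the sample data is made up, and the 'utf-8' default is an assumption; pass 'ascii' to match the line above.

```python
import io


def read_body_text(response, encoding="utf-8"):
    """Return the object contents as str from a get()-style response dict.

    Hypothetical helper: it only relies on response['Body'] having .read().
    """
    raw = response["Body"].read()  # bytes, not str
    return raw.decode(encoding)    # bytes -> str


# Stand-in for the dict returned by Object.get(); the data is made up.
fake_response = {"Body": io.BytesIO(b"col1\tcol2\nA\tB\n")}

text = read_body_text(fake_response)
print(repr(text))  # 'col1\tcol2\nA\tB\n'
```

With a real object, the same helper would be called as read_body_text(s3.Bucket('bucketname').Object('object.txt').get()).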

(If that seems too complex, you could simply download the file to the local disk and then use the CSV reader on the file. That would work nicely without having to use get/read/decode.)

The next issue is that the documentation for csv.reader says:

csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called.

Since decode() returns a single string, passing it to csv.reader makes the reader iterate over the individual characters in the string, not over its lines. Splitting the string into lines first (for example with txt.splitlines()) or wrapping it in io.StringIO gives the reader the lines it expects.
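The difference is easy to see on a small string. This is a sketch with made-up sample data; it compares passing the string directly to csv.reader with wrapping it in io.StringIO so that the reader receives whole lines.

```python
import csv
import io

text = "a\tb\nc\td\n"  # made-up sample standing in for the decoded file

# Passing the string directly: each character is treated as its own "line".
char_rows = list(csv.reader(text, delimiter="\t"))

# Wrapping the string in StringIO: the reader now receives whole lines.
line_rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))

print(char_rows[0])  # ['a']
print(line_rows)     # [['a', 'b'], ['c', 'd']]
```

Passing text.splitlines() instead of the StringIO wrapper should work as well, since a list of strings also satisfies the iterator requirement quoted above.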

Frankly, you could process the lines without using the CSV reader, simply by splitting on newlines and tabs, like this:

import boto3

s3 = boto3.resource('s3')

txt = s3.Bucket('bucketname').Object('object.txt').get()['Body'].read().decode('ascii')

lines = txt.split('\n')

for line in lines:
    fields = line.split('\t')
    print(fields)
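One caveat with split('\n'): a trailing newline produces an empty final line, and Windows-style line endings leave a stray '\r' on the last field. A hedged variant, shown on made-up sample data, uses splitlines() and skips blank lines. Whether the actual file has either of these quirks is an assumption.

```python
text = "a\tb\r\nc\td\n"  # made-up sample with a Windows line ending

# Splitting on '\n' keeps the '\r' and yields an empty last line.
naive = [line.split("\t") for line in text.split("\n")]

# splitlines() handles both '\n' and '\r\n'; 'if line' drops blank lines.
robust = [line.split("\t") for line in text.splitlines() if line]

print(naive)   # [['a', 'b\r'], ['c', 'd'], ['']]
print(robust)  # [['a', 'b'], ['c', 'd']]
```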

All of the above issues should have been noticeable by adding some debugging to see whether each step was returning the data that you expected, such as printing the contents of the variables after each step.
