Read CSV file from Blob Storage to pandas dataframe and ignore pagination rows from source system

I have a task to read a CSV file from blob storage for data manipulation, which should be easy to do:

import pandas as pd
from io import StringIO
blob_client_instance = blobService.get_blob_client(
    "testflorencia", "TakeUpStores.csv", snapshot=None)

downloaded_blob = blob_client_instance.download_blob()
blob = downloaded_blob.content_as_text(encoding=None)
df = pd.read_csv(StringIO(blob))
df

However, I get this error:

initial_value must be str or None, not bytes

I am not able to share the file here because it's confidential, but I noticed that every 20 rows there is a special pagination row that starts with a special character:

 = 37.364.304;;;; --> special character not rendered by StackOverflow

How can I read this CSV into pandas and ignore those rows?

I also tried without the encoding parameter and got a different error:

'utf-8' codec can't decode byte 0xc3 in position 16515: invalid continuation byte

Filter out the special rows from the downloaded text, then feed the result to pandas.

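# ...
# Note: splitlines()/startswith() below need `blob` to be a str; the error in the
# question suggests encoding=None returns bytes, so an explicit encoding may be needed.
blob = downloaded_blob.content_as_text(encoding=None)
# Keep every line that is not a pagination row (adjust the test to match your file)
lines = "\n".join(line for line in blob.splitlines() if not line.startswith(" = "))
df = pd.read_csv(StringIO(lines))  # parse the filtered text, not the original blob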
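The filtering idea above can be tried without a storage account. The following is a minimal sketch, not the asker's actual setup: the sample bytes, the `latin-1` encoding, the `§` marker character, the `;` separator and the helper name `read_csv_without_pagination` are all assumptions made for illustration. The real file's encoding and marker character are unknown, so both are left as parameters.

```python
from io import StringIO

import pandas as pd


def read_csv_without_pagination(raw, encoding="latin-1", marker="§", **read_csv_kwargs):
    """Decode raw bytes, drop lines starting with `marker`, and parse the rest.

    `encoding` and `marker` are hypothetical defaults; set them to match the real file.
    """
    text = raw.decode(encoding)  # decode explicitly so StringIO receives a str
    kept = [line for line in text.splitlines() if not line.startswith(marker)]
    return pd.read_csv(StringIO("\n".join(kept)), **read_csv_kwargs)


# Made-up stand-in for the downloaded blob content (bytes, not UTF-8 encoded)
sample = (
    "store;city;sales\n"
    "A;São Paulo;10\n"
    "§ = 37.364.304;;;;\n"  # pagination row that should be skipped
    "B;Quito;20\n"
).encode("latin-1")

df = read_csv_without_pagination(sample, sep=";")
print(df)
```

With the real blob, the bytes would come from the download step in the question instead of `sample`. One thing to keep in mind: `str.splitlines()` also splits on characters such as form feed, so if the marker happens to be one of those, the pagination row would start with the text that follows it.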

If all your special pagination rows start with the same single character, you can make use of the comment parameter:

comment : str, optional

Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in 'a,b,c' being treated as the header.

df = pd.read_csv(StringIO(blob), comment='=')

or, depending on the first character of the pagination row:

df = pd.read_csv(StringIO(blob), comment=' ')
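As a quick check of how `comment` behaves, here is a small sketch on made-up data. The `#` marker, the `;` separator and the column names are assumptions for illustration only; the real marker character is not known from the question. It also shows a caveat that follows from the documentation quoted above: the comment character cuts off the rest of any line it appears in, not only lines that start with it.

```python
from io import StringIO

import pandas as pd

# Made-up data: the second data line is a pagination row starting with '#'
text = (
    "store;note;sales\n"
    "A;ok;10\n"
    "# = 37.364.304;;;;\n"
    "B;fine;20\n"
)
df_ok = pd.read_csv(StringIO(text), sep=";", comment="#")

# Caveat: a '#' inside a regular row truncates that row as well
text_with_marker_in_data = (
    "store;note;sales\n"
    "A;ok;10\n"
    "# = 37.364.304;;;;\n"
    "B;item #2;20\n"
)
df_truncated = pd.read_csv(StringIO(text_with_marker_in_data), sep=";", comment="#")

print(df_ok)
print(df_truncated)  # the 'sales' value of row B is lost
```

If the marker character can also occur inside regular rows, filtering the lines before parsing is the safer route.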
