Read a CSV file from Blob storage into a pandas DataFrame and ignore pagination rows from the source system

Question

My task is to read a CSV file from blob storage for data manipulation, which should be easy to do:

import pandas as pd
from io import StringIO
blob_client_instance = blobService.get_blob_client(
    "testflorencia", "TakeUpStores.csv", snapshot=None)

downloaded_blob = blob_client_instance.download_blob()
blob = downloaded_blob.content_as_text(encoding=None)
df = pd.read_csv(StringIO(blob))
df

However, I get this error:

initial_value must be str or None, not bytes

I cannot share the file here because it's confidential, but I did notice that every 20 rows there is a pagination row that starts with a special character:

 = 37.364.304;;;; --> special character not rendered by StackOverflow

How can I read this CSV into pandas and ignore those rows?

I also tried without the encoding parameter and got a different error:

'utf-8' codec can't decode byte 0xc3 in position 16515: invalid continuation byte

Answer 1

Filter out the special rows from the downloaded text, then feed the filtered text to pandas.

# ...
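# encoding=None returns bytes, so pass the file's real encoding instead.
# "latin-1" is an assumption based on the 0xc3 decode error; adjust as needed.
blob = downloaded_blob.content_as_text(encoding="latin-1")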
# Drop the pagination rows; adjust the test to whatever marks a special row.
lines = "\n".join(line for line in blob.splitlines() if not line.startswith(" = "))
df = pd.read_csv(StringIO(lines))  # parse the filtered text, not the original blob
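
As a self-contained sketch of this filtering approach, the snippet below uses a few hand-written bytes in place of the downloaded blob. The Latin-1 encoding, the ';' separator, the column names, and the '§' marker character are all assumptions for illustration, since the question does not show the real file's encoding or marker.

```python
from io import StringIO

import pandas as pd

# Hypothetical marker that opens each pagination row (the real one is unknown).
PAGE_MARKER = "§"

# Hypothetical stand-in for the bytes returned by the blob download.
raw_bytes = (
    "store;city;sales\n"
    "1;São Paulo;100\n"
    "§ = 37.364.304;;;;\n"
    "2;Belém;200\n"
).encode("latin-1")

# Decode explicitly so StringIO receives str, not bytes.
text = raw_bytes.decode("latin-1")

# Keep every line that does not begin with the pagination marker.
kept = [line for line in text.splitlines() if not line.startswith(PAGE_MARKER)]

df = pd.read_csv(StringIO("\n".join(kept)), sep=";")
print(df)
```

Filtering before parsing keeps the pagination rows away from pandas entirely, so their extra separators cannot cause field-count errors.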

Answer 2

If all of your pagination rows start with the same single character, you can use the comment parameter of read_csv:

comment : str, optional

Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in 'a,b,c' being treated as the header.
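
The example from the quoted documentation can be run directly. This is a minimal sketch using only the string and parameters given in the quote:

```python
from io import StringIO

import pandas as pd

# The input from the documentation example: a fully commented first line,
# then the header row, then one data row.
data = "#empty\na,b,c\n1,2,3"

# The commented line is skipped, so header=0 picks up "a,b,c".
df = pd.read_csv(StringIO(data), comment="#", header=0)
print(list(df.columns))
print(df)
```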

df = pd.read_csv(StringIO(blob), comment='=')

or, depending on the first character of the pagination row:

df = pd.read_csv(StringIO(blob), comment=' ')
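
One caveat follows from the quoted documentation: the comment character cuts off the remainder of a line wherever it appears, not only at the start. The sketch below uses made-up data (the ';' separator and column names are assumptions) to show a data row being truncated because one of its fields contains '='. A character such as a space is likely to be riskier still, because it is common inside ordinary fields.

```python
from io import StringIO

import pandas as pd

# Made-up data: the second line has '=' inside a field, the third line is a
# pagination-style row that starts with '='.
data = (
    "store;note;sales\n"
    "1;a=b;100\n"
    "= 37.364.304;;;;\n"
    "2;plain;200\n"
)

df = pd.read_csv(StringIO(data), sep=";", comment="=")

# The pagination row is dropped, but the first data row is cut at '=':
# its note field loses "=b" and its sales value goes missing.
print(df)
```

If the marker character can also occur in real data, filtering the lines before parsing is the safer route.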

Answer 1: 2 accepted, 2022-09-08 10:00:07
Answer 2: 1 accepted, 2022-09-09 12:56:15