Read CSV file from Blob Storage to pandas dataframe and ignore pagination rows from source system
I have to read a CSV file from blob storage for data manipulation, which should be easy to do:
import pandas as pd
from io import StringIO
blob_client_instance = blobService.get_blob_client(
"testflorencia", "TakeUpStores.csv", snapshot=None)
downloaded_blob = blob_client_instance.download_blob()
blob = downloaded_blob.content_as_text(encoding=None)
df = pd.read_csv(StringIO(blob))
df
However, I get this error:
initial_value must be str or None, not bytes
I am not able to share the file here because it's confidential, but I did notice that every 20 rows there is a special pagination row with a special character:
= 37.364.304;;;; --> special character not rendered by StackOverflow
How can I read this CSV into pandas and ignore those rows?
I also tried without the encoding parameter and got a different error:
'utf-8' codec can't decode byte 0xc3 in position 16515: invalid continuation byte
Filter out the special rows from the downloaded text, then feed the result to pandas.
# ...
blob = downloaded_blob.content_as_text(encoding=None)
# Keep only the lines that are not pagination rows (adjust the test to match your special rows)
lines = "\n".join(line for line in blob.splitlines() if not line.startswith(" = "))
df = pd.read_csv(StringIO(lines))  # parse the filtered text, not the original blob
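Both errors in the question suggest the text was never decoded correctly: with encoding=None the content arrives as bytes, and the 0xc3 failure hints that the file may be in a single-byte encoding rather than UTF-8. The following is a minimal sketch of decoding explicitly and then filtering. The sample bytes, the Latin-1 encoding, the ";" separator and the is_pagination_row helper are all assumptions for illustration, not details confirmed by the question.

```python
from io import StringIO

import pandas as pd

# Stand-in for the raw bytes of the blob. With the Azure SDK these would come
# from the downloaded blob (for example downloaded_blob.readall()); a literal
# is used here so the sketch runs without a storage account.
raw = "store;sales\nA;1\n\xc3 = 37.364.304;;;;\nB;2\n".encode("latin-1")

# Assumption: the file is Latin-1. Latin-1 maps every byte to a character, so
# decoding never raises, but it yields wrong characters if the guess is wrong.
text = raw.decode("latin-1")


def is_pagination_row(line: str) -> bool:
    """Hypothetical criterion: adapt it to what the special rows really look like."""
    return " = " in line and line.rstrip().endswith(";;;;")


# Drop the pagination rows, then parse what is left.
filtered = "\n".join(
    line for line in text.splitlines() if not is_pagination_row(line)
)
df = pd.read_csv(StringIO(filtered), sep=";")
print(df)
```

If the real encoding is known (for example from the source system's export settings), use that instead of guessing.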
If all your special pagination rows start with the same single character, you can make use of the comment parameter:
comment : str, optional
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in 'a,b,c' being treated as the header.
df = pd.read_csv(StringIO(blob), comment='=')
or, depending on the first character of the pagination row:
df = pd.read_csv(StringIO(blob), comment=' ')
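One caveat follows from the documentation quoted above: the comment character cuts off the rest of a line wherever it appears, not only at the start. The sketch below uses made-up sample data and an assumed ";" separator to show both the intended effect and the side effect.

```python
from io import StringIO

import pandas as pd

# Made-up sample: the pagination row starts with "=", and no data value contains "=".
clean = "name;sales\nA;1\n= 37.364.304;;;;\nB;2\n"
df_clean = pd.read_csv(StringIO(clean), sep=";", comment="=")
print(df_clean)  # the pagination row is skipped entirely

# Same idea, but one data value contains "=": everything after it on that
# line is discarded, so the "sales" value of that row is lost.
risky = "name;sales\nA=B;1\n= 37.364.304;;;;\nC;2\n"
df_risky = pd.read_csv(StringIO(risky), sep=";", comment="=")
print(df_risky)
```

For the same reason, comment=' ' is only safe when no header or data value contains a space. If that cannot be guaranteed, filtering the lines before parsing is the more predictable option.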