Unable to read a CSV file using pandas, with extra quote char
I have the following CSV with the following entries:
"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"
The issue comes when I try to read 8 inches": I am unable to read the CSV using read_csv().
pandas.read_csv(io.BytesIO(obj['Body'].read()), sep="|",
                quoting=1,
                engine='c', error_bad_lines=False, warn_bad_lines=True,
                encoding="utf-8", converters=pandas_config['converters'],
                skipinitialspace=True, escapechar='\"')
Is there a way to handle the quote within the cell?
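One way to see what goes wrong is to parse the two lines with Python's standard csv module, which applies comparable quoting rules. This is only a diagnostic sketch, not the asker's code: the variable names are made up for illustration, and it assumes the pipe is the only separator, as in the call above.

```python
import csv
import io

# The two lines from the question, reproduced verbatim.
raw = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

# Split on "|" only, treating double quotes as the quote character.
rows = list(csv.reader(io.StringIO(raw), delimiter='|', skipinitialspace=True))

header_width = len(rows[0])
data_width = len(rows[1])
print(header_width, data_width)
```

With this input the header yields five fields and the data row fewer, because most fields in the data row are separated by commas rather than pipes. The doubled quote after 8 inches is therefore only part of the problem.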
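Start by passing appropriate parameters for this case: sep='[|,]' (the data uses both a pipe and a comma as separators, so they are given as a regex pattern), skipinitialspace=True, and engine='python' (the 'c' engine does not support regex separators).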
The above options alone let you call read_csv with no error, but the downside (for now) is that the double quotes remain.
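As a minimal sketch of this first step, the call below uses only the separator and engine options on the sample data from the question. The variable names are illustrative, and the regex separator `[|,]` is the one used in the full call further down.

```python
import io
import pandas as pd

# Sample data from the question.
txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

# A regex separator requires the python engine; with it, the double
# quotes are kept as ordinary characters instead of being interpreted.
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python')

print(df.shape)
print(df.iloc[0].tolist())
```

The row parses into five columns, but every header and every cell still carries its quote characters.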
To eliminate them, at least from the data rows, another trick is needed: define a converter (lambda) function:
cnv = lambda txt: txt.replace('"', '')
and apply it to all source columns. In your case there are 5 columns, so to keep the code concise you can use a dictionary comprehension:
{ i: cnv for i in range(5) }
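To check what such a converter does to each raw field, the sketch below splits the data row on the same `[|,]` pattern with `re.split`, which is assumed here to approximate how the fields reach the converter. The helper name `strip_quotes` is hypothetical, and its extra `.strip()` call is an addition not found in the answer, included in case whitespace around a field survives parsing.

```python
import re

# Data row from the question.
row = '"123" | "sometext", "this somedata", "8 inches"", "hello"'

def strip_quotes(value):
    """Remove every double quote, then trim surrounding whitespace."""
    return value.replace('"', '').strip()

# Split on either separator, then clean each raw field.
raw_fields = re.split(r'[|,]', row)
cleaned = [strip_quotes(field) for field in raw_fields]
print(cleaned)
```

Because the converter removes every double quote, the stray quote after 8 inches disappears along with the enclosing ones.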
So the whole code can be (txt holds the CSV text from the question):
import io
import pandas as pd
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
    engine='python', converters={ i: cnv for i in range(5) })
and the result is:
  "column1"  "column2"      "column3" "column4" "column5"
0       123   sometext  this somedata  8 inches     hello
But remember that all columns are now of string type, so you should convert the required columns to numbers. An alternative is to pass a second converter for the numeric columns, returning a number instead of a string.
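A sketch of the second-converter idea, assuming only the first column is numeric. The helper names `strip_quotes` and `to_int` are hypothetical, and the `.strip()` call is an extra safeguard against leftover whitespace.

```python
import io
import pandas as pd

txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

def strip_quotes(value):
    """Text converter: drop double quotes and surrounding whitespace."""
    return value.replace('"', '').strip()

def to_int(value):
    """Numeric converter: clean the text first, then return an int."""
    return int(strip_quotes(value))

# Columns 1-4 stay text; column 0 gets the numeric converter.
converters = {i: strip_quotes for i in range(1, 5)}
converters[0] = to_int

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python', converters=converters)
print(df.iloc[0].tolist())
```

The trade-off is that `to_int` raises a ValueError on any cell that is not a clean integer, so converting after loading (for example with `pd.to_numeric`) may be more forgiving for messy data.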
To get proper column names (without double quotes), you can pass additional parameters:
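One plausible pair of parameters, offered here as an assumption rather than as the answer's exact wording, is `skiprows=1` to skip the quoted header line and `names` to supply clean column names. Both are standard `read_csv` parameters; the `strip_quotes` helper is hypothetical.

```python
import io
import pandas as pd

txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

# Clean names replace the quoted header line, which is skipped.
names = ['column1', 'column2', 'column3', 'column4', 'column5']
strip_quotes = lambda value: value.replace('"', '').strip()

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python', skiprows=1, names=names,
                 converters={i: strip_quotes for i in range(5)})
print(list(df.columns))
```

Integer keys in `converters` refer to column positions, so they keep working after the names change.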
We can specify a somewhat complicated separator, read the data, and strip the extra quote chars:
import io
import pandas as pd

# Test data:
text = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''
ff = io.StringIO(text)
# The separator consumes the closing quote, the | or , and the opening quote
df = pd.read_csv(ff, sep=r'"\s*[|,]\s*"', engine="python")
# Make it tidy: strip the quotes left at the outer edges
df = df.transform(lambda s: s.str.strip('"'))
df.columns = ["column1"] + list(df.columns[1:-1]) + ["column5"]
The result:
  column1   column2        column3   column4 column5
0     123  sometext  this somedata  8 inches   hello
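The column rename above hard-codes the first and last names. A slightly more general sketch, assuming every header only carries stray quotes at its edges, strips them with `Index.str.strip`. It also converts `column1` with `pd.to_numeric`, since this approach returns all columns as strings.

```python
import io
import pandas as pd

text = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

df = pd.read_csv(io.StringIO(text), sep=r'"\s*[|,]\s*"', engine='python')

# Strip leftover edge quotes from the cells and from the header names.
df = df.transform(lambda s: s.str.strip('"'))
df.columns = df.columns.str.strip('"')

# Convert the numeric column explicitly.
df['column1'] = pd.to_numeric(df['column1'])
print(df.dtypes)
```

This avoids naming any column by hand, at the cost of assuming that no legitimate header starts or ends with a double quote.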