Python / Pandas: Column header problem in an Excel UTF-8 CSV file

Question

This is my first time trying UTF-8 in Pandas, so this may be a beginner's mistake.

I have a simple test sheet in Excel that I saved as a UTF-8 CSV. Viewing the file with "less" on Linux gives me this:

<U+FEFF>sample;chead
1;test

And "hexdump -C" gives this:

00000000  ef bb bf 73 61 6d 70 6c  65 3b 63 68 65 61 64 0d  |...sample;chead.|
00000010  0a 31 3b 74 65 73 74 0d  0a                       |.1;test..|

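So far, so good; I would assume this is a valid UTF-8 file.

I now want to read that file into a Pandas DataFrame and check whether the first column is named "sample" or "probe".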

#!/usr/bin/env python3

import pandas as pd

df = pd.read_csv("sample1.csv", encoding="utf-8", sep=None, engine="python")

cols = [x.lower() for x in df.columns.values]
print("Columns:", cols)
print("Columns[0]:", cols[0])
print("type Columns[0]:", type(cols[0]))

# I expect this not to print, but it does
if cols[0] not in ["sample", "probe"]:
    print("Ouch, cols[0] is not 'sample' or 'probe'???")

The output of the program above is:

Columns: ['\ufeffsample', 'chead']
Columns[0]: sample
type Columns[0]: <class 'str'>
Ouch, cols[0] is not 'sample' or 'probe'???

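From the first line of the output I do understand (somehow) that the value of cols[0] is '\ufeffsample', but since print() shows it as "sample", I don't understand why the "if" triggers.

What do I need to change to make the "if" statement work?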

Answer 1

<U+FEFF> is the byte order mark (BOM); see https://en.wikipedia.org/wiki/Byte_order_mark.
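
A minimal sketch of why the comparison in the question fails, assuming the column name is exactly the string shown in the first line of the output. The variable names `name` and `cleaned` are only illustrative. U+FEFF is a zero-width character, so print() appears to show plain "sample", but the character is still the first element of the string:

```python
# Illustrative: the column name as it appeared in the question's output.
name = "\ufeffsample"

print(repr(name))        # repr() makes the hidden character visible
print(len(name))         # 7, not 6: the BOM counts as one character
print(name == "sample")  # False, which is why the `if` fired

# Stripping the BOM by hand restores the expected comparison.
cleaned = name.lstrip("\ufeff")
print(cleaned == "sample")  # True
```

Stripping by hand works, but it has to be repeated wherever the header is used, so handling the BOM at decode time is usually the simpler route.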

To read such files into Pandas, you can set the encoding to utf-8-sig, as suggested in https://github.com/pandas-dev/pandas/issues/4793.
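
As a rough sketch of that suggestion, the snippet below recreates the file from the question's hexdump in a temporary directory and reads it with encoding="utf-8-sig". The temporary path is an assumption made so the example is self-contained, and the delimiter is passed explicitly as ";" here instead of relying on sep=None sniffing as the question did:

```python
import os
import tempfile

import pandas as pd

# Recreate the file from the question: UTF-8 BOM, ';' delimiter, CRLF line endings.
raw = b"\xef\xbb\xbfsample;chead\r\n1;test\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample1.csv")
    with open(path, "wb") as fh:
        fh.write(raw)

    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    df = pd.read_csv(path, encoding="utf-8-sig", sep=";")

cols = [x.lower() for x in df.columns]
print("Columns:", cols)

if cols[0] in ["sample", "probe"]:
    print("cols[0] matches as expected")
```

The utf-8-sig codec also reads UTF-8 files that have no BOM, so it should be safe to use when it is unknown whether a given CSV came from Excel.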

Answer 1
2 · Accepted · 2019-11-25 17:04:09