Removing unwanted characters from column names in Python

Question

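The name of my first column contains unwanted characters. They are not visible in Excel, Notepad, or Sublime.

I tried a trick from here to inspect the column names. Only then do the unwanted characters show up.

Is there a good solution for this?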

M1
Out[347]: 
        a1        b1        a2        b2
0  0.238066  0.976816  0.238066  0.976816
1  0.373340  1.469728  0.373340  1.469728
2  0.968814  1.248595  0.968814  1.248595
3  0.886586  3.451292  0.886586  3.451292
4  0.244301  2.206757  0.244301  2.206757
5  0.389688  2.893761  0.389688  2.893761
6  0.704340  2.621483  0.704340  2.621483
7  0.301238  1.678316  0.301238  1.678316
8  0.375927  0.574135  0.375927  0.574135
9  0.065749  2.259736  0.065749  2.259736

print(M1.columns.tolist())
['\ufeffa1', 'b1', 'a2', 'b2']

M1.columns = M1.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

print(M1.columns.tolist())
['\ufeffa1', 'b1', 'a2', 'b2']

Answer 1

Use 'Some String'.encode('ascii', 'ignore'), which returns bytes, and then decode the result to get a string back.

Code:

lst = ['\ufeffa1', 'b1', 'a2', 'b2']
print(lst)
newlst = [s.encode('ascii', 'ignore').decode("utf-8") for s in lst]

print(newlst)

Output:

['\ufeffa1', 'b1', 'a2', 'b2']
['a1', 'b1', 'a2', 'b2']
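
One caveat with the approach above: encoding to ASCII with 'ignore' drops every non-ASCII character, not only the BOM. If column names may legitimately contain non-ASCII text, a narrower option is to strip just a leading U+FEFF. The following is a minimal sketch; the helper name strip_bom and the sample names are made up for illustration.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark as a decoded character


def strip_bom(name: str) -> str:
    """Remove leading BOM characters, leaving everything else untouched."""
    return name.lstrip(BOM)


# Hypothetical column names: the first carries a BOM, the last is non-ASCII.
columns = ["\ufeffa1", "b1", "größe"]

bom_only = [strip_bom(c) for c in columns]
ascii_only = [c.encode("ascii", "ignore").decode("ascii") for c in columns]

print(bom_only)    # the BOM is gone, 'größe' is intact
print(ascii_only)  # the BOM is gone, but 'ö' and 'ß' are dropped as well
```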

Answer 2

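The character \ufeff (U+FEFF) is a byte order mark (BOM), a special character that tells the reader the byte order (little-endian vs. big-endian) of the encoding. A BOM is optional for UTF-8 and is usually not written. You are probably reading a UTF-8 file that has a BOM with the default encoding 'utf-8' (UTF-8 without BOM). Try 'utf-8-sig' (UTF-8 with BOM) instead.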

# Your file is probably encoded with 'utf-8-sig'.
# You are decoding it with encoding='utf-8' (the default).
# This is what happens:
'hi there'.encode('utf-8-sig').decode('utf-8')
Out[14]: '\ufeffhi there'

'hi there'.encode('utf-8-sig').decode('utf-8-sig')
Out[15]: 'hi there'
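
To confirm that the data really starts with a BOM before changing how it is read, the raw bytes can be compared against codecs.BOM_UTF8 from the standard library. This is only a sketch; has_utf8_bom is a hypothetical helper name, and the sample bytes stand in for the first bytes of a real file.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF at the very start of the data.
print(codecs.BOM_UTF8)


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 BOM."""
    return raw.startswith(codecs.BOM_UTF8)


with_bom = "a1,b1".encode("utf-8-sig")  # this codec prepends the BOM when encoding
without_bom = "a1,b1".encode("utf-8")   # plain UTF-8 writes no BOM

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```

For a file on disk, the same check can be applied to the first three bytes read in binary mode ('rb').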

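Edit: "So how should I process the file? Changing the encoding does not solve the problem."

You can open the file in Notepad++ and choose Format -> Convert to UTF-8. Or in Python: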

with open(input_path, encoding='utf-8-sig') as fin:
    text = fin.read()
with open(output_path, 'w', encoding='utf-8') as fout:
    fout.write(text)

This removes the BOM.
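
If rewriting the file is not desirable, the header can also be read with 'utf-8-sig' directly. The sketch below uses only the standard library csv module and a temporary file with made-up sample data; it contrasts the header obtained with 'utf-8' against the one obtained with 'utf-8-sig'.

```python
import csv
import os
import tempfile

# Create a small CSV file that starts with a UTF-8 BOM (sample data only).
with tempfile.NamedTemporaryFile("w", suffix=".csv", encoding="utf-8-sig",
                                 newline="", delete=False) as tmp:
    tmp.write("a1,b1\n0.238066,0.976816\n")
    path = tmp.name

try:
    # Decoding as plain UTF-8 leaves the BOM inside the first header name.
    with open(path, encoding="utf-8", newline="") as fin:
        header_utf8 = next(csv.reader(fin))

    # 'utf-8-sig' consumes the BOM while decoding.
    with open(path, encoding="utf-8-sig", newline="") as fin:
        header_sig = next(csv.reader(fin))
finally:
    os.remove(path)  # clean up the temporary file

print(header_utf8)
print(header_sig)
```

The codec name 'utf-8-sig' can generally be passed wherever a reader accepts an encoding argument, although whether that resolves a given case depends on how the file was actually produced.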

Answer 3

This is an encoding problem.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.random.randint(3,10,16).reshape(4,4), columns=['\ufeffa1', 'b1', 'a2', 'b2'])
   ...: df.head()
Out[3]: 
   a1  b1  a2  b2
0    7   7   9   6
1    5   9   6   7
2    4   8   4   3
3    6   9   8   7

In [4]: df.columns
Out[4]: Index(['a1', 'b1', 'a2', 'b2'], dtype='object')

In [5]: df.columns.to_list()
Out[5]: ['\ufeffa1', 'b1', 'a2', 'b2']

In [6]: df.columns = pd.Series(df.columns).apply(lambda x:x.encode('utf-8').decode('utf-8-sig'))

In [7]: df.columns
Out[7]: Index(['a1', 'b1', 'a2', 'b2'], dtype='object')

In [8]: df.columns.to_list()
Out[8]: ['a1', 'b1', 'a2', 'b2']
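
An alternative to the encode/decode round trip is a literal string replacement on the column index. This is a sketch under the assumption that U+FEFF should be removed wherever it occurs in a name, not only at the start; the frame below uses made-up sample data.

```python
import pandas as pd

# Sample frame whose first column name starts with a BOM (illustrative data).
df = pd.DataFrame([[7, 7, 9, 6], [5, 9, 6, 7]],
                  columns=["\ufeffa1", "b1", "a2", "b2"])

# Literal (non-regex) replacement of U+FEFF in every column name.
df.columns = df.columns.str.replace("\ufeff", "", regex=False)

print(df.columns.tolist())
```

Passing regex=False keeps the pattern a plain string, so no regular-expression escaping is involved.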

3 answers

Answer 1
0 2020-12-01 21:44:34

Answer 2
0 2020-12-01 21:50:32

Answer 3
0 2020-12-01 22:04:43
