How to read txt file in Pandas: Error tokenizing data
Question: I am reading a txt file with pandas.read_csv, but I get an error. The steps are as follows:
import pandas as pd
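Path of the txt file: './Data/fold2_l25431/test.txt'
Example contents of test.txt (the first three lines of the file; when reading it, I want to split it into three columns: '1', '2', '3' in one column, 'persona' in another, and the sentence after the colon in a third):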
First line: 1 persona: i am adorkable.
Second line: 2 persona: i am book dumb.
Third line: 3 persona: i am token evil teammate.
Code: pd.read_csv('./Data/fold2_l25431/test.txt')
or pd.read_csv('./Data/fold2_l25431/test.txt', sep=" ")
ParserError: Error tokenizing data. C error: Expected 8 fields in line 6, saw 9
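A quick way to see why the space separator fails is to count the fields each line splits into. The sketch below assumes the file looks like the three sample lines above (the real file presumably has longer lines, given the 8 and 9 in the error message); `sample` and `field_counts` are names made up for illustration.

```python
# Sample lines copied from the question; the real file is assumed to look similar.
sample = """1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate.
"""

# Count how many fields each line yields when split on a single space,
# which is roughly what sep=" " asks the parser to do.
field_counts = [len(line.split(" ")) for line in sample.splitlines()]
print(field_counts)  # [5, 6, 7]
```

Because the sentences have different numbers of words, the rows have different numbers of fields, and the parser complains once a row has more fields than it expects.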
Try this:
import pandas as pd
pd.read_csv('test.txt', header=None, on_bad_lines='skip')
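To check what this call does with the sample data, here is a sketch that uses io.StringIO in place of the file (an assumption made only so the example is self-contained). Note that on_bad_lines needs pandas 1.3 or later. Since the sample lines contain no commas, the default separator should put each whole line into a single column, so no error is raised but nothing is split either.

```python
import io
import pandas as pd

# Sample lines from the question, held in memory instead of test.txt.
sample = """1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate.
"""

# Same arguments as the suggestion above; the default separator is a comma.
df = pd.read_csv(io.StringIO(sample), header=None, on_bad_lines="skip")
print(df.shape)  # (3, 1): three rows, one unsplit column
```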
I cannot reproduce your error.
General suggestions: use
sep=":"
and the line terminator lineterminator="\n"
on_bad_lines="skip"
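and check your output.
The cause of the error is the space separator (sep=" "). Use something else (such as , or |) to separate the fields. The table updated with commas looks like this: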
1, persona:, i am adorkable.
2, persona:, i am book dumb.
3, persona:, i am token evil teammate.
4, persona:, i am never my fault.
5, persona:, i am honor before reason.
6, persona:, i am jerk with a heart of gold.
7, persona:, i am no social skills.
8, persona:, i am bad liar
You should then use this command: pd.read_csv('test1.txt', sep=",", header=None)
The output will be
0 1 2
1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate.
4 persona: i am never my fault.
5 persona: i am honor before reason.
6 persona: i am jerk with a heart of gold.
7 persona: i am no social skills.
8 persona: i am bad liar
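As an alternative to rewriting the file with commas, the sep=":" suggestion from earlier in the thread can be applied to the unmodified lines. The following is only a sketch under the assumption that no sentence contains a further colon; the column names `head`, `number`, `label`, and `sentence` are invented for illustration.

```python
import io
import pandas as pd

# Unmodified sample lines, held in memory instead of the real file.
text = """1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate.
"""

# Split each line at the colon: "1 persona" on the left, the sentence on the right.
df = pd.read_csv(io.StringIO(text), sep=":", header=None, names=["head", "sentence"])

# Split the left part once, on the first space, into the number and the label.
df[["number", "label"]] = df["head"].str.split(" ", n=1, expand=True)

# The sentence keeps the space that followed the colon, so strip it.
df["sentence"] = df["sentence"].str.strip()

df = df[["number", "label", "sentence"]]
print(df)
```

A line with an extra colon would have more fields than expected and would likely be rejected, or dropped if on_bad_lines="skip" is passed.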
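Your file is not a CSV, so you may have to write your own function to read it and split it into columns.
I use io only to simulate a file in memory, so that everyone can copy and test it, but you should use open()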
text = '''1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate.
4 persona: i am never my fault.
5 persona: i am honor before reason.
6 persona: i am jerk with a heart of gold.
7 persona: i am no social skills.
8 persona: i am bad liar'''
import io
#f = open('./Data/fold2_l25431/test.txt')
f = io.StringIO(text)
rows = []
for line in f:
    line = line.strip()  # remove '\n'
    first, rest = line.split(' ', 1)  # split only on first space
    second, third = rest.split(': ', 1)  # split only on first ": "
    rows.append([first, second, third])
print(rows)
Result:
[
['1', 'persona', 'i am adorkable.'],
['2', 'persona', 'i am book dumb.'],
['3', 'persona', 'i am token evil teammate.'],
['4', 'persona', 'i am never my fault.'],
['5', 'persona', 'i am honor before reason.'],
['6', 'persona', 'i am jerk with a heart of gold.'],
['7', 'persona', 'i am no social skills.'],
['8', 'persona', 'i am bad liar']
]
Later you can convert this list to a DataFrame
import pandas as pd
df = pd.DataFrame(rows, columns=['1', '2', '3'])
print(df)
Result:
   1        2                                3
0  1  persona                  i am adorkable.
1  2  persona                  i am book dumb.
2  3  persona        i am token evil teammate.
3  4  persona             i am never my fault.
4  5  persona        i am honor before reason.
5  6  persona  i am jerk with a heart of gold.
6  7  persona           i am no social skills.
7  8  persona                    i am bad liar
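For a real file, the loop above could be wrapped in a small function that uses open(), as the answer recommends. This is a sketch only: `read_persona_file`, the column names, and the UTF-8 encoding are assumptions, and the demo writes a temporary file so it does not depend on the path from the question.

```python
import os
import tempfile
import pandas as pd

def read_persona_file(path):
    """Parse lines like '1 persona: some text' into a three-column DataFrame."""
    rows = []
    # encoding="utf-8" is an assumption; adjust it to match the actual file.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            number, rest = line.split(" ", 1)      # split only on first space
            label, sentence = rest.split(": ", 1)  # split only on first ": "
            rows.append([number, label, sentence])
    return pd.DataFrame(rows, columns=["number", "label", "sentence"])

# Demo on a temporary file with two of the sample lines.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("1 persona: i am adorkable.\n2 persona: i am book dumb.\n")
    df = read_persona_file(demo_path)

print(df)
```

A line that lacks a space or a ": " would raise a ValueError in this version, which at least makes malformed input visible instead of silently dropping it.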