[英]Pandas read_table error
I am trying to read a tab delimited text file into a dataframe. 我正在尝试将制表符分隔的文本文件读入数据框。
This is the how the file looks in Excel: 这是文件在Excel中的外观:
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
Import into a df: 导入到df中:
df = p.read_table("E:/FileLoc/ThisIsAFile.txt", encoding = "iso-8859-1")
Now it doesn't see the first 3 columns as part of the column index (df[0] = Transaction Type) and all of the headers shift over to reflect this. 现在,它没有将前3列视为列索引的一部分(df [0] =事务类型),并且所有标头都转移了以反映这一点。
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
I am trying to manipulate the text file and then import it to a mysql database as an end result. 我试图操纵文本文件,然后将其导入到mysql数据库中作为最终结果。
You can use read_csv
with separator 2 and more whitespaces: 您可以将read_csv
与分隔符2和更多空格一起使用:
import pandas as pd
import io
temp=u"""CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1"""
#after testing replace io.StringIO(temp) to filename
df =pd.read_csv(io.StringIO(temp), sep=r'\s{2,}', engine='python', encoding = "iso-8859-1")
print (df)
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE \
0 5/13/2016 0:00 13867666 6892372 S
CUSTOMER_NUMBER CUSTOMER_NAME
0 2026 CUSTOMER 1
If separator is tabulator
, use sep='\\t'
. 如果分隔符是tabulator
,请使用sep='\\t'
。
EDIT: 编辑:
I test it with your data and it works: 我用您的数据对其进行了测试,并且可以正常工作:
import pandas as pd
df = pd.read_csv('test/AnonymizedData.txt', sep='\t')
print (df)
CUSTOMER_NUMBER CUSTOMER_NAME CUSTOMER_BRANCH_CODE CUSTOMER_BRANCH_NAME \
0 2026 CUSTOMER 1 83 SALES BRANCH 1
1 2359 CUSTOMER 2 76 SALES BRANCH 2
2 100662 CUSTOMER 3 28 SALES BRANCH 3
3 3245 CUSTOMER 4 84 SALES BRANCH 4
4 3179 CUSTOMER 5 28 SALES BRANCH 5
5 39881 CUSTOMER 6 67 SALES BRANCH 6
6 37020 CUSTOMER 7 58 SALES BRANCH 7
7 1239 CUSTOMER 8 50 SALES BRANCH 8
8 2379 CUSTOMER 9 76 SALES BRANCH 9
CUSTOMER_CITY CUSTOMER_STATE ... PRICING_PRODUCT_TYPE_CODE \
0 TOWN 1 CO ... 11
1 TOWN 2 OH ... 11
2 TOWN 3 ME ... 11
3 TOWN 4 IL ... 11
4 TOWN 5 NH ... 11
5 TOWN 6 TX ... 11
6 TOWN 7 NC ... 11
7 TOWN 8 NY ... 11
8 TOWN 9 OH ... 11
PRICING_PRODUCT_TYPE ORGANIZATION_ID ORGANIZATION_NAME PRODUCT_LINE_CODE \
0 DISPOSABLES 83 ORGANIZATIONNAME 891
1 DISPOSABLES 83 ORGANIZATIONNAME 891
2 DISPOSABLES 83 ORGANIZATIONNAME 891
3 DISPOSABLES 83 ORGANIZATIONNAME 891
4 DISPOSABLES 83 ORGANIZATIONNAME 891
5 DISPOSABLES 83 ORGANIZATIONNAME 891
6 DISPOSABLES 83 ORGANIZATIONNAME 891
7 DISPOSABLES 83 ORGANIZATIONNAME 891
8 DISPOSABLES 83 ORGANIZATIONNAME 891
PRODUCT_LINE ROBOTIC_FLAG Unnamed: 52 Unnamed: 53 Unnamed: 54
0 PRODUCTNAME N N NaN 3
1 PRODUCTNAME N N NaN 3
2 PRODUCTNAME N N NaN 2
3 PRODUCTNAME N N NaN 7
4 PRODUCTNAME N N NaN 1
5 PRODUCTNAME N N NaN 4
6 PRODUCTNAME N N NaN 3
7 PRODUCTNAME N N NaN 5
8 PRODUCTNAME N N NaN 3
[9 rows x 55 columns]
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.