
Pandas: read_csv indicating 'space-delimited'

I have the following file.txt (excerpt):

SICcode        Catcode        Category                              SICname        MultSIC
0111        A1500        Wheat, corn, soybeans and cash grain        Wheat        X
0112        A1600        Other commodities (incl rice, peanuts)      Rice        X
0115        A1500        Wheat, corn, soybeans and cash grain        Corn        X
0116        A1500        Wheat, corn, soybeans and cash grain        Soybeans        X
0119        A1500        Wheat, corn, soybeans and cash grain        Cash grains, NEC        X
0131        A1100        Cotton        Cotton        X
0132        A1300        Tobacco & Tobacco products                  Tobacco        X

I am having trouble reading it into a pandas DataFrame. I tried pd.read_csv with engine='python', sep='Tab', but it returned the file in a single column:

    SICcode Catcode Category SICname MultSIC
0   0111 A1500 Wheat, corn, soybeans...
1   0112 A1600 Other commodities (in...
2   0115 A1500 Wheat, corn, soybeans...
3   0116 A1500 Wheat, corn, soybeans...

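I then tried putting it into a Numbers file using 'tab' as the delimiter, but that also read the file as a single column. Does anyone have an idea about this?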

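If df = pd.read_csv('file.txt', sep='\t') returns a DataFrame with a single column, then file.txt evidently does not use tabs as separators. Your data may contain only spaces as separators. In that case, you could try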

df = pd.read_csv('file.txt', sep=r'\s{2,}', engine='python')

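This uses the regex pattern \s{2,} as the separator. The pattern matches two or more consecutive whitespace characters, so the single spaces inside values such as "Cash grains, NEC" are left alone.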

In [8]: df
Out[8]: 
   SICcode Catcode                                Category           SICname  \
0      111   A1500    Wheat, corn, soybeans and cash grain             Wheat   
1      112   A1600  Other commodities (incl rice, peanuts)              Rice   
2      115   A1500    Wheat, corn, soybeans and cash grain              Corn   
3      116   A1500    Wheat, corn, soybeans and cash grain          Soybeans   
4      119   A1500    Wheat, corn, soybeans and cash grain  Cash grains, NEC   
5      131   A1100                                  Cotton            Cotton   
6      132   A1300              Tobacco & Tobacco products           Tobacco   

  MultSIC  
0       X  
1       X  
2       X  
3       X  
4       X  
5       X  
6       X  
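As a self-contained sketch of the regex-separator approach, the snippet below inlines a few rows from the question instead of reading file.txt. The `SAMPLE` string and the `dtype` argument are assumptions added for illustration: the output above shows SICcode as 111 rather than 0111 because pandas parses the column as integers, and reading it as a string is one way to keep the leading zeros if they matter.

```python
import io

import pandas as pd

# A few rows from the question, inlined so no file on disk is needed.
SAMPLE = (
    "SICcode        Catcode        Category                              SICname        MultSIC\n"
    "0111        A1500        Wheat, corn, soybeans and cash grain        Wheat        X\n"
    "0119        A1500        Wheat, corn, soybeans and cash grain        Cash grains, NEC        X\n"
    "0131        A1100        Cotton        Cotton        X\n"
)

df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s{2,}",            # split on runs of two or more whitespace characters
    engine="python",          # regex separators are handled by the Python parser
    dtype={"SICcode": str},   # assumption: the leading zeros are meaningful
)

print(df.shape)
print(df["SICcode"].tolist())
```

This only works when every field boundary has at least two whitespace characters and no value contains two consecutive spaces.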

If this does not work, please post the output of print(repr(open('file.txt', 'rb').read(100))). That will show us an unambiguous representation of the first 100 bytes of file.txt.
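The byte-level check can be wrapped in a small helper, roughly as follows. `peek_bytes` is a hypothetical name, and the temporary file is only a stand-in for file.txt so that the sketch runs anywhere; with the real file you would pass its path directly.

```python
import os
import tempfile


def peek_bytes(path, n=100):
    """Return the first n raw bytes of the file at path."""
    with open(path, "rb") as f:
        return f.read(n)


# Hypothetical stand-in for file.txt, written with space-separated fields.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("SICcode        Catcode\n0111        A1500\n")
    path = tmp.name

head = peek_bytes(path)
os.remove(path)

print(repr(head))                                  # tabs would show up as \t
print("contains tabs:", b"\t" in head)
print("contains runs of spaces:", b"  " in head)
```

If the repr shows \t between fields, sep='\t' is the right choice; if it shows only runs of spaces, a whitespace regex is more likely to work.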

I think that if the data in the CSV is tab-separated, you can try adding sep="\t" to read_csv.

import pandas as pd

df = pd.read_csv('test/a.csv', sep="\t")
print(df)
   SICcode Catcode                                Category           SICname  \
0      111   A1500    Wheat, corn, soybeans and cash grain             Wheat   
1      112   A1600  Other commodities (incl rice, peanuts)              Rice   
2      115   A1500    Wheat, corn, soybeans and cash grain              Corn   
3      116   A1500    Wheat, corn, soybeans and cash grain          Soybeans   
4      119   A1500    Wheat, corn, soybeans and cash grain  Cash grains, NEC   
5      131   A1100                                  Cotton            Cotton   
6      132   A1300              Tobacco & Tobacco products           Tobacco   

  MultSIC  
0       X  
1       X  
2       X  
3       X  
4       X  
5       X  
6       X  
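For comparison, here is a minimal sketch of how sep="\t" behaves against sep='Tab' on data that really is tab-separated. `TAB_SAMPLE` is made-up illustrative data, not the asker's file. The string 'Tab' is taken as the literal three letters rather than the tab character, so it never matches and each line lands in one column.

```python
import io

import pandas as pd

# Illustrative tab-separated data (an assumption, not the original file).
TAB_SAMPLE = "SICcode\tCatcode\tSICname\n0111\tA1500\tWheat\n0112\tA1600\tRice\n"

# sep="\t" is the tab character itself, so the fields split as expected.
df_tab = pd.read_csv(io.StringIO(TAB_SAMPLE), sep="\t")

# sep="Tab" is the literal text "Tab", which does not occur in the data,
# so every line is read as a single field.
df_literal = pd.read_csv(io.StringIO(TAB_SAMPLE), sep="Tab", engine="python")

print(df_tab.shape)
print(df_literal.shape)
```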
