
.dat file import in pandas

I want to import this publicly available file using pandas, simply as a CSV (I have just renamed the file from .dat to .csv):

clinton = pd.read_csv("C:/Users/Mateusz/Downloads/ML_DS-20180523T193457Z-001/ML_DS/clinton1.csv")

However, in some cases the county name consists of two words rather than one, and in those cases the row in my data frame is shifted to the right. For example, the name "Hot Springs" ends up split across two columns. How can I fix this for the entire dataset at once?

There is no need to rename the .dat file to .csv. Instead, you can use a regex that matches two or more whitespace characters as the column separator.

Try using the sep parameter:

pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+', engine='python')

Output:

            0      1     2      3      4     5      6      7     8     9    10
0  Autauga, AL  30.92  31.7  57623  15768  15.2  10.74  51.41  60.4  2.36  457
1  Baldwin, AL  26.24  35.5  84935  16954  13.6   9.73  51.34  66.5  5.40  282
2  Barbour, AL  46.36  32.8  83656  15532  25.0   8.82  53.03  28.8  7.02   47
3   Blount, AL  32.92  34.5  61249  14820  15.0   9.67  51.15  62.4  2.36  185
4  Bullock, AL  67.67  31.7  75725  11120  33.0   7.08  50.76  17.6  2.91  141
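To see why this separator keeps multi-word names intact, here is a minimal sketch using only the standard library. The first sample line is taken from the output above; the second line, with its "Hot Springs, AR" name and numbers, is a made-up placeholder rather than real data from the file.

```python
import re

# First line: values from the output above.
# Second line: hypothetical two-word county name with placeholder numbers.
lines = [
    "Autauga, AL      30.92  31.7  57623",
    "Hot Springs, AR  10.00  20.0  30000",
]

# Split only on runs of two or more whitespace characters, so the single
# spaces inside "Autauga, AL" and "Hot Springs, AR" are left alone.
rows = [re.split(r"\s{2,}", line.strip()) for line in lines]

for row in rows:
    print(row)
```

Both lines split into the same number of fields, which is what stops the data frame from shifting to the right.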

If you want the state as a separate column, you can use sep=r'\s\s+|,', which splits columns on two or more whitespace characters OR a comma.

pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+|,', engine='python')

Output:

        0    1      2     3      4        5     6      7      8     9     10     11
0  Autauga   AL  30.92  31.7  57623  15768.0  15.2  10.74  51.41  60.4  2.36  457.0
1  Baldwin   AL  26.24  35.5  84935  16954.0  13.6   9.73  51.34  66.5  5.40  282.0
2  Barbour   AL  46.36  32.8  83656  15532.0  25.0   8.82  53.03  28.8  7.02   47.0
3   Blount   AL  32.92  34.5  61249  14820.0  15.0   9.67  51.15  62.4  2.36  185.0
4  Bullock   AL  67.67  31.7  75725  11120.0  33.0   7.08  50.76  17.6  2.91  141.0
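As an alternative to putting the comma into sep, the name column can be split after loading. The sketch below assumes every name has the form "County, ST"; the in-memory sample, the "Hot Springs, AR" row with its numbers, and the column names county and state are illustrative and not part of the original file.

```python
import io
import pandas as pd

# Small in-memory stand-in for the .dat file; the second row is hypothetical.
sample = (
    "Autauga, AL      30.92  31.7\n"
    "Hot Springs, AR  10.00  20.0\n"
)

# Read with the simple "two or more whitespace characters" separator.
df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s{2,}", engine="python")

# Split "County, ST" once at the comma (plus any following whitespace).
parts = df[0].str.split(r",\s*", n=1, expand=True)
df.insert(0, "county", parts[0])
df.insert(1, "state", parts[1])
df = df.drop(columns=0)

print(df)
```

Splitting afterwards keeps the separator regex simple and avoids a leading space in the state value.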

You can use a regular expression as the separator. In your specific case, all the delimiters are more than one space, whereas the spaces inside the names are single spaces.

import pandas as pd

clinton = pd.read_csv("clinton1.csv", sep=r'\s{2,}', header=None, engine='python')
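A quick way to check that no row has been shifted is to look at the shape and at missing values after loading. This is only a sketch on an in-memory sample: the first two rows reuse values from the output shown earlier, and the "Hot Springs, AR" row is a hypothetical placeholder.

```python
import io
import pandas as pd

# In-memory stand-in for the file; the last row is made up for illustration.
sample = (
    "Autauga, AL      30.92  31.7  57623\n"
    "Baldwin, AL      26.24  35.5  84935\n"
    "Hot Springs, AR  10.00  20.0  30000\n"
)

df = pd.read_csv(io.StringIO(sample), sep=r"\s{2,}", header=None, engine="python")

# A row with too few fields would be padded with NaN, so a non-zero count
# here hints at misaligned rows.
missing = int(df.isna().sum().sum())

print(df.shape)
print("missing values:", missing)
print(df.dtypes)
```

If the numeric columns come back as object dtype or contain NaN, some rows were probably split differently from the rest.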
