.dat file import in pandas
I want to import this publicly available file using pandas, simply as a CSV (I just renamed the file from .dat to .csv):
clinton = pd.read_csv("C:/Users/Mateusz/Downloads/ML_DS-20180523T193457Z-001/ML_DS/clinton1.csv")
However, in some cases the county name is composed of two words, not just one. In those cases the row is shifted to the right in my data frame. It looks like this (the name "hot springs" ends up in two columns): How can I fix this for the entire dataset at once?
There is no need to rename the .dat file to .csv. Instead, you can use a regex that matches two or more spaces as the column separator.
Try using the sep parameter:
import pandas as pd

# Raw string so the backslashes reach the regex engine unchanged.
pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+', engine='python')
Output:
             0      1     2      3      4     5      6      7     8     9   10
0  Autauga, AL  30.92  31.7  57623  15768  15.2  10.74  51.41  60.4  2.36  457
1  Baldwin, AL  26.24  35.5  84935  16954  13.6   9.73  51.34  66.5  5.40  282
2  Barbour, AL  46.36  32.8  83656  15532  25.0   8.82  53.03  28.8  7.02   47
3   Blount, AL  32.92  34.5  61249  14820  15.0   9.67  51.15  62.4  2.36  185
4  Bullock, AL  67.67  31.7  75725  11120  33.0   7.08  50.76  17.6  2.91  141
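To check the idea without downloading anything, here is a minimal sketch on an in-memory sample. The two rows below only imitate the file's layout; the numbers in the "Hot Spring" row are made up for illustration and the sample has fewer columns than the real file.

```python
import io

import pandas as pd

# Made-up sample in the same layout as the .dat file: columns are separated
# by two or more spaces, while a county name may contain single spaces.
sample = (
    "Autauga, AL        30.92  31.7  57623\n"
    "Hot Spring, AR     50.00  36.0  60000\n"
)

# A regex separator requires the python engine.
df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s\s+", engine="python")
print(df)
```

Because the single space inside "Hot Spring" never matches the two-or-more-spaces pattern, the name should stay in one column and the row is not shifted.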
If you want the state as a separate column, you can use sep=r'\s\s+|,' which splits columns on two or more spaces OR a comma.
# Split on two or more whitespace characters, or on a comma.
pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep=r'\s\s+|,', engine='python')
Output:
         0   1      2     3      4        5     6      7      8     9    10     11
0  Autauga  AL  30.92  31.7  57623  15768.0  15.2  10.74  51.41  60.4  2.36  457.0
1  Baldwin  AL  26.24  35.5  84935  16954.0  13.6   9.73  51.34  66.5  5.40  282.0
2  Barbour  AL  46.36  32.8  83656  15532.0  25.0   8.82  53.03  28.8  7.02   47.0
3   Blount  AL  32.92  34.5  61249  14820.0  15.0   9.67  51.15  62.4  2.36  185.0
4  Bullock  AL  67.67  31.7  75725  11120.0  33.0   7.08  50.76  17.6  2.91  141.0
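One thing to watch for: the comma in the file is followed by a single space, and as far as I can tell the python engine does not strip the split fields, so the state value may come back with a leading space. A possible variant, sketched here on made-up sample rows, lets the comma alternative also consume any whitespace after it. The pattern ,\s* is my own addition, not part of the answer above, and any other comma in a row would also be treated as a separator.

```python
import io

import pandas as pd

# Made-up rows that imitate the file's layout (values are illustrative only).
sample = (
    "Autauga, AL        30.92  31.7  57623\n"
    "Hot Spring, AR     50.00  36.0  60000\n"
)

# Separator from the answer: two or more spaces, or a bare comma.
plain = pd.read_csv(io.StringIO(sample), header=None,
                    sep=r"\s\s+|,", engine="python")

# Variant (assumption): the comma also swallows the whitespace that follows it.
trimmed = pd.read_csv(io.StringIO(sample), header=None,
                      sep=r"\s\s+|,\s*", engine="python")

# repr() makes any leading space in the state column visible.
print(repr(plain.iloc[0, 1]))
print(repr(trimmed.iloc[0, 1]))
```

If the first value does show a leading space, calling .str.strip() on that column after loading is another way to clean it up.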
You can use a regular expression as the separator. In your specific case, all the delimiters are runs of more than one space, whereas the spaces inside the names are single spaces.
import pandas as pd
# \s{2,} matches two or more whitespace characters; the raw string keeps the backslash intact.
clinton = pd.read_csv("clinton1.csv", sep=r'\s{2,}', header=None, engine='python')
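For what it's worth, \s{2,} and \s\s+ describe the same thing (two or more whitespace characters), so the answers here should split a line identically. A quick sketch with the standard re module, on a made-up line in the file's layout:

```python
import re

# Made-up line in the same layout as the file (values are illustrative only).
line = "Hot Spring, AR     50.00  36.0  60000"

fields_a = re.split(r"\s{2,}", line)  # "two or more" written with a quantifier
fields_b = re.split(r"\s\s+", line)   # "one, then one or more"

print(fields_a)
print(fields_a == fields_b)
```

Which spelling to use is a matter of taste. In either case, writing the pattern as a raw string (r'...') avoids Python's invalid escape sequence warning for \s.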