Pandas: How to open certain files
I am currently working on the data set from this link. But I am unable to read these files with Pandas. Has anyone tried working with such files?
I am trying the following:
import pandas as pd
df = pd.read_csv("m_4549381c276b46c6.0000")
But I get the following error:
Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Those files are parts of a saved SFrame. So you can load them this way:
import sframe
sf = sframe.SFrame('/path/to/dir/')
Demo: I've downloaded all the files from people_wiki.gl and put them under: D:/download/sframe/
In [6]: import sframe
In [7]: sf = sframe.SFrame('D:/download/sframe/')
In [8]: sf
Out[8]:
Columns:
URI str
name str
text str
Rows: 59071
Data:
+-------------------------------+---------------------+
| URI | name |
+-------------------------------+---------------------+
| <http://dbpedia.org/resour... | Digby Morrell |
| <http://dbpedia.org/resour... | Alfred J. Lewy |
| <http://dbpedia.org/resour... | Harpdog Brown |
| <http://dbpedia.org/resour... | Franz Rottensteiner |
| <http://dbpedia.org/resour... | G-Enka |
| <http://dbpedia.org/resour... | Sam Henderson |
| <http://dbpedia.org/resour... | Aaron LaCrate |
| <http://dbpedia.org/resour... | Trevor Ferguson |
| <http://dbpedia.org/resour... | Grant Nelson |
| <http://dbpedia.org/resour... | Cathy Caruth |
+-------------------------------+---------------------+
+-------------------------------+
| text |
+-------------------------------+
| digby morrell born 10 octo... |
| alfred j lewy aka sandy le... |
| harpdog brown is a singer ... |
| franz rottensteiner born i... |
| henry krvits born 30 decem... |
| sam henderson born october... |
| aaron lacrate is an americ... |
| trevor ferguson aka john f... |
| grant nelson born 27 april... |
| cathy caruth born 1955 is ... |
+-------------------------------+
[59071 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
Now you can convert it to a Pandas DataFrame if you need to:
In [17]: df = sf.to_dataframe()
In [18]: pd.options.display.max_colwidth = 40
In [19]: df.head()
Out[19]:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
In [20]: df.shape
Out[20]: (59071, 3)
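If you want to avoid the sframe dependency in later sessions, you can persist the converted DataFrame once as CSV and reopen it with plain pandas. A minimal sketch (using a tiny stand-in DataFrame in place of the real 59,071-row conversion; the file name is illustrative):

```python
import pandas as pd

# Stand-in for sf.to_dataframe(); the real frame has 59071 rows.
df = pd.DataFrame({
    "URI": ["<http://dbpedia.org/resource/Digby_Morrell>"],
    "name": ["Digby Morrell"],
    "text": ["digby morrell born 10 october 1979 ..."],
})

# Persist once; afterwards pd.read_csv works without sframe installed.
df.to_csv("people_wiki.csv", index=False)
reloaded = pd.read_csv("people_wiki.csv")
print(reloaded.shape)  # (1, 3)
```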
Just to clarify MaxU's answer: you are trying to read the file the wrong way. It is a raw part file, and its formatting is described by the other files in the same folder at that link. Pandas requires you to know the format of the file beforehand (i.e. delimiters, number of columns, etc.); it cannot be used as a magic wand to read an arbitrary file without knowing its layout.
The IPython notebook just outside the folder in your link shows exactly how to read that data. As MaxU correctly mentioned, the specific file in question is just one part of an SFrame, a data structure from the GraphLab framework. You are therefore trying to extract meaningful data from only a part of the whole, which cannot be done meaningfully.
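You can see this for yourself by peeking at the first bytes of such a part file instead of assuming it is text; anything that is not plain delimited text will trip `read_csv`'s tokenizer. A sketch using a stand-in binary file (the filename and byte content here are made up for illustration):

```python
# Write a stand-in binary "part" file (real SFrame part files are binary too).
with open("part.0000", "wb") as f:
    f.write(b"\x00\x01binary-payload\x00not,a,csv")

# Peek at the raw bytes instead of assuming the file is CSV text.
with open("part.0000", "rb") as f:
    head = f.read(16)

print(head)  # raw bytes with NUL characters -- clearly not delimited text
```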
You can, however, read the GraphLab file and convert it into a Pandas dataframe. For details see here.