How do I find the specific rows that throw an error when saving a GraphLab SFrame?
I have an SFrame that looks like this when printed with sf.print_rows(10):
+--------------+---------------+-------+-------------------------------+
| Dataset | Domain | Score | Sent1 |
+--------------+---------------+-------+-------------------------------+
| STS2012-gold | surprise.OnWN | 5.0 | render one language in ano... |
| STS2012-gold | surprise.OnWN | 3.25 | nations unified by shared ... |
| STS2012-gold | surprise.OnWN | 3.25 | convert into absorbable su... |
| STS2012-gold | surprise.OnWN | 4.0 | devote or adapt exclusivel... |
| STS2012-gold | surprise.OnWN | 3.25 | elevated wooden porch of a... |
| STS2012-gold | surprise.OnWN | 4.0 | either half of an archery bow |
| STS2012-gold | surprise.OnWN | 3.333 | a removable device that is... |
| STS2012-gold | surprise.OnWN | 4.75 | restrict or confine |
| STS2012-gold | surprise.OnWN | 0.5 | orient, be positioned |
| STS2012-gold | surprise.OnWN | 4.75 | Bring back to life, return... |
+--------------+---------------+-------+-------------------------------+
+-------------------------------+-------------------------------+
| Sent2 | Sent1_tokenized |
+-------------------------------+-------------------------------+
| restate (words) from one l... | [render, one, language, in... |
| a group of nations having ... | [nations, unified, by, sha... |
| soften or disintegrate by ... | [convert, into, absorbable... |
| devote oneself to a specia... | [devote, or, adapt, exclus... |
| a porch that resembles the... | [elevated, wooden, porch, ... |
| either of the two halves o... | [either, half, of, an, arc... |
| a supplementary part or ac... | [a, removable, device, tha... |
| place limits on (extent or... | [restrict, or, confine] |
| be opposite. | [orient,, be, positioned] |
| cause to become alive again. | [Bring, back, to, life,, r... |
+-------------------------------+-------------------------------+
+-------------------------------+-----------+-----------+----------------------+
| Sent2_tokenized | Sent1_len | Sent2_len | NGRAM-cosChar2ngrams |
+-------------------------------+-----------+-----------+----------------------+
| [restate, (words), from, o... | 6 | 8 | 0.82090085 |
| [a, group, of, nations, ha... | 8 | 7 | 0.53250804 |
| [soften, or, disintegrate,... | 11 | 11 | 0.43274232 |
| [devote, oneself, to, a, s... | 10 | 8 | 0.47759567 |
| [a, porch, that, resembles... | 6 | 9 | 0.38885689 |
| [either, of, the, two, hal... | 6 | 12 | 0.55555556 |
| [a, supplementary, part, o... | 10 | 5 | 0.44963552 |
| [place, limits, on, (exten... | 3 | 6 | 0.27124449 |
| [be, opposite.] | 3 | 2 | 0.43528575 |
| [cause, to, become, alive,... | 8 | 5 | 0.37047929 |
+-------------------------------+-----------+-----------+----------------------+
+----------------------+----------------------+----------------------+
| NGRAM-cosChar3ngrams | NGRAM-cosChar4ngrams | NGRAM-cosChar5ngrams |
+----------------------+----------------------+----------------------+
| 0.74964917 | 0.71490469 | 0.67925959 |
| 0.36701702 | 0.28941438 | 0.23635427 |
| 0.25899951 | 0.21053227 | 0.17058877 |
| 0.26248718 | 0.20518234 | 0.14285714 |
| 0.17107978 | 0.12049505 | 0.09320546 |
| 0.40754381 | 0.24715577 | 0.11547005 |
| 0.21997067 | 0.17554945 | 0.15450786 |
| 0.13284223 | 0.09284767 | 0.048795 |
| 0.31426968 | 0.17149859 | 0.09449112 |
| 0.0632772 | 0.03402069 | 0.0 |
+----------------------+----------------------+----------------------+
+---------------------+---------------------+---------------------+---------------------+
[19097 rows x 134 columns]
But when I try to save it to a CSV file with sf.save('trainers.csv', format='csv'), it throws an error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-23-f82bcb3fa197> in <module>()
----> 1 sts.save('trainers.csv', format='csv')
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in save(self, filename, format)
2924 self.export_json(url)
2925 else:
-> 2926 raise ValueError("Unsupported format: {}".format(format))
2927
2928 def export_csv(self, filename, delimiter=',', line_terminator='\n',
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero
I then printed an increasing number of rows, e.g. sf.print_rows(10), then sf.print_rows(100), and at sf.print_rows(129) it throws an error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-24-13550768dbcd> in <module>()
----> 1 sts.print_rows(129)
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in print_rows(self, num_rows, num_columns, max_column_width, max_row_width, output_file)
2226 max_row_width = max(max_row_width, max_column_width + 1)
2227
-> 2228 printed_sf = self._imagecols_to_stringcols(num_rows)
2229 row_of_tables = printed_sf.__get_pretty_tables__(wrap_text=False,
2230 max_rows_to_display=num_rows,
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in _imagecols_to_stringcols(self, num_rows)
2250 if t in image_column_names:
2251 printed_sf[t] = self[t].astype(str)
-> 2252 return printed_sf.head(num_rows)
2253
2254 def __str_impl__(self, num_rows=10, footer=True):
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in head(self, n)
2454 tail, print_rows
2455 """
-> 2456 return SFrame(_proxy=self.__proxy__.head(n))
2457
2458 def to_dataframe(self):
graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.head()
graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.head()
RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero
So I tried sf.fillna(c, 0) on every column:
for c in sts.column_names():
    sts = sts.fillna(c, 0)
and it throws another error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-26-e63cf73308dd> in <module>()
1 for c in sts.column_names():
----> 2 sts = sts.fillna(c, 0)
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in fillna(self, column, value)
5652 raise TypeError("Must give column name as a str")
5653 ret = self[self.column_names()]
-> 5654 ret[column] = ret[column].fillna(value)
5655 return ret
5656
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in fillna(self, value)
2439
2440 with cython_context():
-> 2441 return SArray(_proxy = self.__proxy__.fill_missing_values(value))
2442
2443 def topk_index(self, topk=10, reverse=False):
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Default value must be convertible to column type
How do I find the specific rows that throw an error when saving a GraphLab SFrame? And how do I fix such a row? Can I just replace the problematic columns in those rows with fillna()? I can't really throw the rows away with dropna(), since I need to keep track of the problematic rows.
But even with dropna(), I end up with an error:
sf.dropna()
sf.save('trainers.csv', format='csv')
How do I find the rows that raise these errors (such as the ZeroDivisionError below)? And how do I correct them or fill those columns with zeros?
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-28-f82bcb3fa197> in <module>()
----> 1 sts.save('trainers.csv', format='csv')
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in save(self, filename, format)
2924 self.export_json(url)
2925 else:
-> 2926 raise ValueError("Unsupported format: {}".format(format))
2927
2928 def export_csv(self, filename, delimiter=',', line_terminator='\n',
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero
Strangely, I cannot iterate through the SFrame either. When I try:
for i in sf:
    print i
it throws this error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-29-d2d0035d7bbe> in <module>()
----> 1 for i in sts:
2 print i
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in generator()
3712 def generator():
3713 elems_at_a_time = 262144
-> 3714 self.__proxy__.begin_iterator()
3715 ret = self.__proxy__.iterator_get_next(elems_at_a_time)
3716 column_names = self.column_names()
graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.begin_iterator()
graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.begin_iterator()
RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 10, in <lambda>
TypeError: 'NoneType' object is not iterable
It gets stranger: I can't retrieve a specific row with sf[num], but I can take a sub-SFrame and then retrieve that particular row num from it. So this:
print sf[25]
breaks and throws:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-62-6bc8898704c0> in <module>()
----> 1 print sts[25]
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in __getitem__(self, key)
3595 ub = min(sf_len, lb + block_size)
3596
-> 3597 val_list = list(SFrame(_proxy = self.__proxy__.copy_range(lb, 1, ub)))
3598 self._cache["getitem_cache"] = (lb, ub, val_list)
3599 return val_list[key - lb]
graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.copy_range()
graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.copy_range()
RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 10, in <lambda>
TypeError: 'NoneType' object is not iterable
But when I extract a subset first and then print, it works. The code below retrieves the same row 25 that threw the error above:
x = sf[:30]
print x[25]
Is there a reason why the previous code with sf[25] throws a NoneType error? sf[0] to sf[24] work, but sf[25] and anything above it do not.
Apparently, iterating over the SFrame in chunks like this and dumping each row out as a str sort of works:
fout = open('superbad.txt', 'w')
sflen = len(sf)
i = 0
while i < sflen:
    m = min(i + 100, sflen)  # end of the current chunk
    x = sf[i:m]
    for j in x:
        fout.write(str(j) + '\n\n')
    i = m  # advance to the next chunk
fout.close()
It's rather strange. Why does iterating in chunks and dumping to strings work?
The issue is the division-by-zero error raised by an apply that you ran somewhere before the save:
RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero
This happens because of lazy evaluation ( https://en.wikipedia.org/wiki/Lazy_evaluation ). As an example, suppose I start with an SFrame with a single column and apply a function to it:
import graphlab as gl
sf = gl.SFrame({'x': range(10000, -1, -1)})
sf['y'] = sf['x'].apply(lambda x: 1.0/x)  # stored lazily, nothing is computed yet
At this point, the last row of the SFrame contains a 1.0/0 value, which is an error, but it has not been evaluated yet. The save method triggers a materialization, i.e. an actual computation of all the rows in the data, and that is when the error happens. You can trigger this process yourself with a call to __materialize__:
sf.__materialize__()
which causes the following error to occur:
RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-55-5af90e232e2d>", line 1, in <lambda>
ZeroDivisionError: float division by zero
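The same deferred failure can be reproduced without GraphLab. The sketch below is only an analogy: a plain Python generator stands in for the lazily evaluated column, and the names values, lazy_column and computed are made up for illustration. Defining the pipeline raises nothing; the error only appears once the values are actually consumed, which is roughly what save() or __materialize__() does.

```python
# Plain-Python analogy (no GraphLab needed): a generator plays the role of
# the lazy column produced by apply().
values = list(range(5, -1, -1))        # 5, 4, 3, 2, 1, 0

# Defining the lazy pipeline raises nothing, even though the last value is 0.
lazy_column = (1.0 / x for x in values)

computed = []
error = None
try:
    # Consuming the generator is the analogue of save() or __materialize__().
    for y in lazy_column:
        computed.append(y)
except ZeroDivisionError as exc:
    error = exc

print(len(computed))   # number of rows computed before the failure
print(repr(error))
```

Note that the traceback points at the lambda (where the division lives), not at the line that triggered the computation, just like in the question.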
Lazy evaluation and query planning are really important as a performance optimization and are one of the reasons why the SFrame is fast and scalable. Unfortunately, harder error tracing is one of the annoyances that comes with it, but you do get used to it once you are aware of how it works.
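One way to make such errors easier to trace, and to keep the problematic rows instead of crashing on them, is to let the applied function catch its own exceptions and return None. This is a minimal sketch in plain Python; make_safe and safe_inverse are hypothetical helper names, and a list comprehension stands in for apply(). With an SFrame you would pass the wrapped function to apply() instead, assuming your GraphLab version treats a returned None as a missing value (worth verifying).

```python
def make_safe(func, errors=(ZeroDivisionError, TypeError)):
    """Wrap func so the listed errors yield None instead of aborting."""
    def safe(x):
        try:
            return func(x)
        except errors:
            return None
    return safe


# Hypothetical stand-in for the lambda that divides by zero in the question.
safe_inverse = make_safe(lambda x: 1.0 / x)

rows = [4, 2, 0, 1]
results = [safe_inverse(x) for x in rows]

# Rows where the original function failed are now marked with None,
# so they can be listed, inspected, or filled with a default later.
bad_rows = [i for i, r in enumerate(results) if r is None]

print(results)
print(bad_rows)
```

The two error types caught here are the ones that appear in the tracebacks in the question; catching only those keeps unrelated bugs visible.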
The head() function does not trigger a full materialization, so you can call it with increasing numbers of rows until you find the one that raises the error.
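Rather than trying row counts by hand, the search can be bisected. The sketch below is an assumption-laden illustration, not GraphLab code: first_failing_count and evaluate_prefix are hypothetical names, and a plain list with an eager computation stands in for the SFrame. With a real SFrame, evaluate_prefix(n) could be something like a call to head(n), assuming head(n) only evaluates the rows it needs and that a failure at n rows also fails for any larger n.

```python
def first_failing_count(evaluate_prefix, total_rows):
    """Return the smallest n for which evaluate_prefix(n) raises, or None.

    evaluate_prefix(n) must compute the first n rows and raise if any of
    them is bad; failures are assumed monotone (if n fails, so does n + 1).
    """
    def fails(n):
        try:
            evaluate_prefix(n)
            return False
        except Exception:
            return True

    if not fails(total_rows):
        return None
    lo, hi = 0, total_rows      # invariant: lo rows succeed, hi rows fail
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Stand-in for a lazy column: computing the first n rows evaluates 1.0 / x
# for each of them.
data = [3, 7, 1, 0, 5, 0, 2]

def evaluate_prefix(n):
    return [1.0 / x for x in data[:n]]

n = first_failing_count(evaluate_prefix, len(data))
print(n)        # prefix length at which the failure first appears
print(n - 1)    # zero-based index of the first offending row
```

This only finds the first bad row; after fixing or masking it, the search can be repeated to find the next one.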