How do I find specific rows that throw an error when saving in Graphlab SFrame?

I have an SFrame that looks like this with sf.print_rows(10):

+--------------+---------------+-------+-------------------------------+
|   Dataset    |     Domain    | Score |             Sent1             |
+--------------+---------------+-------+-------------------------------+
| STS2012-gold | surprise.OnWN |  5.0  | render one language in ano... |
| STS2012-gold | surprise.OnWN |  3.25 | nations unified by shared ... |
| STS2012-gold | surprise.OnWN |  3.25 | convert into absorbable su... |
| STS2012-gold | surprise.OnWN |  4.0  | devote or adapt exclusivel... |
| STS2012-gold | surprise.OnWN |  3.25 | elevated wooden porch of a... |
| STS2012-gold | surprise.OnWN |  4.0  | either half of an archery bow |
| STS2012-gold | surprise.OnWN | 3.333 | a removable device that is... |
| STS2012-gold | surprise.OnWN |  4.75 |      restrict or confine      |
| STS2012-gold | surprise.OnWN |  0.5  |     orient, be positioned     |
| STS2012-gold | surprise.OnWN |  4.75 | Bring back to life, return... |
+--------------+---------------+-------+-------------------------------+
+-------------------------------+-------------------------------+
|             Sent2             |        Sent1_tokenized        |
+-------------------------------+-------------------------------+
| restate (words) from one l... | [render, one, language, in... |
| a group of nations having ... | [nations, unified, by, sha... |
| soften or disintegrate by ... | [convert, into, absorbable... |
| devote oneself to a specia... | [devote, or, adapt, exclus... |
| a porch that resembles the... | [elevated, wooden, porch, ... |
| either of the two halves o... | [either, half, of, an, arc... |
| a supplementary part or ac... | [a, removable, device, tha... |
| place limits on (extent or... |    [restrict, or, confine]    |
|          be opposite.         |   [orient,, be, positioned]   |
|  cause to become alive again. | [Bring, back, to, life,, r... |
+-------------------------------+-------------------------------+
+-------------------------------+-----------+-----------+----------------------+
|        Sent2_tokenized        | Sent1_len | Sent2_len | NGRAM-cosChar2ngrams |
+-------------------------------+-----------+-----------+----------------------+
| [restate, (words), from, o... |     6     |     8     |      0.82090085      |
| [a, group, of, nations, ha... |     8     |     7     |      0.53250804      |
| [soften, or, disintegrate,... |     11    |     11    |      0.43274232      |
| [devote, oneself, to, a, s... |     10    |     8     |      0.47759567      |
| [a, porch, that, resembles... |     6     |     9     |      0.38885689      |
| [either, of, the, two, hal... |     6     |     12    |      0.55555556      |
| [a, supplementary, part, o... |     10    |     5     |      0.44963552      |
| [place, limits, on, (exten... |     3     |     6     |      0.27124449      |
|        [be, opposite.]        |     3     |     2     |      0.43528575      |
| [cause, to, become, alive,... |     8     |     5     |      0.37047929      |
+-------------------------------+-----------+-----------+----------------------+
+----------------------+----------------------+----------------------+
| NGRAM-cosChar3ngrams | NGRAM-cosChar4ngrams | NGRAM-cosChar5ngrams |
+----------------------+----------------------+----------------------+
|      0.74964917      |      0.71490469      |      0.67925959      |
|      0.36701702      |      0.28941438      |      0.23635427      |
|      0.25899951      |      0.21053227      |      0.17058877      |
|      0.26248718      |      0.20518234      |      0.14285714      |
|      0.17107978      |      0.12049505      |      0.09320546      |
|      0.40754381      |      0.24715577      |      0.11547005      |
|      0.21997067      |      0.17554945      |      0.15450786      |
|      0.13284223      |      0.09284767      |       0.048795       |
|      0.31426968      |      0.17149859      |      0.09449112      |
|      0.0632772       |      0.03402069      |         0.0          |
+----------------------+----------------------+----------------------+
+---------------------+---------------------+---------------------+---------------------+

[19097 rows x 134 columns]

But when I tried to save it to a CSV file with sf.save('trainers.csv', format='csv'), it threw an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-f82bcb3fa197> in <module>()
----> 1 sts.save('trainers.csv', format='csv')

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in save(self, filename, format)
   2924                 self.export_json(url)
   2925             else:
-> 2926                 raise ValueError("Unsupported format: {}".format(format))
   2927 
   2928     def export_csv(self, filename, delimiter=',', line_terminator='\n',

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero

I then printed increasing numbers of rows, e.g. sf.print_rows(10), sf.print_rows(100), and at sf.print_rows(129) it throws an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-13550768dbcd> in <module>()
----> 1 sts.print_rows(129)

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in print_rows(self, num_rows, num_columns, max_column_width, max_row_width, output_file)
   2226         max_row_width = max(max_row_width, max_column_width + 1)
   2227 
-> 2228         printed_sf = self._imagecols_to_stringcols(num_rows)
   2229         row_of_tables = printed_sf.__get_pretty_tables__(wrap_text=False,
   2230                                                          max_rows_to_display=num_rows,

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in _imagecols_to_stringcols(self, num_rows)
   2250                 if t in image_column_names:
   2251                     printed_sf[t] = self[t].astype(str)
-> 2252         return printed_sf.head(num_rows)
   2253 
   2254     def __str_impl__(self, num_rows=10, footer=True):

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in head(self, n)
   2454         tail, print_rows
   2455         """
-> 2456         return SFrame(_proxy=self.__proxy__.head(n))
   2457 
   2458     def to_dataframe(self):

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.head()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.head()

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero

So I tried sf.fillna(c, 0) on every column:

for c in sts.column_names():
    sts = sts.fillna(c, 0)

and it throws another error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-e63cf73308dd> in <module>()
      1 for c in sts.column_names():
----> 2     sts = sts.fillna(c, 0)

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in fillna(self, column, value)
   5652             raise TypeError("Must give column name as a str")
   5653         ret = self[self.column_names()]
-> 5654         ret[column] = ret[column].fillna(value)
   5655         return ret
   5656 

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in fillna(self, value)
   2439 
   2440         with cython_context():
-> 2441             return SArray(_proxy = self.__proxy__.fill_missing_values(value))
   2442 
   2443     def topk_index(self, topk=10, reverse=False):

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Default value must be convertible to column type

How do I find the specific rows that throw an error when saving in Graphlab SFrame?

And how do I fix such a row? Can I just replace the problematic columns in those rows with fillna()? I can't really throw the rows away with dropna(), since I need to keep track of the problematic rows.

But even with dropna(), saving still fails with the same ZeroDivisionError (full traceback below):

sf.dropna()
sf.save('trainers.csv', format='csv')

How do I find the rows that give me errors such as ZeroDivisionError? And how can I correct them or fill these columns with zeros?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-28-f82bcb3fa197> in <module>()
----> 1 sts.save('trainers.csv', format='csv')

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in save(self, filename, format)
   2924                 self.export_json(url)
   2925             else:
-> 2926                 raise ValueError("Unsupported format: {}".format(format))
   2927 
   2928     def export_csv(self, filename, delimiter=',', line_terminator='\n',

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero

Strangely, I cannot iterate through the SFrame either. When I try with:

for i in sf:
    print i

It throws this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-29-d2d0035d7bbe> in <module>()
----> 1 for i in sts:
      2     print i

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in generator()
   3712         def generator():
   3713             elems_at_a_time = 262144
-> 3714             self.__proxy__.begin_iterator()
   3715             ret = self.__proxy__.iterator_get_next(elems_at_a_time)
   3716             column_names = self.column_names()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.begin_iterator()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.begin_iterator()

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 10, in <lambda>
TypeError: 'NoneType' object is not iterable

It gets stranger: I can't retrieve a specific row with sf[num], but I can take a sub-SFrame and then retrieve that particular row from it. So this:

print sf[25]

breaks and throws:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-62-6bc8898704c0> in <module>()
----> 1 print sts[25]

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in __getitem__(self, key)
   3595             ub = min(sf_len, lb + block_size)
   3596 
-> 3597             val_list = list(SFrame(_proxy = self.__proxy__.copy_range(lb, 1, ub)))
   3598             self._cache["getitem_cache"] = (lb, ub, val_list)
   3599             return val_list[key - lb]

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.copy_range()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.copy_range()

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 10, in <lambda>
TypeError: 'NoneType' object is not iterable

But when I extract a subset first and then print, it works. The code below retrieves the row at index 25, which threw an error with the code above:

x =  sf[:30]
print x[25]

Is there a reason why the previous code with sf[25] throws a NoneType error? sf[0] to sf[24] work, but sf[25] and anything above it doesn't.

Apparently, iterating over the SFrame in chunks like this and dumping each row as a str sort of works:

fout = open('superbad.txt', 'w')
sflen = len(sf)
i = 0
while i < sflen:
    m = i+100 if i+100 < sflen else sflen
    x = sf[i:m]
    for j in x:
        fout.write(str(j) +'\n\n')
    i = m  # advance to the next chunk; without this the loop never ends
fout.close()

It's rather strange. Why does iterating in chunks and dumping to strings work?

The issue is the division-by-zero error raised by an apply that you ran somewhere before the save:

RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero
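
Because the exception is raised inside the lambda passed to apply, one option is to guard the computation itself so that problematic rows produce a fallback value instead of aborting. The following is only a minimal sketch in plain Python 3: safe_ratio and its default argument are hypothetical names, and since the formula on line 20 of the notebook cell is not shown in the question, a simple ratio stands in for it.

```python
def safe_ratio(numerator, denominator, default=0.0):
    """Return numerator / denominator, or `default` when it is undefined.

    Hypothetical helper: it stands in for whatever the failing lambda computes.
    """
    if numerator is None or denominator is None:
        return default  # missing input, e.g. an empty value upstream
    try:
        return float(numerator) / float(denominator)
    except ZeroDivisionError:
        return default


# Plain lists stand in for two SFrame columns here.
numerators = [3, 1, None, 5]
denominators = [6, 0, 2, 10]
ratios = [safe_ratio(n, d) for n, d in zip(numerators, denominators)]
print(ratios)  # [0.5, 0.0, 0.0, 0.5]
```

Passing default=None instead of 0.0 would keep the problematic rows distinguishable from genuine zeros, which may matter if they need to be tracked rather than silently filled.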

The error only surfaces when you save because of lazy evaluation ( https://en.wikipedia.org/wiki/Lazy_evaluation ). As an example, suppose I start with an SFrame with a single column and apply a reciprocal to it:

import graphlab as gl

sf = gl.SFrame({'x': range(10000, -1, -1)})
# Assign the result so the lazy column becomes part of sf
sf['y'] = sf['x'].apply(lambda x: 1.0/x)

At this point, the last row of the SFrame would hold the value of 1.0/0, which is an error, but it has not been evaluated yet. The save method triggers a materialization, i.e. an actual computation of all the rows in the data, which is when the error happens. You can trigger this process yourself with a call to __materialize__:

sf.__materialize__()

which causes the following error to occur:

RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-55-5af90e232e2d>", line 1, in <lambda>
ZeroDivisionError: float division by zero
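
The same deferred failure can be reproduced without GraphLab. As a rough analogy only (it says nothing about how SFrame is implemented), a Python 3 generator expression also postpones the division until its elements are consumed:

```python
# A generator expression is evaluated lazily, loosely like a pending apply().
values = range(3, -1, -1)            # 3, 2, 1, 0
lazy = (1.0 / x for x in values)     # nothing is computed yet, so no error

computed = []
error = None
try:
    for v in lazy:                   # consuming the generator forces evaluation
        computed.append(v)
except ZeroDivisionError as exc:
    error = exc                      # raised only when the last element is reached

print(len(computed), type(error).__name__)  # 3 ZeroDivisionError
```

Three values are produced before the zero is reached, so the failure appears far from the line that defined the computation, just as it does with save.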

Lazy evaluation and query planning are really important as a performance optimization and are one of the reasons why the SFrame is fast and scalable. Unfortunately, errors that are harder to trace are one of the annoyances that come with it, but you do get used to it once you are aware of how it works.

The head() function does not trigger a full materialization, so you can call it with as many rows as you want until you find the error.
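
Rather than trying row counts by hand, the search can be bisected. The sketch below is plain Python 3 with hypothetical names (first_failing_prefix, probe); a list stands in for the lazy column and the bad row is placed at an arbitrary index. It assumes that evaluating a prefix fails whenever the prefix contains a bad row. The tracebacks in the question suggest that sf.head(n) behaves this way, but treat that as an assumption.

```python
def first_failing_prefix(n_rows, probe):
    """Return the smallest n in 1..n_rows for which probe(n) raises, else None.

    Assumes failures are monotonic: if evaluating the first n rows fails,
    evaluating any longer prefix fails too.
    """
    def fails(n):
        try:
            probe(n)
            return False
        except Exception:
            return True

    if not fails(n_rows):
        return None
    lo, hi = 1, n_rows              # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid                # a bad row is within the first mid rows
        else:
            lo = mid + 1            # the first mid rows are fine
    return lo


# Stand-in for a lazy column: the row at index 128 divides by zero.
denominators = [1] * 128 + [0] + [1] * 71

def probe(n):
    # With an SFrame this might be sf.head(n) (an assumption, see above).
    return [1.0 / d for d in denominators[:n]]

n = first_failing_prefix(len(denominators), probe)
print(n - 1)  # 128, the 0-based index of the first bad row
```

This locates only the first bad row in about log2(n_rows) probes; finding later ones would mean repeating the search on the remaining rows.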
