
Parsing lines with an odd number of double quotation marks in graphlab.SFrame

I have lines like the following from this dataset (https://raw.githubusercontent.com/alvations/stasis/master/sts.csv):

Dataset Domain  Score   Sent1   Sent2
STS2012-gold    surprise.OnWN   5.000   render one language in another language restate (words) from one language into another language.
STS2012-gold    surprise.OnWN   3.250   nations unified by shared interests, history or institutions    a group of nations having common interests.
STS2012-gold    surprise.OnWN   3.250   convert into absorbable substances, (as if) with heat or chemical process   soften or disintegrate by means of chemical action, heat, or moisture.
STS2012-gold    surprise.OnWN   4.000   devote or adapt exclusively to an skill, study, or work devote oneself to a special area of work.
STS2012-gold    surprise.OnWN   3.250   elevated wooden porch of a house    a porch that resembles the deck on a ship.

I have read it into a graphlab.SFrame using the read_csv() function:

import graphlab
sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str])

Some lines could not be parsed. The progress output is as follows:

PROGRESS: Unable to parse line "STS2012-gold    MSRpar  3.800   "She was crying and scared,' said Isa Yasin, the owner of the store.    "She was crying and she was really scared," said Yasin."
PROGRESS: Unable to parse line "STS2012-gold    MSRpar  2.200   "And about eight to 10 seconds down, I hit. "I was in the water for about eight seconds."
PROGRESS: Unable to parse line "STS2012-gold    MSRpar  2.800   "It's a major victory for Maine, and it's a major victory for other states. The Maine program could be a model for other states."
PROGRESS: Unable to parse line "STS2012-gold    MSRpar  4.000   "Right from the beginning, we didn't want to see anyone take a cut in pay.  But Mr. Crosby told The Associated Press: "Right from the beginning, we didn't want to see anyone take a cut in pay."
PROGRESS: Unable to parse line "STS2014-gold    deft-forum  0.8 "Then the captain was gone. Then the captain came back."
PROGRESS: Unable to parse line "STS2014-gold    deft-forum  1.8 "Oh, you're such a good person! You're such a bad person!""
PROGRESS: Unable to parse line "STS2012-train   MSRpar  3.750   "We put a lot of effort and energy into improving our patching process, probably later than we should have and now we're just gaining incredible speed. "We've put a lot of effort and energy into improving our patching progress, p..."
PROGRESS: Unable to parse line "STS2012-train   MSRpar  4.000   "Tomorrow at the Mission Inn, I have the opportunity to congratulate the governor-elect of the great state of California.   "I have the opportunity to congratulate the governor-elect of the great state of California, and I'm lookin..."
PROGRESS: Unable to parse line "STS2012-train   MSRpar  3.600   "Unlike many early-stage Internet firms, Google is believed to be profitable.   The privately held Google is believed to be profitable."
PROGRESS: Unable to parse line "STS2012-train   MSRpar  4.000   "It was a final test before delivering the missile to the armed forces. State radio said it was the last test before the missile was delivered to the armed forces."
PROGRESS: 22 lines failed to parse correctly
PROGRESS: Finished parsing file /home/alvas/git/stasis/sts.csv
PROGRESS: Parsing completed. Parsed 19075 lines in 0.069578 secs.

Looking at these lines, there seems to be a problem whenever the Sent1 or Sent2 column contains an odd number of double quotation marks.

Using error_bad_lines=True to track down the problematic lines:

sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str],
                              error_bad_lines=True)

It throws the traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-a1ec53597af9> in <module>()
      1 sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', column_type_hints=[str, str, float, str, str],
----> 2                               error_bad_lines=True)

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in read_csv(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, **kwargs)
   1537                                   verbose=verbose,
   1538                                   store_errors=False,
-> 1539                                   **kwargs)[0]
   1540 
   1541 

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in _read_csv_impl(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, store_errors, **kwargs)
   1097                 glconnect.get_client().set_log_progress(False)
   1098             with cython_context():
-> 1099                 errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints)
   1100         except Exception as e:
   1101             if type(e) == RuntimeError and "CSV parsing cancelled" in e.message:

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Unable to parse line "STS2012-gold MSRpar  3.800   "She was crying and scared,' said Isa Yasin, the owner of the store.    "She was crying and she was really scared," said Yasin."
Set error_bad_lines=False to skip bad lines

Is there a way to resolve this problem when my lines contain an odd number of double quotes?

Is there a way to do it without cleaning the data (e.g. identifying the problematic lines and then cleaning/correcting them, while keeping another SFrame to track the corrections)?


As a sanity check, searching for \t in the raw CSV file shows that the problematic rows do contain a tab, but it disappears when graphlab parses them:

[screenshot]


As another sanity check, reading the file line by line and splitting each line on \t returns 5 columns for the whole file:

alvas@ubi:~/git/stasis$ head sts.csv 
Dataset Domain  Score   Sent1   Sent2
STS2012-gold    surprise.OnWN   5.000   render one language in another language restate (words) from one language into another language.
STS2012-gold    surprise.OnWN   3.250   nations unified by shared interests, history or institutions    a group of nations having common interests.
STS2012-gold    surprise.OnWN   3.250   convert into absorbable substances, (as if) with heat or chemical process   soften or disintegrate by means of chemical action, heat, or moisture.
STS2012-gold    surprise.OnWN   4.000   devote or adapt exclusively to an skill, study, or work devote oneself to a special area of work.
STS2012-gold    surprise.OnWN   3.250   elevated wooden porch of a house    a porch that resembles the deck on a ship.
STS2012-gold    surprise.OnWN   4.000   either half of an archery bow   either of the two halves of a bow from handle to tip.
STS2012-gold    surprise.OnWN   3.333   a removable device that is an accessory to larger object    a supplementary part or accessory.
STS2012-gold    surprise.OnWN   4.750   restrict or confine place limits on (extent or access).
STS2012-gold    surprise.OnWN   0.500   orient, be positioned   be opposite.
alvas@ubi:~/git/stasis$ python
Python 2.7.10 (default, Jun 30 2015, 15:30:23) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('sts.csv') as fin:
...     for line in fin:
...             print len(line.split('\t'))
...             break
... 
5

>>> with open('sts.csv') as fin:
...     for line in fin:
...             assert len(line.split('\t')) == 5
... 
>>> 

As a further sanity check that the number of columns is not the issue, the line that @papayawarrior gave as an example of a 4-column line was parsed correctly in my version of graphlab:

[screenshot]


I have manually checked the problematic lines, and they are:

STS2012-gold    MSRpar  3.800   "She was crying and scared,' said Isa Yasin, the owner of the store.    "She was crying and she was really scared," said Yasin.
STS2012-gold    MSRpar  2.200   "And about eight to 10 seconds down, I hit. "I was in the water for about eight seconds.
STS2012-gold    MSRpar  2.800   "It's a major victory for Maine, and it's a major victory for other states. The Maine program could be a model for other states.
STS2012-gold    MSRpar  4.000   "Right from the beginning, we didn't want to see anyone take a cut in pay.  But Mr. Crosby told The Associated Press: "Right from the beginning, we didn't want to see anyone take a cut in pay.
STS2012-train   MSRpar  3.750   "We put a lot of effort and energy into improving our patching process, probably later than we should have and now we're just gaining incredible speed. "We've put a lot of effort and energy into improving our patching progress, probably later than we should have.
STS2012-train   MSRpar  4.000   "Tomorrow at the Mission Inn, I have the opportunity to congratulate the governor-elect of the great state of California.   "I have the opportunity to congratulate the governor-elect of the great state of California, and I'm looking forward to it."
STS2012-train   MSRpar  3.600   "Unlike many early-stage Internet firms, Google is believed to be profitable.   The privately held Google is believed to be profitable.
STS2012-train   MSRpar  4.000   "It was a final test before delivering the missile to the armed forces. State radio said it was the last test before the missile was delivered to the armed forces.
STS2012-train   MSRpar  4.750   "The economy, nonetheless, has yet to exhibit sustainable growth.   But the economy hasn't shown signs of sustainable growth.
STS2014-gold    deft-forum  0.8 "Then the captain was gone. Then the captain came back.
STS2014-gold    deft-forum  1.8 "Oh, you're such a good person! You're such a bad person!"
STS2015-gold    answers-forums      "Normal, healthy (physically, nutritionally and mentally) individuals have little reason to worry about accidentally consuming too much water.  It's fine to skip arm specific exercises if you are already happy with how they are progressing without direct exercises.
STS2015-gold    answers-forums  1.40    "The grass family is one of the most widely distributed and abundant groups of plants on Earth. As noted on the Wiki page, grass seed was imported to the new world to improve pasturage for livestock.
STS2015-gold    answers-forums      "God is exactly this Substance underlying who supports, exist independently of, and persist through time changes in material nature.    I'd argue that matter and energy are substances in the category of empirical scientific knowledge.
STS2015-gold    belief      "watching the first fight i saw that manny pacquiao was getting tired, and i wasn't.    at the same time, an asian summit is being held in a tourist resort.
STS2015-gold    belief      "global warming doesn't mean every year will be warmer than the last.   doesn't matter, that will just be obama's fault as well.
STS2015-gold    belief      "the only reason i'm not as confident that there's something about the birth certificate... the conventional view is that the us and ussr fought it out in the body of vietnam.
STS2015-gold    belief      "im not playing these bullshit games... if not get the hell out of there.
STS2015-gold    belief      "that oil is already contaminating our shoreline.   what point are you trying to relay?
STS2015-gold    belief      "we cannot write history with laws. "she's not sitting here" he said.
STS2015-gold    belief      the protest is going well so far.   our request is the same.
STS2015-gold    belief      "for over 20 years, i have illustrated the absurd with absurdity, three hours a day, five days a week.  for the first 1-2 years he hated me going out with my friends.

Instead of finding these lines manually by repeatedly copying them out of the PROGRESS: ... verbose messages, is there a way to just dump these lines out when loading the file into a graphlab SFrame?
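For reference, the "odd number of double quotes" suspicion can be checked outside graphlab with a few lines of plain Python. This is only a rough sketch: the function name find_unbalanced_quote_lines is made up for illustration, and the per-field odd-count rule is a heuristic, not the rule the SFrame parser actually applies. The sample rows are taken from the lines above, with the tabs written explicitly.

```python
def find_unbalanced_quote_lines(lines, delimiter='\t', quote='"'):
    """Return (line_number, line) pairs where any field holds an odd number of quotes.

    Heuristic only: this is an assumption about what upsets the parser,
    not a reimplementation of graphlab's CSV rules.
    """
    suspicious = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.rstrip('\r\n')
        fields = stripped.split(delimiter)
        if any(field.count(quote) % 2 == 1 for field in fields):
            suspicious.append((line_number, stripped))
    return suspicious


# Three rows in the same shape as sts.csv (header, clean row, unbalanced row).
sample = [
    'Dataset\tDomain\tScore\tSent1\tSent2\n',
    'STS2012-gold\tsurprise.OnWN\t0.500\torient, be positioned\tbe opposite.\n',
    'STS2014-gold\tdeft-forum\t0.8\t"Then the captain was gone.\tThen the captain came back.\n',
]

flagged = find_unbalanced_quote_lines(sample)
for line_number, line in flagged:
    print('%d: %s' % (line_number, line))
```

On the real file, passing the open file object (for example inside with open('sts.csv') as fin:) in place of sample should work the same way, since the function only iterates over lines.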

UPDATED ANSWER

Apologies to @alvas, I didn't see that a full dataset was linked in the original post. There are indeed five columns in all of the rows, and the problem does seem to be mismatched quotes. The SFrame CSV parser gets confused if there aren't matching quotes within a column, so the short answer is to change the quote character to something you know doesn't appear in the dataset.

import graphlab
sts = graphlab.SFrame.read_csv('sts.csv', delimiter='\t',
                                column_type_hints=[str, str, float, str, str],
                                quote_char='\0')

This successfully reads all 19,097 rows for me.
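The mechanism can be seen in miniature with Python's standard csv module. Note the assumption: this is not the SFrame parser, and the two differ in detail, but both treat '"' as a quote character by default, so an opening quote that is never closed swallows the following tab. The row is one of the failing lines from the question, with tabs written explicitly.

```python
import csv

# One of the failing rows: Sent1 opens a quote that is never closed.
line = ('STS2014-gold\tdeft-forum\t0.8\t'
        '"Then the captain was gone.\tThen the captain came back.')

# Default settings: '"' starts a quoted field, so the tab between Sent1 and
# Sent2 is read as part of the field and the row comes out short.
quoted_row = next(csv.reader([line], delimiter='\t'))

# Quoting disabled: '"' is an ordinary character and every tab splits a field.
# This plays the same role as choosing a quote_char that never occurs in the data.
raw_row = next(csv.reader([line], delimiter='\t', quoting=csv.QUOTE_NONE))

print(len(quoted_row))  # fewer than 5 fields
print(len(raw_row))     # 5 fields
```

With quoting disabled, the stray quotation mark stays in the Sent1 text rather than being stripped, which is usually what you want for sentence data like this.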

As an aside, there is also an SFrame.read_csv_with_errors method that will read the "good" lines into an SFrame, and collect the "bad" lines, unparsed, in an SArray. That would let you keep track of the problematic lines in a programmatic way.
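The same good/bad split can be sketched in plain Python if you want to inspect the rejected lines without graphlab. This is an illustration of the idea only, not of read_csv_with_errors itself: the helper name split_good_and_bad is hypothetical, and it relies on the standard csv module in strict mode, whose notion of a "bad" line will not match the SFrame parser exactly.

```python
import csv


def split_good_and_bad(lines, expected_columns=5, delimiter='\t'):
    """Return (good_rows, bad_lines): parsed rows and the raw lines that failed.

    A line counts as bad if the strict csv parser rejects it or if it does
    not yield the expected number of columns.
    """
    good_rows, bad_lines = [], []
    for line in lines:
        stripped = line.rstrip('\r\n')
        if not stripped:
            continue  # ignore blank lines
        try:
            row = next(csv.reader([stripped], delimiter=delimiter, strict=True))
        except csv.Error:
            bad_lines.append(stripped)
            continue
        if len(row) == expected_columns:
            good_rows.append(row)
        else:
            bad_lines.append(stripped)
    return good_rows, bad_lines


sample = [
    'STS2012-gold\tsurprise.OnWN\t0.500\torient, be positioned\tbe opposite.\n',
    'STS2014-gold\tdeft-forum\t0.8\t"Then the captain was gone.\tThen the captain came back.\n',
]

good_rows, bad_lines = split_good_and_bad(sample)
print(len(good_rows))
for bad in bad_lines:
    print(bad)
```

Keeping the bad lines as raw strings means they can be corrected separately and re-parsed later, while the good rows are loaded straight away.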

ORIGINAL ANSWER

Your data rows don't appear to contain any quotes, so this is not the problem. The problem is that you have 5 columns in some rows of data (and the header), but only 4 columns in other rows of data.

The first row has four columns:

STS2012-gold    surprise.OnWN   5.000   render one language in another language restate (words) from one language into another language.

while the second row has five:

STS2012-gold    surprise.OnWN   3.250   nations unified by shared interests, history or institutions    a group of nations having common interests.

To work around this I would call the SFrame CSV parser twice, once for the four-column data and once for the five-column data. Because the first row has four columns, that one is a little more straightforward:

import graphlab
sts4 = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', header=True)

For the five-column data we have to skip the header and the first row, then rename the columns:

sts5 = graphlab.SFrame.read_csv('sts.csv', delimiter='\t', 
                                header=False, skiprows=2)
sts5 = sts5.rename({'X1': 'Dataset', 'X2': 'Domain', 'X3': 'Score',
                    'X4': 'Sent1', 'X5': 'Sent2'})

Then sts4 looks like

+--------------+---------------+-------+-------------------------------+
|   Dataset    |     Domain    | Score |             Sent1             |
+--------------+---------------+-------+-------------------------------+
| STS2012-gold | surprise.OnWN |  5.0  | render one language in ano... |
| STS2012-gold | surprise.OnWN |  4.0  | devote or adapt exclusivel... |
+--------------+---------------+-------+-------------------------------+

And sts5 is

+--------------+---------------+-------+-------------------------------+
|   Dataset    |     Domain    | Score |             Sent1             |
+--------------+---------------+-------+-------------------------------+
| STS2012-gold | surprise.OnWN |  3.25 | nations unified by shared ... |
| STS2012-gold | surprise.OnWN |  3.25 | convert into absorbable su... |
| STS2012-gold | surprise.OnWN |  3.25 | elevated wooden porch of a... |
+--------------+---------------+-------+-------------------------------+
+-------------------------------+
|             Sent2             |
+-------------------------------+
| a group of nations having ... |
| soften or disintegrate by ... |
| a porch that resembles the... |
+-------------------------------+
