UnicodeEncodeError while using spark-submit and BeautifulSoup

I keep getting a UnicodeEncodeError in Python 2.7 when I submit a job to Spark 1.6 (Hadoop 2.7) with spark-submit, but I do not get the error when I run the same code line by line in the pyspark shell.

I am using BeautifulSoup to find all the ref tags and get the text from them with this line of code:

[r.text for r in BeautifulSoup(line).findAll('ref') if r.text]

I have tried the following things:

  1. Setting export PYTHONIOENCODING="utf8"
  2. Using r.text.encode('ascii', 'ignore')
  3. Calling sys.setdefaultencoding('utf-8')

Could someone please tell me how to fix it? Below is the error stack:

  File "/hdata/dev/sdf1/hadoop/yarn/local/usercache/harshdee/appcache/application_1551632819863_0039/container_e36_1551632819863_0039_01_000004/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/home/harshdee/get_data.py", line 63, in get_as_row
    return Row(citations=get_citations(line.content), id=line.id, title=line.title)
  File "/home/harshdee/get_data.py", line 47, in get_citations
    refs_in_line = [r.text for r in BeautifulSoup(line).findAll('ref') if r.text]
  File "/usr/lib/python2.7/site-packages/bs4/__init__.py", line 274, in __init__
    self._check_markup_is_url(markup)
  File "/usr/lib/python2.7/site-packages/bs4/__init__.py", line 336, in _check_markup_is_url
    ' that document to Beautiful Soup.' % decoded_markup
  File "/usr/lib64/python2.7/warnings.py", line 29, in _show_warning
    file.write(formatwarning(message, category, filename, lineno, line))
  File "/usr/lib64/python2.7/warnings.py", line 38, in formatwarning
    s =  "%s:%s: %s: %s\n" % (filename, lineno, category.__name__, message)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 21-28: ordinal not in range(128)

I solved the problem on my own. I think the problem was in the representation of the string.

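For this, I used the repr function, which returns the printable representation of an object. In Python 2, calling it on a unicode string returns a plain str in which every non-ASCII character is written as an escape sequence, so the result can always be encoded as ASCII.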
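To make that concrete, here is a minimal sketch of what the representation looks like. The helper name ascii_repr and the sample markup are made up for illustration. On Python 2.7 the built-in repr already escapes non-ASCII characters in a unicode object; on Python 3 repr keeps them, so the sketch falls back to ascii there to show the same behaviour.

```python
# -*- coding: utf-8 -*-
import sys


def ascii_repr(text):
    """Return an ASCII-only representation of text (hypothetical helper).

    Python 2: repr() of a unicode object escapes non-ASCII characters.
    Python 3: repr() keeps them, so ascii() is used to get the same effect.
    """
    if sys.version_info[0] == 2:
        return repr(text)
    return ascii(text)


# Made-up markup containing one non-ASCII character.
sample = u"<ref>caf\u00e9</ref>"
escaped = ascii_repr(sample)

# Every character is now below code point 128, so the 'ascii' codec can encode it.
assert all(ord(ch) < 128 for ch in escaped)
print(escaped)
```

Note that the result includes the surrounding quotes and the escape sequences themselves, so any text extracted from it carries the escapes rather than the original characters.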

I applied this to the line variable, the one that is passed to BeautifulSoup.
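Looking at the trace again, the exception is raised inside warnings.formatwarning, which is reached from bs4's _check_markup_is_url, so the failure happens while a warning about the markup is being formatted. A possible alternative, which I have not verified on the cluster, is to suppress that warning so it is never formatted. The sketch below assumes the bs4 warning is a UserWarning (or a subclass of it); quiet_call and noisy_parse are hypothetical names, and noisy_parse is only a stand-in so the example runs without bs4 installed.

```python
import warnings


def quiet_call(func, *args, **kwargs):
    """Call func with UserWarning suppressed (hypothetical helper)."""
    with warnings.catch_warnings():
        # Ignored warnings are dropped before any message formatting happens.
        warnings.simplefilter("ignore", UserWarning)
        return func(*args, **kwargs)


def noisy_parse(markup):
    """Stand-in for a parser that warns about URL-like input."""
    warnings.warn(u'"%s" looks like a URL.' % markup, UserWarning)
    return markup.upper()


# Made-up input with a non-ASCII character; no warning is shown or formatted.
result = quiet_call(noisy_parse, u"http://example.org/caf\u00e9")
```

In get_citations this would presumably become quiet_call(BeautifulSoup, line). Because the call runs on the executors, the filter has to be applied inside the function that Spark ships to the workers, not only in the driver script.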
