BeautifulSoup - ignore Unicode errors and print only the text

I am doing a bit of web scraping, getting text from tables. Unicode errors keep popping up, and when I encode to UTF-8 I get b'...' prefixes and values such as b'\xc2\xa0' mixed in with my results. Is there a way to avoid encoding altogether and get only the text from the tables?

Traceback (most recent call last):
  File "c:\...\...\...", line 15, in <module>
    print(rows)
  File "C:\...\...\...\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2612' in position 3: character maps to <undefined>

When I use replace, I get a TypeError:

TypeError: a bytes-like object is required, not 'str' 

regardless of whether I use str() or not. I have also tried iterating through the cells and printing only the items that can be converted to strings, but the Unicode error pops up again.

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'

import re

import requests
from urllib.request import urlopen


from bs4 import BeautifulSoup

page = urlopen(test).read()
soup = BeautifulSoup(page, 'lxml')

tables = soup.findAll('table')

for table in tables:
  for row in table.findAll('tr'):
    for cel in row.findAll('td'):
      if str(cel.getText().encode('utf-8').strip()) != "b'\\xc2\\xa0'":
        print(str(cel.getText().encode('utf-8').strip()))
        #print(str(cel.getText().encode('utf-8').strip().replace('\\xc2\\xa0', '').replace('b\'', '')))

Actual results:

b'\xe2\x98\x92'
b'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

b'\xe2\x98\x90'
b'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

b'Washington'

b'\xc2\xa0'

b'91-1144442'

b'(State or other jurisdiction of\nincorporation or organization)'
...
...

Expected results:

'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

'Washington'

'91-1144442'

'(State or other jurisdiction of\nincorporation or organization)'

...
...

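BeautifulSoup already decodes the HTML and returns Unicode strings (str), so there is no need to encode anything. Calling .encode('utf-8') converts each string into bytes, which is why the output shows the b'...' prefix and escapes such as \xc2\xa0 (the UTF-8 encoding of a non-breaking space). It also explains the TypeError: bytes.replace() requires bytes arguments, not str.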
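
To make the str/bytes distinction concrete, here is a minimal sketch that uses only the two characters seen in the question (U+2612 from the traceback and the non-breaking space U+00A0). The variable names are illustrative and not part of the original code; it assumes Python 3.

```python
# The ballot-box character from the traceback and a non-breaking space.
cell_text = '\u2612'
nbsp = '\u00a0'

# Encoding turns a str into bytes; printing bytes shows the b'...' form.
encoded = cell_text.encode('utf-8')
print(type(encoded).__name__)       # bytes
print(repr(encoded))                # b'\xe2\x98\x92'
print(str(nbsp.encode('utf-8')))    # b'\xc2\xa0'

# bytes.replace() rejects str arguments, which is the TypeError in the question.
try:
    encoded.replace('\\xc2\\xa0', '')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Working on the str directly needs no encoding: strip() already treats
# the non-breaking space as whitespace.
print(repr(nbsp.strip()))           # ''
```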

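The following produced the required output. get_text(strip=True) removes surrounding whitespace, including non-breaking spaces, so empty cells come back as empty strings and can be skipped: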

from bs4 import BeautifulSoup
import requests

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
req = requests.get(test)
soup = BeautifulSoup(req.content, "html.parser")

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        for cel in row.find_all('td'):
            text = cel.get_text(strip=True)

            if text:   # skip empty cells
                print(text)
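
If the console itself uses a limited encoding such as cp1252 (as in the traceback in the question), print() may still raise UnicodeEncodeError for characters like '\u2612' even when the text is a proper str. One possible workaround, sketched here under the assumption of Python 3, is to replace unencodable characters before printing. The helper name console_safe is hypothetical and not part of BeautifulSoup or the original answer.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to the console's own encoding, then to UTF-8 if it is unknown.
    encoding = encoding or sys.stdout.encoding or 'utf-8'
    return text.encode(encoding, errors='replace').decode(encoding)


# cp1252 has no mapping for U+2612, so it is replaced; plain ASCII is untouched.
print(console_safe('\u2612 QUARTERLY REPORT', encoding='cp1252'))  # ? QUARTERLY REPORT
print(console_safe('Washington', encoding='cp1252'))               # Washington
```

In the loop above this would be used as print(console_safe(text)). On Python 3.7 or later, calling sys.stdout.reconfigure(encoding='utf-8') once at startup is another option when the terminal can display UTF-8.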

The rows of all the HTML tables could be stored as a list of lists as follows:

from bs4 import BeautifulSoup
import requests

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
req = requests.get(test)
soup = BeautifulSoup(req.content, "html.parser")

rows = []

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        values = [cel.get_text(strip=True) for cel in row.find_all('td')]
        rows.append(values)

print(rows)
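
Printing the whole list goes through the console encoding again, so one alternative is to serialize the rows to CSV and write them to a file with an explicit UTF-8 encoding. The sketch below assumes Python 3 and uses only the standard library. The function names, the sample data, and the file name in the usage note are hypothetical; sample_rows merely mimics the shape of the rows list built above.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of lists to CSV text, skipping rows with no non-empty cell."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for row in rows:
        if any(row):  # drops [] and rows made only of empty strings
            writer.writerow(row)
    return buffer.getvalue()


def save_csv(rows, path):
    """Write the rows to path as UTF-8, independent of the console encoding."""
    with open(path, 'w', encoding='utf-8', newline='') as handle:
        handle.write(rows_to_csv(rows))


# Hypothetical sample shaped like the scraped rows.
sample_rows = [['\u2612', 'QUARTERLY REPORT'], ['', ''], ['Washington', '', '91-1144442'], []]
csv_text = rows_to_csv(sample_rows)
```

With the scraped data, save_csv(rows, 'tables.csv') would write every non-empty row to a file that any UTF-8 aware editor or spreadsheet can open.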

Tested on:

Python 3.7.3, BS4 4.7.1
Python 2.7.16, BS4 4.7.1
