
How to properly extract UTF-8 text (Japanese characters) from a webpage with BeautifulSoup4

I downloaded webpages using wget. Now I am trying to extract some data I need from those pages. The problem is with the Japanese words contained in this data; the English words extract perfectly.

When I try to extract the Japanese words and use them in another app, they appear as gibberish. While testing different methods, I found one solution that fixed only half of the Japanese words.

What I tried: I passed

from_encoding="utf-8" 

which had no effect. I also tried multiple ways to extract the text from the HTML, such as

section.get_text(strip=True) 
section.text.strip()

and others. I also tried to encode the resulting text using URL encoding, which did not work, and I tried every snippet I could find on Stack Overflow.

One of the methods that strangely worked (but not completely) was saving the string in a dictionary, saving that dictionary into a JSON file, and then reading the JSON from ANOTHER script. Using the dictionary directly, as it is, would not work; I had to use JSON as a middleman between the two scripts. Strange. (Not all the words worked.)
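A plausible reason the JSON middleman half-worked (an assumption on my part, not something I have verified against the original scripts): `json.dump` escapes non-ASCII characters as `\uXXXX` by default (`ensure_ascii=True`), so the file on disk is pure ASCII and survives any locale-encoding mix-up between the two scripts. A minimal sketch of the described hand-off:

```python
import json

# First script: save the extracted word into a JSON file.
# With the default ensure_ascii=True, "亠" is written as "\u4ea0",
# so the bytes on disk are plain ASCII regardless of locale.
data = {"radical": "亠"}
with open("words.json", "w", encoding="utf-8") as f:
    json.dump(data, f)

# Second script: json.load decodes the \uXXXX escapes back
# into the original characters.
with open("words.json", "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored["radical"])  # 亠
```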

My question may look like a duplicate of another question, but that other question is about scraping from the internet, while what I am trying to do is extract from an offline source.

Here is a simple script demonstrating the main problem:

from bs4 import BeautifulSoup

# parse the locally saved page
page = BeautifulSoup(open("page1.html"), 'html.parser', from_encoding="utf-8")
word = page.find('span', {'class': "radical-icon"})
wordtxt = word.get_text(strip=True)

# then save the word to a file
with open("text.txt", "w", encoding="utf8") as text_file:
    text_file.write(wordtxt)
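The likely root cause (my assumption, based on how BeautifulSoup handles its input): `open("page1.html")` without an `encoding` argument decodes the file with the platform default encoding, and `from_encoding` is silently ignored when BeautifulSoup receives an already-decoded `str`. Feeding it raw bytes should make `from_encoding` take effect. A sketch, using inline markup as a stand-in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for the bytes of the wget-downloaded page
# (assumption: wget saved the page as UTF-8 bytes).
html_bytes = '<span class="radical-icon" lang="ja">亠</span>'.encode("utf-8")

# Give BeautifulSoup bytes, not str: from_encoding is only honoured
# for bytes input, so BeautifulSoup itself does the decoding.
page = BeautifulSoup(html_bytes, 'html.parser', from_encoding="utf-8")
word = page.find('span', {'class': "radical-icon"})
wordtxt = word.get_text(strip=True)
print(wordtxt)  # 亠
```

For a file on disk, the equivalent is `open("page1.html", "rb")`, or simply `open("page1.html", encoding="utf-8")` and dropping `from_encoding` entirely.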

When I open the file, I get gibberish characters.

Here is the part of the HTML that BeautifulSoup searches:

<span class="radical-icon" lang="ja">亠</span>

The expected result is to get the characters into the text file, or to save them properly in any other way.

Is there a better web scraper to use to properly get the UTF-8 text?

PS: sorry for my bad English.

I think I found an answer: just uninstall BeautifulSoup4. I don't need it.

Python has a built-in way to search for strings. I tried something like this:

import codecs
import re

# read the saved page, decoding it explicitly as UTF-8
with codecs.open("page1.html", 'r', 'utf-8') as myfile:
    for line in myfile:
        # find the line that contains the radical span
        if line.find('<span class="radical-icon"') > -1:
            # capture the text between the opening and closing tags
            result = re.search('<span class="radical-icon" lang="ja">(.*)</span>', line)
            s = result.group(1)

# write the captured text back out as UTF-8
with codecs.open("text.txt", 'w', 'utf-8') as textfile:
    textfile.write(s)

Which is an overcomplicated and non-Pythonic way of doing it, but what works works.
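For what it's worth, the same thing can be done without `codecs` at all: since Python 3, the built-in `open()` accepts an `encoding` argument directly. A sketch of the equivalent (the first `open` call just writes a stand-in for the wget-downloaded page so the example is self-contained):

```python
import re

# Stand-in for the downloaded page (assumption: UTF-8 bytes on disk).
with open("page1.html", "w", encoding="utf-8") as f:
    f.write('<span class="radical-icon" lang="ja">亠</span>')

# Built-in open() with an explicit encoding replaces codecs.open().
with open("page1.html", encoding="utf-8") as myfile:
    html = myfile.read()

# Non-greedy capture of the span contents; searching the whole
# document avoids the line-by-line loop.
match = re.search(r'<span class="radical-icon" lang="ja">(.*?)</span>', html)
s = match.group(1) if match else ""

with open("text.txt", "w", encoding="utf-8") as textfile:
    textfile.write(s)

print(s)  # 亠
```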
