
Beautiful Soup .find Chinese Characters

a_string = soup.find(text='围')

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

Is there any way I can make find work with Chinese characters when using Beautiful Soup?

I tried for a while, but it can't seem to detect the character. English characters work fine.

Source of the website I'm working with:

<!DOCTYPE html>
<html lang="zh-CN">
  <head>
        <meta charset="gbk" />

Try something like:

a_string = soup.find(text=re.compile(u'围', re.U))

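In other words, make sure the search string is Unicode. It might work without re.compile(), but at least make sure that your Chinese string is written as a u'' literal. (This matters on Python 2; on Python 3 every str literal is already Unicode, so the u prefix is accepted but redundant.)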
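As a minimal sketch of what "ensured to be unicode" means, assuming Python 3, the snippet below shows that a u'' literal is an ordinary str, while the same character encoded as GBK (the charset declared by the page in the question) is a bytes object of a different type. The variable names are made up for illustration.

```python
# Assumes Python 3, where every str literal is already Unicode.
needle = u'围'          # the u prefix is accepted but changes nothing
plain = '围'

print(needle == plain)             # both are the same Unicode str
print(type(needle).__name__)

# The same character as raw GBK bytes is a different type altogether,
# so it has to be decoded before it can be compared with a str.
gbk_bytes = needle.encode('gbk')
print(type(gbk_bytes).__name__)
print(gbk_bytes.decode('gbk') == needle)
```

The practical point is that the search string and the parsed text both need to be Unicode text, not raw encoded bytes.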

When you use find(text='something'), it searches for text nodes containing exactly the text 'something' and nothing else.

If you want to find text that contains a particular character, or that matches any other regular expression, you must use a regular expression pattern instead (as @Yannis said):

soup.find(text=re.compile(u'定'))
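The difference between an exact match and a pattern match can be sketched as follows. The markup is a made-up example, not the page from the question, and the sketch uses string=, which is the name Beautiful Soup 4.4.0 and later give to the text= argument shown above; on older versions, substitute text=.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = "<html><body><p>围</p><p>包围圈</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# A plain string matches only text nodes that are exactly that string.
exact = soup.find_all(string="围")

# A compiled pattern matches any text node the pattern is found in.
partial = soup.find_all(string=re.compile("围"))

print(exact)
print(partial)
```

Here the plain string finds only the first paragraph's text, while the pattern also finds the longer text node that merely contains the character.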

Note that the re.U flag is not required, as the pattern does not use special sequences like \s or \w, whose behavior the flag changes. If it did, then you might need to provide it. See the regular expression documentation for more.
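A small standard-library sketch of that point, assuming Python 3 (where str patterns are Unicode-aware by default, so re.ASCII is used here as the contrasting case): the flag makes no difference to a literal pattern, but it does change what \w matches. The sample text is made up.

```python
import re

text = "确定 ok"

# Literal pattern: the result is the same with or without re.U.
plain = re.compile("定").search(text)
flagged = re.compile("定", re.U).search(text)
print(plain.span() == flagged.span())

# \w is where the flags matter: Unicode-aware matching includes the
# Chinese characters, ASCII-only matching does not.
unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, re.ASCII)
print(unicode_words)
print(ascii_words)
```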
