Regex doesn't match text scraped with soup.get_text()

Question

The code below works up to this line:

print(salary_range)

This is the code:

url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
salaries = soup.find_all("h4", class_="tw-mb-0")
markup2 = str(salaries[0])
soup2 = BeautifulSoup(str(salaries[0]), 'html.parser')

salary_range = soup2.get_text().strip()
print(salary_range) #output: "10 000  – 16 000  PLN"

# error on line below
bottom_salary = re.search(r"^(\d{0,2} ?\d{3})", salary_range).group(1)
print(bottom_salary)

bottom_salary_int = re.sub(" ", "", bottom_salary)
print(bottom_salary_int)

Why doesn't re.search() find any match? I've tried many other regular expressions, but none of them ever matches, and I always get the error AttributeError: 'NoneType' object has no attribute 'group'.

Answer 1

The issue is that the character you think is a space is not actually a space; it is a non-breaking space (NBSP, U+00A0). The two look identical, but they are different characters. A non-breaking space renders like a regular space, but it prevents a line break at that position. See this small diagram:

10 000  – 16 000  PLN
  ^   ^^
 NBSP SP  ... same deal here
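
A quick way to check which characters a scraped string really contains is to print its ASCII-escaped form and look up the Unicode name of each whitespace character. The sketch below uses a hard-coded sample string (an assumption standing in for the scraped salary_range) so that it runs without a network request:

```python
import unicodedata

# Hypothetical sample standing in for the scraped salary_range;
# \xa0 is the non-breaking space and \u2013 is the en dash.
sample = "10\xa0000 \xa0\u2013 16\xa0000 \xa0PLN"

# ascii() escapes every non-ASCII character, so an NBSP shows up as \xa0.
escaped = ascii(sample)
print(escaped)

# Collect each distinct whitespace character with its code point and name.
whitespace_names = sorted(
    {(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in sample if ch.isspace()}
)
for code_point, name in whitespace_names:
    print(code_point, name)
```

If NO-BREAK SPACE appears in the list, a pattern containing a literal " " will not match at that position.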

To match the non-breaking space instead, specify it by its hex escape, \xa0 (code point 0xA0). Like this:

from bs4 import BeautifulSoup
import re
import requests

url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
salaries = soup.find_all("h4", class_="tw-mb-0")
markup2 = str(salaries[0])
soup2 = BeautifulSoup(str(salaries[0]), 'html.parser')

salary_range = soup2.get_text().strip()
print(salary_range)

bottom_salary = re.search(r"^(\d{0,2}\xa0?\d{3})", salary_range).group(1)
print(bottom_salary)

# bottom_salary contains an NBSP, so remove whitespace with \s (a plain " " would not match it)
bottom_salary_int = re.sub(r"\s", "", bottom_salary)
print(bottom_salary_int)

If you're trying to match a space but the regular space character doesn't match, it might be an NBSP instead. You can also check the page's HTML source: an NBSP is typically written there as the &nbsp; entity rather than as a regular space.
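
An alternative to changing the pattern is to normalize the text first, so that the original regex works unchanged. The following is a minimal sketch using only the standard library; the markup fragment is a made-up stand-in for the page's HTML, and html.unescape() is used here only to show that &nbsp; decodes to \xa0:

```python
import html
import re

# Hypothetical markup fragment; the real page's HTML may differ.
raw = "10&nbsp;000 &nbsp;&ndash; 16&nbsp;000 &nbsp;PLN"

# Decoding the entities yields \xa0 characters, not regular spaces.
decoded = html.unescape(raw)

# Replace every NBSP with a regular space before matching.
normalized = decoded.replace("\xa0", " ")

# The pattern from the question now matches.
match = re.search(r"^(\d{0,2} ?\d{3})", normalized)
bottom_salary = match.group(1) if match else None
print(bottom_salary)
```

Normalizing once up front keeps the later patterns simple, at the cost of losing the distinction between the two kinds of space.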

Answer 2

In addition, if you prefer not to name the character (the non-breaking space) explicitly, simply change the pattern to (\d+.\d+) or (\d+\s\d+) to get your group. The ^ anchor is also not needed in this specific case:

. matches any character (except a newline, by default).

re.search(r"(\d+.\d+)", e.get_text()).group(1)

\s matches any whitespace character, such as a space, tab or newline; for Python 3 str patterns this includes the non-breaking space.

re.search(r"(\d+\s\d+)", e.get_text()).group(1)

To remove the spacing, simply split() and join():

''.join(re.search(r"(\d+.\d+)", e.get_text()).group(1).split())
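
The difference between the three patterns can be checked offline. This is a small sketch on a hard-coded sample (an assumption standing in for e.get_text()); it also relies on str.split() with no arguments splitting on any Unicode whitespace, NBSP included:

```python
import re

# Hypothetical sample standing in for e.get_text().
text = "10\xa0000 \xa0\u2013 16\xa0000 \xa0PLN"

# A literal space never sits between two digit groups here, so no match.
literal_space = re.search(r"(\d+ \d+)", text)

# Both "." and "\s" accept the NBSP between the digit groups.
any_char = re.search(r"(\d+.\d+)", text)
whitespace = re.search(r"(\d+\s\d+)", text)

# split() with no arguments splits on the NBSP as well.
digits = "".join(whitespace.group(1).split())
print(literal_space, repr(any_char.group(1)), digits)
```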

Example

import requests, re
from bs4 import BeautifulSoup

url = "https://nofluffjobs.com/pl/job/c-c-junior-software-developer-vesoftx-wroclaw-n6bgtv5f"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.content, "html.parser")  # name the parser explicitly to avoid a warning
for e in soup.find_all("h4", class_="tw-mb-0"):
    print(''.join(re.search(r"(\d+.\d+)", e.get_text()).group(1).split()))

Output

10000
9000
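
Since re.search() returns None when nothing matches, calling .group() directly is what raises the AttributeError from the question. Below is a hedged sketch of a helper that extracts both ends of the range and guards against a missing match; parse_salary_range is a hypothetical name, and the assumed text layout (two amounts separated by an en dash or hyphen) is based only on the sample output shown in the question:

```python
import re
from typing import Optional, Tuple


def parse_salary_range(text: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) as ints, or None if no range is found."""
    # One amount: digits, optionally followed by whitespace-separated
    # groups of three digits (\s also covers the non-breaking space).
    amount = r"\d+(?:\s\d{3})*"
    # Two amounts separated by an en dash (\u2013) or a hyphen.
    match = re.search(rf"({amount})\s*[\u2013-]\s*({amount})", text)
    if match is None:
        return None
    # Strip all whitespace inside each amount before converting.
    low, high = ("".join(part.split()) for part in match.groups())
    return int(low), int(high)


# Hypothetical sample standing in for the scraped text.
result = parse_salary_range("10\xa0000 \xa0\u2013 16\xa0000 \xa0PLN")
missing = parse_salary_range("salary not disclosed")
print(result, missing)
```

Returning None instead of raising lets the caller decide how to handle listings without a salary range.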
