Splitting a string containing Unicode and a backslash in Python

Question

I'm having trouble extracting a floating-point number from a string. The string is the output of web scraping:

input = u'<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa3450.00pw</strong>'

I would like to get:

output: 3450.00

But I haven't found a way to do it. I tried to extract it with the split / replace functions:

word.split("\xa")
word.replace('<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa','')

I also tried the re library. It doesn't work well; it only extracts 450.00:

import re
num = re.compile(r'\d+.\d+')
num.findall(word)
[u'450.00']

So I still end up with the same problem with the \ character. Do you have any idea?

Answer 1

\xa3 is the pound sterling sign (£).

import unidecode

# unidecode transliterates non-ASCII characters to ASCII; the pound sign becomes "PS"
print(unidecode.unidecode(input))

<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">
PS450.00pw</strong>

To get the number out of it, a regular expression is the better tool:

import re
num = re.compile(r'\d+\.\d+')  # escape the dot so it matches a literal "."
num.findall(input)[0]

Result:

'450.00'
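
As a small sketch building on this (written for Python 3; the names html, PRICE_RE and extract_price are illustrative and not part of the answer), escaping the dot stops it from matching arbitrary characters, and float() turns the matched text into a number:

```python
import re

# Sample markup from the question; named `html` here (an illustrative
# choice) to avoid shadowing the built-in input().
html = '<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa3450.00pw</strong>'

# \. matches a literal dot; an unescaped . would match any character.
PRICE_RE = re.compile(r'\d+\.\d+')


def extract_price(text):
    """Return the first decimal number found in text as a float, or None."""
    match = PRICE_RE.search(text)
    return float(match.group()) if match else None


print(extract_price(html))  # 450.0
```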

Answer 2

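The problem is that \xa3 is the pound sign (£) in Unicode. With split('\xa') you are trying to split on only part of that character's escape sequence; \x must be followed by two hex digits. The output you actually want is 450.00, because \xa3450.00 renders as £450.00.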

str.split('\xa3')

should work in Python 3.

Note: input is the name of a built-in function. Unless you explicitly intend to rebind it, it is better not to use it as a variable name.
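
A minimal sketch of that split in Python 3 (the variable names are illustrative): '\xa3' is a single character, so splitting on it leaves the price at the start of the second piece, and a second split drops the pw suffix.

```python
html = '<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa3450.00pw</strong>'

# '\xa3' is one character (U+00A3, the pound sign), not four.
assert '\xa3' == '\u00a3' and len('\xa3') == 1

# Take everything after the pound sign, then everything before 'pw'.
after_pound = html.split('\xa3')[1]
price = after_pound.split('pw')[0]

print(price)  # 450.00
```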

Answer 3

Here is another possible solution:

import re

x = u'<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa3450.00pw</strong>'
print(re.findall(r'\d+\.\d*', x))  # \. matches a literal dot

Output: [u'450.00']
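
One possible variation, offered as an assumption rather than as part of the answer: anchoring the pattern on the pound sign and capturing only the digits avoids picking up unrelated numbers elsewhere in the markup. This sketch targets Python 3, and the name amount is illustrative.

```python
import re

x = '<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa3450.00pw</strong>'

# \xa3 is the pound sign; the group captures digits with an optional
# fractional part, so both "450" and "450.00" would match.
match = re.search(r'\xa3(\d+(?:\.\d+)?)', x)
amount = match.group(1) if match else None

print(amount)  # 450.00
```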

Answer 4

This code may help you:

from bs4 import BeautifulSoup

input = u'<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa3450.00pw</strong>'
soup = BeautifulSoup(input, 'html.parser')
# Find all <strong> tags
for n in soup.find_all('strong'):
    # The price is the tag's text content, not an attribute
    value = n.get_text()
    print(value)

I admit I haven't run it, but the output should be:

\r\n\xa3450.00pw

From there you can easily extract the value.

Answer 5

input.encode('utf-8').split('\xa3')[1].split('pw')[0]

>> 450.00

Voilà
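
The one-liner relies on Python 2, where encode returns a byte str. A rough Python 3 equivalent, given only as a sketch with illustrative names, has to split on bytes values. It works because £ encodes to the two bytes \xc2\xa3 in UTF-8, so the byte \xa3 sits directly before the digits.

```python
html = '<strong class="ad-price txt-xlarge txt-emphasis " itemprop="price">\r\n\xa3450.00pw</strong>'

# In Python 3, encode() returns bytes, so the separators must be bytes too.
encoded = html.encode('utf-8')

# The pound sign is two bytes in UTF-8; the second one is \xa3.
assert '\xa3'.encode('utf-8') == b'\xc2\xa3'

# Split on the \xa3 byte, keep what follows, cut at 'pw', decode back to str.
price = encoded.split(b'\xa3')[1].split(b'pw')[0].decode('ascii')

print(price)  # 450.00
```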

5 answers

Answer 1: score 1, 2016-06-23 10:03:21
Answer 2: score 0, 2016-06-23 09:54:39
Answer 3: score 0, 2016-06-23 10:12:59
Answer 4: score -1, 2016-06-23 10:03:22
Answer 5: score -1, accepted, 2016-06-23 10:07:50
