Splitting elements in a list

Question

I have an input string:

"[u'799,900', u'1,698,000', u'998,000', u'1,299,000', u'1,000,000', u'499,950', u'$995,000', u'$998,000', u'$2,000,000', u'988,000 ', u'979,000', u'1,285,000', u'$988,000', u'$579,000', u'$700,000', u'$1,100,000', u'$1,557,000', u'$999,888', u'$798,000', u'$998,000 ', u'1,050,000', u'$888,000', u'$559,888', u'$774,900', u'$795,000', u'$850,000']", "[u'3 bds', u'2 ba', u' 1,361 sqft', u'4 bds', u'3 ba', u'2,845 sqft', u'3 bds', u'3 ba', u'1,534 sqft', u'3 bds', u'2 ba ', u'1,762 sqft', u'5 bds', u'3 ba', u'2,398 sqft', u'2 bds', u'2 ba', u'956 sqft', u'4 bds', u'3 ba', u'1,840 sqft', u'3 bds', u'2 ba', u'1,212 sqft', u'3 bds', u'3 ba', u'1,878 sqft', u' 3 bds', u'2 ba', u'1,240 sqft', u'3 bds', u'2 ba', u'1,207 sqft', u'3 bds', u'3 ba', u'1,905 sqft', u'3 bds', u'3.5 ba', u'1,591 sqft', u'2 bds', u'2 ba', u'946 sqft', u'2 bds', u'2 ba', u'1,067 sqft', u'4 bds', u'3 ba', u'2,254 sqft', u'5 bds', u'4 ba', u'2,744 sqft ', u'3 bds', u'3 ba', u'1,291 sqft', u'4 bds', u'3 ba', u'1,480 sqft', u'3 bds', u'2 ba', u'1,513 sqft', u'4 bds', u'2 ba', u'1,846 sqft', u'9 bds', u'5 ba', u'3,336 sqft', u'2 bds', u' 2 ba', u'983 sqft', u'4 bds', u'3 ba', u'1,476 sqft', u'3 bds', u'3 ba', u'1,872 sqft', u'2 bds ', u'3 ba', u'1,459 sqft']"

From this, I need to extract the prices into a list of ints.

This is what I have tried so far:

import re

pattern_price = r'\[u\'\$.*?\]'
patternx = r"(.*?u.*?)(\d+\,\d+\,\d+|\d+\,\d+)"

with open(fpath, "r") as f:
    for line in f.readlines():
        lst = re.findall(pattern_price, line)      

    print len(lst) # I get list with 1 element?

    newlst = [x.split(patternx) for x in lst]
    print len(newlst) # I got 1 element again?

The answers to similar questions did not help me: Link1 Link2

Answer 1

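There are several problems in your code.

Create a variable that will hold the values

This is not related to your current problem, but it matters if you want to extend your solution:

You are iterating over the lines of the file, but you do not keep a variable that holds the values you have already collected.

Yes, you do create a list, but that list is recreated for every line in the for loop.

So you will only ever get the results for the last line of the file, and the other lines are effectively left unprocessed.

To fix this, create the variable before the loop and append to it inside the loop.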

with open(fpath, "r") as f:
    lst = []
    for line in f.readlines():
        lst.append( ... )
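
The difference between rebinding the list on every line and accumulating into one list can be seen in a minimal sketch. The sample lines and the simple `\d+` pattern are made up for illustration and only stand in for the file contents.

```python
import re

# Hypothetical sample lines standing in for the lines of the file.
lines = ["a 1 b 2", "c 3", "d 4 e 5 f 6"]

# Rebinding inside the loop: only the last line's matches survive.
for line in lines:
    overwritten = re.findall(r"\d+", line)

# Accumulating: create the list once, then extend it for every line.
collected = []
for line in lines:
    collected.extend(re.findall(r"\d+", line))

print(overwritten)  # ['4', '5', '6']
print(collected)    # ['1', '2', '3', '4', '5', '6']
```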

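Price pattern

You are capturing the entire part of the string that contains the prices. That is why you get only 1 match instead of 1 match per price.

To capture only the prices, you can use the following regular expression: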

'''
\$             # Make sure the numbers start with dollar sign (Has to be escaped as it is special sign)
(              # Start capturing group, this is what we want as output
    [\d,]      # Match either a digit (0-9) or a comma ','
    {7,11}     # Match the previous expression 7 to 11 times, getting '100,000' up to '100,000,000'
)              # End the capturing group
'''
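
As a sketch of how the commented pattern could be used directly, it can be compiled with `re.VERBOSE`, which ignores the whitespace and `#` comments inside the pattern. The names `price_pattern`, `sample` and `found` are illustrative, and the sample string assumes each dollar sign is immediately followed by the digits.

```python
import re

price_pattern = re.compile(r"""
    \$               # literal dollar sign (escaped, since $ is an anchor)
    (                # start of the capturing group
        [\d,]{7,11}  # 7 to 11 characters, each a digit or a comma
    )                # end of the capturing group
""", re.VERBOSE)

# Hypothetical sample input in the same shape as the question's data.
sample = "[u'$995,000', u'$2,000,000', u'$579,000']"
found = price_pattern.findall(sample)
print(found)  # ['995,000', '2,000,000', '579,000']
```

Because the pattern has one capturing group, `re.findall` returns only the captured part, without the dollar sign. Note that the character class accepts any mix of digits and commas, so the pattern relies on the input being well-formed.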

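Splitting a string by a regular expression

You are trying to split the string using a regular expression:

x.split(patternx)

This does not work, because str.split treats its argument as a literal delimiter string, not as a regular expression.

So it simply searches the string for that literal substring, finds no match, and returns the whole string unsplit (as a one-element list).

You should use re.split instead.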
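
The behaviour described above can be checked with a short sketch. The sample string and the `\d+` pattern are made up for illustration and are not taken from the question.

```python
import re

text = "a1b22c333d"

# str.split looks for the literal characters \d+, which do not occur
# in the text, so the result is a list holding the whole string.
literal_split = text.split(r"\d+")
print(literal_split)  # ['a1b22c333d']

# re.split interprets the same characters as a regular expression.
regex_split = re.split(r"\d+", text)
print(regex_split)    # ['a', 'b', 'c', 'd']
```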

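Extracting numbers from the strings

Finally, the remaining strings have to be converted to numbers and added to the list.

To do that, you have to iterate over the list returned by re.findall, strip the commas, and convert each string to int.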

prices = re.findall(pattern, line)
for price in prices:
    number = int(price.replace(',', ''))
    lst.append(number)

Final code

import re

pattern = r'\$([\d,]{7,11})'

with open(fpath, "r") as f:
    lst = []
    for line in f.readlines():
        prices = re.findall(pattern, line)
        for price in prices:
            number = int(price.replace(',', ''))
            lst.append(number)
    print lst
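
For experimenting without a file, the same logic can be wrapped in a function that accepts any iterable of lines. This is only a sketch: the function name `extract_prices` and the sample lines are hypothetical, and it again assumes the dollar sign is immediately followed by the digits.

```python
import re

PRICE_PATTERN = re.compile(r'\$([\d,]{7,11})')

def extract_prices(lines):
    """Return every dollar amount found in an iterable of lines as ints."""
    prices = []
    for line in lines:
        for price in PRICE_PATTERN.findall(line):
            # Drop the thousands separators before converting.
            prices.append(int(price.replace(',', '')))
    return prices

# Hypothetical sample lines; an open file object can be passed the same way.
sample = [
    "[u'$995,000', u'$2,000,000']",
    "[u'3 bds', u'2 ba', u'1,361 sqft']",
    "[u'$579,000']",
]
print(extract_prices(sample))  # [995000, 2000000, 579000]
```

Returning the list instead of printing it keeps the extraction separate from the file handling, and the parenthesized `print` call works in both Python 2 and Python 3.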

Solution 1
2 Accepted 2016-07-05 09:21:25
