
How to parse this Unicode string list

I want to parse this Unicode string list into a table:

[u'$760,507,625 (USA) (18 November 2010)', u'$760,505,847 (USA) (14 November 2010)', u'$760,462,559 (USA) (7 November 2010)', u'$760,410,799 (USA) (31 October 2010)',

The result I want is:

[[760507625, 11, 18, 2010, 'USA'],
 [760505847, 11, 14, 2010, 'USA'],
  ....
]

As you can see, the format is [money, month, day, year, country].

Could you point me to tools that can handle this problem? Am I making myself clear? Thanks so much!

The usual way I would handle this is with a regular expression to grab the fields out of each line, then a line or two per field to convert it to the desired format. This isn't foolproof (a line with a misspelled month, for example, matches the pattern but then raises a KeyError on the month lookup), but it's enough for most ad-hoc tasks.

#!/usr/bin/env python2.7

import re

data = [u'$760,507,625 (USA) (18 November 2010)',
        u'$760,505,847 (USA) (14 November 2010)',
        u'$760,462,559 (USA) (7 November 2010)',
        u'$760,410,799 (USA) (31 October 2010)',
        'blah']

RE_DATA = re.compile(r'^\$([0-9,]+) \(([A-Z]+)\) \(([0-9]+) ([A-Za-z]+) ([0-9]+)\)$')

MONTHS = {
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12
}

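for entry in data:
    match = RE_DATA.match(entry)
    if match is None:
        # Report lines that do not fit the expected layout and move on
        print 'Error! %r did not match pattern' % entry
        continue

    # Groups come back in pattern order: amount, country, day, month, year
    amount, country, day, month, year = match.groups()
    amount = int(amount.replace(',', ''))  # drop thousands separators
    country = str(country)                 # plain str, so it prints as 'USA' rather than u'USA'
    day = int(day)
    month = MONTHS[month]                  # raises KeyError on an unknown month name
    year = int(year)

    print [amount, month, day, year, country]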

Prints:

[760507625, 11, 18, 2010, 'USA']
[760505847, 11, 14, 2010, 'USA']
[760462559, 11, 7, 2010, 'USA']
[760410799, 10, 31, 2010, 'USA']
Error! 'blah' did not match pattern
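
If the misspelled-month crash mentioned above matters, one possible variation is to let datetime.strptime parse the date and treat any failure as a skipped line. This is only a sketch, not part of the original answer: the names ENTRY_RE and parse_entry and the sample list are made up for illustration, and it assumes English month names (the %B directive follows the current locale, which is English unless the program changes it). It is written so that it should run under Python 2.7 or Python 3.

```python
import re
from datetime import datetime

# Same layout as RE_DATA, but the whole date is captured as one group
# and handed to strptime instead of a hand-written month table.
# (ENTRY_RE and parse_entry are hypothetical names for this sketch.)
ENTRY_RE = re.compile(
    r'^\$([0-9,]+) \(([A-Z]+)\) \(([0-9]{1,2} [A-Za-z]+ [0-9]{4})\)$')


def parse_entry(entry):
    """Return [amount, month, day, year, country], or None if the entry is unusable."""
    match = ENTRY_RE.match(entry)
    if match is None:
        return None
    amount, country, date_text = match.groups()
    try:
        # %d = day of month, %B = full month name, %Y = four-digit year
        date = datetime.strptime(date_text, '%d %B %Y')
    except ValueError:
        # Misspelled month, or an impossible date such as 31 November
        return None
    return [int(amount.replace(',', '')), date.month, date.day, date.year,
            str(country)]


# Illustrative input: one good line, one misspelled month, one junk line
sample = [u'$760,507,625 (USA) (18 November 2010)',
          u'$1 (USA) (7 Novembr 2010)',
          u'blah']

# Keep only the rows that parsed cleanly
table = [row for row in map(parse_entry, sample) if row is not None]
print(table)
```

Returning None instead of raising keeps the loop simple for ad-hoc work; if you need to know which lines were dropped, collect the failing entries in a second list instead of discarding them.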
