
Python - Regex “Machine Learning”

I have thousands of lines of text in which I need to find representations of money, e.g.:

Lorem ipsum dolor sit amet, 100.000,00 USD sadipscing elitr, sed diam nonumy eirmod 
GBP 400 ut labore et dolore magna aliquyam erat, sed diam voluptua. At USD 20 eos et 
accusam et justo duo dolores et 100,000.00 USD  ea rebum. Stet 3,-- USD gubergren, no 

The Python script should return the amount converted to USD (e.g. 100000 USD, 400 GBP -> USD, etc.).

What I have done so far is manually create regular expressions for number-currency combinations to retrieve the value, then compare the currency against a database and calculate the exchange.

However, this is neither efficient nor future-proof (e.g. if another currency is added). So I'm wondering whether there is an efficient machine learning algorithm that I could "train" with some examples, so that it then tries to find such "value - currency" combinations?

Can a human even learn whether an acronym is a currency? If a new currency pops up, how is it distinguishable from any other arbitrary acronym? Say you come across something like "1000 CPU": how could you tell whether that is (or isn't) a currency if you don't know what a CPU is?

You could use natural language processing to look at the context around the number in question, but it's going to take more processing and you'll never know for sure.

My point is: for this problem machine learning is overkill, if it is even applicable.

Why do something the hard way when it is substantially easier and more accurate to do it another way?

Your problem is not well defined, but there is no need for machine learning. The set of possible currencies is finite and small, and the set of currency representations cannot be so complicated that it is not expressible as a regular expression. You simply are not employing the full power of regular expressions.

For example, to match multiple currencies, use:

    currency = r"(USD|GBP|...)"

You can then express the number part of the representation:

    numbers = r"([0-9]+[0-9\.,]*)"

Compile the regular expression:

    import re
    matcher = re.compile(numbers + r"\s*" + currency)

You can create a second matcher that matches the currencies first. You might be able to do something clever with optional capture groups and such, but I would recommend a simple second matcher if performance isn't a big issue.

    matcher2 = re.compile(currency+r"[\s]*"+numbers)
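If you do want a single pattern instead of two matchers, one way to sketch it is an alternation with named groups. The three currency codes, the group names and the `find_pairs` helper below are assumptions made for illustration, not part of the answer above:

```python
import re

# Assumed, illustrative patterns; the currency list stands in for the full set.
CURRENCY = r"(?:USD|GBP|EUR)"
NUMBER = r"[0-9][0-9.,]*"

# One alternation covers both orders: "100 USD" and "USD 100".
PAIR = re.compile(
    rf"(?P<num_first>{NUMBER})\s*(?P<cur_after>{CURRENCY})\b"
    rf"|\b(?P<cur_first>{CURRENCY})\s*(?P<num_after>{NUMBER})"
)

def find_pairs(text):
    """Return (amount_text, currency) tuples in order of appearance."""
    pairs = []
    for m in PAIR.finditer(text):
        if m.group("num_first") is not None:
            # The number came first, e.g. "100.000,00 USD".
            pairs.append((m.group("num_first"), m.group("cur_after")))
        else:
            # The currency came first, e.g. "GBP 400".
            pairs.append((m.group("num_after"), m.group("cur_first")))
    return pairs

pairs = find_pairs("GBP 400 ut labore, 100.000,00 USD sed At USD 20 eos")
```

Named groups avoid having to count group numbers, at the cost of a longer pattern. The amounts are still raw text at this point and need to be parsed separately.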

Note that the 'currency' regex need not be created manually. Once you have a match, you can access the appropriate group number (2 for matcher, 1 for matcher2) to get the matched currency. For example, for a match m from matcher2:

    curren = m.group(1)
    amount = m.group(2)

This is possible since the entire 'currency' regex is treated as a single group.
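As a rough sketch of generating the 'currency' regex rather than writing it by hand, you could join a list of codes into one capturing group. The list `CURRENCY_CODES` and the helper `build_currency_pattern` are hypothetical names; in practice the codes could come from the same database that holds the exchange rates:

```python
import re

# Hypothetical list of codes; assumed to be loaded from your own data source.
CURRENCY_CODES = ["USD", "GBP", "EUR", "JPY"]

def build_currency_pattern(codes):
    """Return one capturing group that matches any of the given codes."""
    # Longest codes first, so a code is never shadowed by a shorter prefix.
    ordered = sorted(codes, key=len, reverse=True)
    return "(" + "|".join(re.escape(code) for code in ordered) + ")"

currency = build_currency_pattern(CURRENCY_CODES)
numbers = r"([0-9]+[0-9\.,]*)"
matcher = re.compile(numbers + r"\s*" + currency)

m = matcher.search("dolores et 100,000.00 USD ea rebum")
amount = m.group(1)   # the number part
curren = m.group(2)   # the currency part
```

Adding a currency then only means adding a code to the list, which addresses the "future proof" concern without any training step.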

Unless there are infinite patterns of money representations in your input (probably impossible), your problem can definitely be tackled with appropriate regular expressions.

I would just use regex to crudely extract possible pairs:

    import re

    test = '''Lorem ipsum dolor sit amet, 100.000,00 USD sadipscing elitr, sed diam nonumy eirmod 
    GBP 400 ut labore et dolore magna aliquyam erat, sed diam voluptua. At USD 20 eos et 
    accusam et justo duo dolores et 100,000.00 USD  ea rebum. Stet 3,-- USD gubergren, no'''

    # Crude patterns: any run of digits, dots and commas; any 2-3 capital letters
    number = r'([\d.,]+)'
    currency = r'([A-Z]{2,3})'

    # r1 matches "number currency", r2 matches "currency number"
    r1 = re.compile(number + r'\s+' + currency)
    r2 = re.compile(currency + r'\s+' + number)

    matches = r1.findall(test) + r2.findall(test)

    print(matches)

I get:

    [('100.000,00', 'USD'), ('100,000.00', 'USD'), ('GBP', '400'), ('USD', '20')]

From there, you can parse the numbers and filter out currencies that don't exist. You've got only five or six possible formats, so there's really nothing machine learning can do for you here.
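A minimal sketch of that last step could look like the following. The set `KNOWN_CURRENCIES`, the helpers `parse_amount` and `normalise`, and the separator heuristic are all assumptions: a final '.' or ',' followed by one or two digits is read as the decimal mark, and every other separator as a thousands separator, so an ambiguous "1.234" is read as 1234.

```python
import re
from decimal import Decimal

# Assumed set of valid codes; replace with your own list or database lookup.
KNOWN_CURRENCIES = {"USD", "GBP", "EUR"}

def parse_amount(text):
    """Turn '100.000,00', '100,000.00', '400' or '3,--' into a Decimal.

    Returns None if the text does not look like a number.
    """
    # Drop trailing ',--' style placeholders and sentence punctuation.
    cleaned = text.strip().rstrip("-.,")
    match = re.fullmatch(r"(\d[\d.,]*?)(?:[.,](\d{1,2}))?", cleaned)
    if match is None:
        return None
    whole = re.sub(r"[.,]", "", match.group(1))  # remove thousands separators
    fraction = match.group(2) or "0"
    return Decimal(f"{whole}.{fraction}")

def normalise(pairs):
    """Keep only pairs with a known currency and a parseable amount."""
    results = []
    for a, b in pairs:
        # findall gives (number, currency) or (currency, number).
        currency, raw = (a, b) if a in KNOWN_CURRENCIES else (b, a)
        amount = parse_amount(raw)
        if currency in KNOWN_CURRENCIES and amount is not None:
            results.append((amount, currency))
    return results

matches = [('100.000,00', 'USD'), ('100,000.00', 'USD'),
           ('GBP', '400'), ('USD', '20'), ('CPU', '1000')]
cleaned_pairs = normalise(matches)
```

The made-up `('CPU', '1000')` pair is dropped because "CPU" is not in the known set. `Decimal` is used rather than `float` to avoid rounding surprises with money.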
