Text Pattern Recognition Python

Suppose you have a set of very noisy texts and want to extract one specific pattern from each of them, say \d{3}(?:\.||\s)\d{3} . The problem is that this pattern can occur in many contexts, such as "443 440 $" , "923 140 €" , "923 140 EUR" , "product id 001 012" , "id prod. 001 012" , "product 001 012" , sometimes within the same text.
As we can see, the pattern matches all of these. For example:

text1 = "Here it is simple because my text includes only one regexp matching which is 443 440 ID"
text2 = "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023"
text3 = "Also, it can be found that the product 001.079 has a profit of 900 000 $USD"
text4 = "It can be analyzed that the 001789 product contains 001 000 components"
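
To make the ambiguity concrete, here is a quick check (a sketch only) that runs the pattern above, written as a Python raw string, over the four sample texts. The names `pattern`, `texts` and `matches` are introduced here purely for illustration.

```python
import re

# Pattern from the question, written as a raw string.
pattern = re.compile(r"\d{3}(?:\.||\s)\d{3}")

texts = [
    "Here it is simple because my text includes only one regexp matching which is 443 440 ID",
    "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023",
    "Also, it can be found that the product 001.079 has a profit of 900 000 $USD",
    "It can be analyzed that the 001789 product contains 001 000 components",
]

# One list of matches per text; the group is non-capturing, so findall returns whole matches.
matches = [pattern.findall(t) for t in texts]
for found in matches:
    print(found)
# ['443 440']
# ['956.000', '001 023']
# ['001.079', '900 000']
# ['001789', '001 000']
```

Three of the four texts yield two candidates, so the pattern alone cannot tell a product ID from an amount or a count.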

Here I want to be sure that I am collecting the right thing: the product IDs [443 440, 001 023, 001.079, 001789].

How would you deal with this?

In the real world, some features may help to decide whether or not a number is actually a product ID: the position of the match in the text (generally near the beginning), constant discriminant words (EUR, $, ...), and so on.

You can try this:

import re

text1 = "Here it is simple because my text includes only one regexp matching which is 443 440 ID"
text2 = "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023"
text3 = "Also, it can be found that the product 001.079 has a profit of 900 000 $USD"
text4 = "It can be analyzed that the 001789 product contains 001 000 components"
s = [text1, text2, text3, text4]

# Collect runs of digits, whitespace and dots that sit directly before or after "ID" or "product".
final_ids = [re.findall(r'[\d\s\.]+(?=ID)|(?<=ID)\s*[\d\s\.]+|[\d\s\.]+(?=product)|(?<=product)\s*[\d\s\.]+', i) for i in s]
# Drop whitespace-only runs, strip the rest, and keep the first candidate found in each text.
new_final_ids = [[b.strip() for b in i if re.search(r'\d', b)][0] for i in final_ids]
print(new_final_ids)

Output:

['443 440', '001 023', '001.079', '001789']
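
The context features mentioned in the question (neighbouring words such as ID or product, discriminant words such as EUR or $) can also be scored explicitly instead of being encoded in lookarounds. The following is a minimal sketch under stated assumptions, not a definitive implementation: the cue lists `ID_WORDS` and `MONEY_PREFIXES` and the helpers `score` and `pick_product_id` are hypothetical names chosen to fit the four sample texts, and would need tuning on real data.

```python
import re

# Candidate pattern from the question (raw string).
CANDIDATE = re.compile(r"\d{3}(?:\.||\s)\d{3}")

# Hypothetical cue lists, chosen only to fit the four sample texts.
ID_WORDS = {"id", "product", "prod."}
MONEY_PREFIXES = ("eur", "€", "$", "usd")


def score(text, match):
    """Score a candidate from the word just before and the word just after it."""
    before = text[:match.start()].split()
    after = text[match.end():].split()
    prev_word = before[-1].lower() if before else ""
    next_word = after[0].lower() if after else ""
    points = 0
    if prev_word in ID_WORDS or next_word in ID_WORDS:
        points += 2  # a neighbouring word suggests a product ID
    if next_word.startswith(MONEY_PREFIXES):
        points -= 2  # followed by a currency marker, so probably an amount
    return points


def pick_product_id(text):
    """Return the best-scoring candidate, preferring earlier matches on ties."""
    candidates = list(CANDIDATE.finditer(text))
    if not candidates:
        return None
    best = max(candidates, key=lambda m: (score(text, m), -m.start()))
    return best.group()


texts = [
    "Here it is simple because my text includes only one regexp matching which is 443 440 ID",
    "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023",
    "Also, it can be found that the product 001.079 has a profit of 900 000 $USD",
    "It can be analyzed that the 001789 product contains 001 000 components",
]

ids = [pick_product_id(t) for t in texts]
print(ids)
# ['443 440', '001 023', '001.079', '001789']
```

Breaking ties in favour of the earliest match reflects the observation that the ID generally appears near the beginning of the text.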

You can use http://regex.inginf.units.it/ to generate a regular expression based on example data. If you have a big enough training set, it should get the job done.

For your four examples, it generated this one: 001[^\d]\d++ . Of course, it does not work for all of your cases, but you might get a better result with more examples.
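
As a rough check of how far the generated expression gets, the sketch below applies it to the four sample texts. One assumption to note: Python's `re` module accepts possessive quantifiers such as `\d++` only from version 3.11, so the greedy `\d+` is used instead; because nothing follows it in the pattern, it matches the same strings here. The names `generated`, `texts` and `hits` are illustrative.

```python
import re

# Generated expression with the possessive \d++ replaced by the greedy \d+
# (equivalent here, and it also runs on Python versions before 3.11).
generated = re.compile(r"001[^\d]\d+")

texts = [
    "Here it is simple because my text includes only one regexp matching which is 443 440 ID",
    "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023",
    "Also, it can be found that the product 001.079 has a profit of 900 000 $USD",
    "It can be analyzed that the 001789 product contains 001 000 components",
]

hits = [generated.findall(t) for t in texts]
print(hits)
# [[], ['001 023'], ['001.079'], ['001 000']]
```

It finds the IDs in the second and third texts, misses 443 440 and 001789, and picks up the component count 001 000 in the fourth text.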
