How to extract information from the following text?

I'm trying to extract the title, description, and address from the text of different websites. I'm crawling these sites to collect that information, but I'm having trouble coming up with a regular expression that produces the expected output shown below.

How can I improve my regular expression so that it follows the rules below and extracts this information?

My Regex:

(^.+\n)(^.+\n)?(^\d+.*\d{6})

Set of rules to embed:

First line (title)
    - can contain any letters and numbers
    - should not contain a dot (.)
Second line (description or additional information)
    - can contain any letters and numbers
    - should contain a dot (.)
    - can be empty
    - if it is empty, then extract the first line, which is the title
Third line (address)
    - address extraction

Input Text:

View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now
Full Menu
View store information
THE SIGNATURE
The SIGNATURE is a wonderful destination for shopping text.
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066
65883667
Open Now
Full Menu
Jewel Changi Airport
Jewel Changi Airport is a breath-taking place for families text.
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666

Expected Text Output: (Ideally)

TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284

THE SIGNATURE
11559.97Km Away,
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066

Jewel Changi Airport
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666

One option is to match words using \w and repeat the first capturing group; a repeated group keeps only the value of its last iteration, which here is the title.

^(\w+(?: \w+)*\r?\n)*(?:(?![^.\r\n]*\.|.*\d{6}).*\r?\n)*(?:([^\r\n.]*\..*(?:\r?\n(?!.* \d{6}).*)*)\r?\n)?(.* \d{6}(?:\r?\n(?![A-Z]).*)*)$
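To illustrate the repeated capturing group in isolation, here is a small sketch. The function name `lastIteration` and the sample string are made up for this illustration and are not part of the answer. In JavaScript, a group repeated with `*` reports only what its final successful iteration matched, which is why group 1 ends up holding the title line rather than "View store information".

```javascript
// Hypothetical helper for illustration; not part of the original answer.
// It applies only the first part of the pattern: a repeated group of
// lines that consist of words separated by single spaces.
function lastIteration(text) {
  const wordLines = /^(\w+(?: \w+)*\r?\n)*/;
  const match = wordLines.exec(text); // never null, because * allows zero repetitions
  // match[0] spans every repeated line; match[1] keeps only the last iteration.
  return { whole: match[0], last: match[1] };
}

const sample = "View store information\nTAMPINES MART\n11559.33Km Away,\n";
const result = lastIteration(sample);
console.log(JSON.stringify(result.whole));
console.log(JSON.stringify(result.last));
```

The third line of the sample contains a dot, so the group cannot match it and the value from the second iteration ("TAMPINES MART" plus its line break) is kept.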

const regex = /^(\w+(?: \w+)*\r?\n)*(?:(?![^.\r\n]*\.|.*\d{6}).*\r?\n)*(?:([^\r\n.]*\..*(?:\r?\n(?!.* \d{6}).*)*)\r?\n)?(.* \d{6}(?:\r?\n(?![A-Z]).*)*)$/mg;
const str = `View store information
TAMPINES MART
11559.33Km Away,
5 TAMPINES ST 32, #01-07/16 TAMPINESS MART, 529284
67817232
Open Now
Full Menu
View store information
THE SIGNATURE
The SIGNATURE is a wonderful destination for shopping text.
51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066
65883667
Open Now
Full Menu
Jewel Changi Airport
Jewel Changi Airport is a breath-taking place for families text.
78 Airport Boulevard, #B2-275-277 Jewel Changi Airport, Singapore, 819666`;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // m[1] is the last iteration of group 1 and still includes its line break
  console.log("Title: " + m[1]);
  // m[2] is undefined when there is no description line
  if (undefined !== m[2]) {
    console.log("Description: " + m[2]);
  }
  console.log("Address: " + m[3]);
  console.log("\n");
}
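
If a single pattern becomes hard to maintain, the same rules can also be applied line by line. The following is a minimal sketch rather than the method used above, and it rests on assumptions not stated in the question: the address is a single line ending in whitespace followed by six digits, the description (if present) is the single line directly above the address and contains a dot, and the title is the line above that. The function name `extractStores` is hypothetical.

```javascript
// Hypothetical helper; a line-by-line reading of the rules in the question.
function extractStores(text) {
  const lines = text.split(/\r?\n/).map((line) => line.trim());
  // Assumption: an address line ends with whitespace followed by six digits.
  const endsWithPostalCode = /\s\d{6}$/;
  const stores = [];

  lines.forEach((line, i) => {
    if (!endsWithPostalCode.test(line)) return;

    let j = i - 1;
    // An empty second line is skipped, so only the title is extracted.
    if (j >= 0 && lines[j] === "") j--;

    // A line containing a dot directly above the address is the description.
    let description = null;
    if (j >= 0 && lines[j].includes(".")) {
      description = lines[j];
      j--;
    }
    if (j < 0) return; // no title line available

    stores.push({ title: lines[j], description, address: line });
  });

  return stores;
}

const storeText = [
  "View store information",
  "THE SIGNATURE",
  "The SIGNATURE is a wonderful destination for shopping text.",
  "51, CHANGI BUSINESS PARK CENTRAL 2, #01-15, THE SIGNATURE, 486066",
  "65883667",
].join("\n");
console.log(extractStores(storeText));
```

Unlike the pattern above, this sketch does not attach following lines (such as the phone number) to the address, and it would need adjusting for addresses that span more than one line.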
