
Javascript regular expression matches some strings, but fails on other seemingly identical strings

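The jsfiddle

I'm using Facebook's API to pull the daily crime reports from my county's police department page. They follow a mostly standardized format. The pattern below is what I'm working with, along with some annoying inconsistencies:

  1. The header is 3-4 lines long, followed by two newline characters \n\n (the code cuts it off, so it is not part of the output below)
  2. The different categories of crimes are grouped together, and the first line of each group is an uppercase string describing the type of crime. Each category is separated from the previous one by two newline characters \n\n above it.
  3. The actual crimes follow the category header, each (most of the time) separated by a single newline character \n
  4. As an artifact of whatever they copy and paste from, a dash is sometimes used in place of a hyphen, including \u2013, \u2014 and \u2015
  5. All reported crimes start with the string "BEAT", or on rare occasions "Beat"

The problem I'm having is that sometimes the code below captures the category headers described in #2 above, but in other posts (seemingly) identical strings in identical circumstances are not captured. The Angular code I'm using in my service is as follows: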

me.parsePosts = function() {
    var posts = facebookService.getRandomPosts(); // Just a method to return 5 random reports for now
    angular.forEach(posts, function(post) {
        // Some reports are incorrectly double spaced and inconsistent
        // with spacing and capitalization
        var fixedPost = post.message
                            .replace(/^Beat/, 'BEAT') // They were a little inconsistent back in the day
                            .replace('\n\n###', '') // All posts end with a useless ###
                            .replace('\u2013', '-') // Pesky unicode characters!
                            .replace('\u2014', '-')
                            .replace('\u2015', '-')
                            .replace('\n\nARRESTED', '\nARRESTED') // would help if this was consistent
                            .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'), // same with the reports...
            postSplit = fixedPost.split('\n\n'), // split up the post into potential categories
            header = postSplit.splice(0,1); // I don't want the standard header of the post

        // Pass in postSplit .join()'d back together for debugging
        me.getCategoriesFromPost(postSplit, postSplit.join('\n\n'));
    });
};

me.getCategoriesFromPost = function(postArray, post) {
    var categoryRegexp = /[A-Z\-&\/: ]+$/,
        categories = [], uniqCategories = [];

    angular.forEach(postArray, function(a) {
        var split = a.split('\n'), // Extract the category from the list of crimes
            potentialCategory = split[0].trim(); // There's often an unwanted trailing space

        if (potentialCategory.match(categoryRegexp)) {
            categories.push(potentialCategory);
        }
    });

    // Every blue moon they repost a category twice, I just want one
    // and I'll merge the two together afterwards
    uniqCategories = categories.filter(function(a,b) {
        return categories.indexOf(a) == b;
    });

    console.log(uniqCategories); // log off all the categories in the post
    console.log(post); // Display the actual post so i can visibly verify it all worked
};

As an example, in one post:

console.log(post); (the raw text received from facebookService.getRandomPosts()):

BURGLARY COMMERCIAL
BEAT E1 SPRINT WIRELESS, 7300 ASSATEAGUE DR, 3/19 0426: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole electronics. 14-25638
BEAT D6 MONTPELIER LIQUORS, 7500 MONTPELIER RD, 3/19 0513: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole liquor, lottery tickets, and an ATM machine. 14-25641
BEAT D4 MACY’S, 10300 LITTLE PATUXENT PKWY, 3/19 0501: Two unknown male suspects, wearing masks, gained entry to the business by breaking the glass door. The suspects were interrupted by a store employee and fled without taking anything. 14-25642
SUSPECT VEHICLE: black Dodge pickup 

BURGLARY NON COMMERCIAL
BEAT B3 6600 ASPERN DR, 3/17 2354: Four suspects gained entry to the residence via unknown means. No sign of forced entry. 14-25220 
ARRESTED:
Karlin Lamont Harris, 23, of Pirch Way in Elkridge, charged with fourth-degree burglary
Steven Lee Hubbard, 29, of Edgewater, charged with fourth-degree burglary
Jessie Tyler Holt, 22, of Pine Tree Rd in Jessup, charged with fourth-degree burglary
Brittney Victoria McEnaney, 26, of Pasadena, charged with fourth-degree burglary
BEAT C1 6900 BENDBOUGH CT, 3/18 1400: Unknown suspect(s) gained entry to the residence via the front door. No sign of forced entry. The suspect(s) stole jewelry. 14-25392
BEAT B4 7100 DEEP FALLS WAY, 3/18 1100-1440: Unknown suspect(s) gained entry to the residence by forcing a rear basement window. The suspect(s) stole jewelry and electronics. 14-25404 

VEHICLE THEFT & ATTEMPTS
BEAT E2 7-11, 9600 WASHINGTON BLVD, 3/18 0409: 
05 Acura Tag 1AV8629 14-25277 (Keys left in vehicle.)

console.log(uniqCategories); returns

["BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

Yet in another post, console.log(post); (the raw text received from facebookService.getRandomPosts()):

ROBBERY COMMERCIAL
BEAT B3 ZIPS DRY CLEANING, 6500 OLD WATERLOO RD, 3/22 1900: An unknown suspect entered the business through an unlocked rear door. The suspect threatened an employee and demanded cash. The employee complied. The suspect fled the business. 14-26959 
SUSPECT: B/M, 5’8-5’9, black hoodie and pants, backpack 

ROBBERY NON COMMERCIAL
BEAT E7 7-11 PARKING LOT, 9100 MAIER RD, 03/23 1632: Suspect stole cash from an acquaintance and caused an abrasion with an unknown sharp object. Police are investigation the possibility it may be drug related. 14-27243 
SUSPECT: B/M, 5’8, 200 lbs, dreadlocks

BURGLARY COMMERCIAL
BEAT E1 MEGATELECOM, 8600 WASHINGTON BLVD #106, 3/22 0933: Unknown suspect(s) gained entry to the business by breaking a window. The suspect(s) stole electronics. 14-26793
BEAT F3 CATTAIL CREEK COUNTRY CLUB, 3600 CATTAIL CREEK DR, 03/22 1600- 03/23 0630: Unknown suspect(s) gained entry to a garage through an unlocked door. The suspect(s) stole golf carts. 14-27127

BURGLARY NON COMMERCIAL
BEAT E2 9300 BREAMORE CT, 03/21 1210 ATTEMPT: Two suspects attempted to gain entry via a rear slider. The resident yelled and the suspects fled, but were later caught by police. 14-26458
ARRESTED:
Travis Donte Mackell, 23, of Baltimore, charged with fourth-degree burglary
Maurice Debuiel Aye, 26, of Baltimore, charged with fourth-degree burglary
BEAT D3 5500 COLUMBIA RD, 3/21: An unknown suspect gained entry to the residence through an unlocked rear slider. The suspect woke the resident, who ultimately got the suspect to leave. It appears he may have entered the wrong residence. 14-26712 
SUSPECT: B/M, 5’8, 200 lbs
BEAT B4 7500 HEARTHSIDE WAY, 3/22 1700- 1800: Three unknown black male suspects stole a bicycle, which was unsecured on a bike rack. 14-27185
BEAT E3 9100 BRYANT AVE, 3/23 2213: Unknown suspects gained entry to the residence by prying open the kitchen window. Nothing appeared to be taken. 14-27308
BEAT B3 8000 KEETON RD, 3/23 1930- 2230: Unknown suspect(s) gained entry to the residence through an unlocked window. The suspect(s) stole a computer and jewelry. 14-27314
BEAT A3 9000 FREDERICK RD, 3/23 0205: The suspect kicked in an acquaintance’s door after a verbal altercation and assaulted him. 14-27361 
ARRESTED: Michael Wilson Sittig, 34, of Frederick Road in Ellicott City, charged with second-degree assault, third- and fourth-degree burglary, malicious destruction of property, and disorderly conduct

VEHICLE THEFT & ATTEMPTS
BEAT D2 5100 ELIOTS OAK DR, 03/22 2130- 3/23 0700: 
12 Hyundai Sonata Red MD 5AN2945 14-27135

console.log(uniqCategories); only returns:

["ROBBERY COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

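I would expect it to return ["ROBBERY COMMERCIAL", "ROBBERY NON COMMERCIAL", "BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

So it's obvious that my code matches the former instances of BURGLARY COMMERCIAL and BURGLARY NON COMMERCIAL, but not the latter. What gives? Also, feel free to correct me and tell me I'm doing it wrong with that wall of .replace() calls, and that there's a better way, if there is one. Help is greatly appreciated!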

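String.replace with a string pattern replaces only the FIRST occurrence. You need to change every String.replace call to use a regular expression with the g flag so that all occurrences are replaced. Something like this (though I'm not sure how unicode characters work in regular expressions):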

post.message
  .replace(/^Beat/ig, 'BEAT') // They were a little inconsistent back in the day
  .replace(/\n\n###/g, '') // All posts end with a useless ###
  .replace(/\u2013/g, '-') // Pesky unicode characters!
  .replace(/\u2014/g, '-')
  .replace(/\u2015/g, '-')
  .replace(/\n\nARRESTED/g, '\nARRESTED') // would help if this was consistent
  .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'); // same with the reports...
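To illustrate the point about string patterns, here is a minimal sketch. The sample strings are made up for illustration. It relies only on standard behavior: String.prototype.replace with a string pattern replaces just the first match, while a regular expression with the g flag replaces every match. Unicode escapes such as \u2013 work inside regex literals, so the three dash variants can share one character class.

```javascript
// Made-up sample containing an en dash, an em dash, and a horizontal bar.
var sample = '1100\u20131440, 1600\u2014 0630, 1700\u2015 1800';

// String pattern: only the first occurrence is replaced.
var firstOnly = 'a-b-c'.replace('-', '+');

// Regex with the g flag: every occurrence is replaced.
var all = 'a-b-c'.replace(/-/g, '+');

// One character class covers all three dash variants in a single pass.
var normalized = sample.replace(/[\u2013\u2014\u2015]/g, '-');

console.log(firstOnly);  // a+b-c
console.log(all);        // a+b+c
console.log(normalized); // 1100-1440, 1600- 0630, 1700- 1800
```

Note that a regex wrapped in quotes, such as '/\n\n###/g', is treated as a plain string that includes the slashes, so it would not match anything in these posts.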

You missed some separator replacements before the split. Namely, I added:

post.message
...
.replace( /\s*\n\s\n/g, '\n\n')
.replace(/\s BEAT/g, 'BEAT') ... 

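See the updated fiddle

TL;DR (updated per comments)

If you look at the messages after the original replace(...) calls, right before the .split('\n\n'), some of them have a space at the end of a line, followed by a newline, then another whitespace character and a newline.

None of your original replace() calls handle this. Also, some have just the newline, empty line, newline pattern (which is why the first whitespace in the regex has a *). Then, some of the BEAT keywords in the messages have one or more spaces in front of them, so we remove those to make sure BEAT always starts right after a newline.

If you uncomment the logging lines in the fiddle and comment out the fixes, you'll see the array of elements at each step.

In one of them, you'll see an array element that contains not only what we expect (one report), but also the next category (which is why you're seeing less data).

I then tried to see what was different about those line endings, and checked whether the replace() calls handled them before the split(...) call...

Let me know if you'd like me to explain this better.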
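The effect of a whitespace-only "blank" line on the split can be sketched as follows. This is not the code from the fiddle: splitIntoSections is a hypothetical helper, the sample text is made up, and the regexes are one possible way to normalize the separators, assuming that trailing spaces and indentation carry no meaning in these posts.

```javascript
// Hypothetical helper (not from the fiddle): normalize whitespace, then split.
function splitIntoSections(message) {
    return message
        .replace(/\r\n?/g, '\n')            // unify line endings
        .replace(/[ \t]+\n/g, '\n')         // drop trailing spaces/tabs at line ends
        .replace(/\n{2,}/g, '\n\n')         // collapse runs of blank lines into one separator
        .replace(/\n[ \t]+BEAT/g, '\nBEAT') // remove indentation in front of BEAT
        .split('\n\n');
}

// Made-up sample: the separator before the second category is "\n \n",
// and the second report is indented by one space.
var raw = 'ROBBERY COMMERCIAL\nBEAT B3 first report \n \n' +
          'ROBBERY NON COMMERCIAL\n BEAT E7 second report';

var naive = raw.split('\n\n');      // no literal "\n\n" in raw, so nothing is split
var fixed = splitIntoSections(raw);

console.log(naive.length); // 1
console.log(fixed.length); // 2
console.log(fixed[1]);     // ROBBERY NON COMMERCIAL (newline) BEAT E7 second report
```

With the naive split, the second category stays glued to the first section, so it never appears as the first line of an array element and is never tested against the category regex.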
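One way to make such invisible differences in line endings visible is sketched below, using a made-up sample. JSON.stringify escapes newlines, so trailing spaces and whitespace-only lines show up in the log output. This is only a debugging aid, not part of the fix.

```javascript
// Made-up sample with a space hidden on the "blank" line.
var chunk = 'SUSPECT: B/M \n \nROBBERY NON COMMERCIAL';

// JSON.stringify prints newlines as \n, which exposes the stray spaces.
var visible = JSON.stringify(chunk);
console.log(visible); // "SUSPECT: B/M \n \nROBBERY NON COMMERCIAL"

// Collect lines that look empty but actually contain whitespace.
var suspicious = chunk.split('\n').filter(function (line) {
    return line.length > 0 && line.trim() === '';
});
console.log(suspicious.length); // 1
```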

