Ruby regex and multi-line strings

Question

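I am using the prawn gem to read through a 60-page computer-generated PDF report that contains financial and demographic data for dozens of people. My challenge is that, as I scan each line, I want to capture the name/special ID (which sit on the same line) together with the subsequent lines that relate to that person. Using Ruby's String#scan method, I have been able to capture the financial information, with each match returning a row in this form: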

[<invoice no.>, <service type>, <modifier (if any)>, <service_date>, <units>, <amount>]

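I have tried associating the ID with the financial data for the following lines and then switching it whenever the ID changes, but nothing has worked. Am I going about this backwards? I have very little experience with regex (and with programming in general).

Here is the code that works for the financial data only: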

PDF::Reader.new(file).pages.each do |page|
  page.raw_content.scan(/^\(\s(\d{6})\s+\d\s+(\w\d{4})\s+(0580|TT|1C|1C\s+1F)?\s+(\d+\/\d+\/\d+)\s+\d+\/\d+\/\d+\s+(\d+\.\d+)\s+(\d+\.\d+)/) do |line|        
    line.collect {|x| x.strip! if !x.nil?}
    print "#{line.join(' ')}\n"
    Cycle.check_details(line)
  end
end

Here is what puts page.raw_content produces (the lines contain a lot of whitespace):

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JAIME         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  887.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
   434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JOFFREY         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  259.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
   434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

Answer 1

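Not everything can be parsed with a regex. Sometimes regex only becomes useful after the data has been broken into manageable chunks. Your data is an example of the second case: once it is broken down, the individual lines are easy to parse.

Your data is messy, but this will clean it up. After stripping the leading ( and the trailing )' from each line, the code uses split to break the text into lines, then slice_before to group the lines into logical blocks. Once those are gathered, each block can be processed in a sensible way: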

require 'pp' # provides Kernel#pp; loaded automatically on Ruby 2.5+

data = "(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JAIME         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  887.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv_num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv_num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
  434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'

(REG  LOC   CLIENT   SERVICE   NAME                    BIRTH DATE   RECIPIENT ID    PRIOR AUTHORIZATION #)'
(xx   xxx  xxxxx     xxxxxxx    LANNISTER, JOFFREY         xx/xx/xxxx   xxxx <special ID>)'
(DIAGNOSIS CODES:  259.0)'
( )'
(  INV #   LINE #   PROCEDURE CODE  REVENUE CD   FROM DT   THRU DT     UNITS AMOUNT)'
( <inv_num>       1    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       2    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     2.50     41.00)'
( <inv_num>       3    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       4    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       5    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       6    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
( <inv_num>       7    <service_code>  <modifier>                    xx/xx/13  xx/xx/13     4.00     65.60)'
(                                                                CLAIM TOTAL
  434.60   CLAIM ACCOUNT REF.  xxxxxxxxxxxxxxxSUP)'
"

lines = data.gsub(/^\(|\)'$/m, '').split("\n").map{ |s| s.strip }.reject{ |s| s.empty? }.slice_before(/^REG\b/)
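To see what this chain produces, here is a minimal sketch on a much smaller, made-up string. The sample text and the names sample, cleaned, and chunks are illustrative assumptions, not part of the report. It applies the same cleanup and shows that slice_before hands back an Enumerator whose to_a is an array of line blocks.

```ruby
# Made-up miniature of the report: two records, each starting with "REG".
sample = "(REG header)'\n(row A1)'\n(row A2)'\n\n(REG header)'\n(row B1)'\n"

# Same cleanup as above: drop the leading "(" and trailing ")'" from each line,
# split into lines, trim whitespace, and discard lines that end up empty.
cleaned = sample.gsub(/^\(|\)'$/, '').split("\n").map(&:strip).reject(&:empty?)

# Start a new chunk at every line that begins with the word REG.
chunks = cleaned.slice_before(/^REG\b/)

puts chunks.class  # slice_before returns an Enumerator, not an Array
p chunks.to_a      # [["REG header", "row A1", "row A2"], ["REG header", "row B1"]]
```

Note that lines which become empty after the cleanup are dropped, so the positions inside each chunk shift up accordingly.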

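At this point, lines behaves like an array of arrays. Each sub-array is a block of lines that starts with "REG": every time slice_before sees a new line matching /^REG\b/, it starts a new sub-array/block. Strictly speaking, lines is an Enumerator, a preliminary object similar to what you have before pulling an array or individual key/value pairs out of a hash. You can iterate over an Enumerator, which is what we want to do: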

patient_data = lines.map { |sub_ary|
  # The second line of each block holds the patient name and the special ID.
  sub_ary[1][/(?:\S+ \s+ ){4} (\S+, \s+ \S+) \s+ (?:\S+ \s+){2} (.+)$/x]
  patient_name, special_id = $1, $2

  # Invoice rows start at index 4, because the blank "( )'" line was rejected
  # above; the last two lines of each block are the claim total.
  invoice_info = sub_ary[4..-3].map{ |line|
    line[/^(\S+) \s+ \S+ \s+ (\S+) \s+ (\S+)/x]
    [$1, $2, $3]
  }

  {
    patient_name: patient_name,
    special_id:   special_id,
    invoice_info: invoice_info
  }
}

pp patient_data

Which outputs:

[{:patient_name=>"LANNISTER, JAIME",
  :special_id=>"<special ID>",
  :invoice_info=>
  [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]},
{:patient_name=>"LANNISTER, JOFFREY",
  :special_id=>"<special ID>",
  :invoice_info=>
  [["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"],
    ["<inv_num>", "<service_code>", "<modifier>"]]}]

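This gets you close, but it does not solve the whole problem. I have deliberately left it to you to figure out how to modify the code to pull all the fields you need out of each record.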
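As one possible direction, the sketch below pulls all six fields the question lists out of a single cleaned-up invoice row using named captures. It is only an illustration: the helper name parse_invoice_line, the hash keys, and the sample values are invented here, and the column order is assumed from the report header shown in the question (the FROM date is treated as the service date, and the modifier is assumed to be optional).

```ruby
# Sketch only: column order assumed from the report header in the question.
def parse_invoice_line(line)
  pattern = %r{
    ^(?<invoice>\S+) \s+           # invoice number
    \S+ \s+                        # line number, ignored
    (?<service>\S+) \s+            # service code
    (?: (?<modifier>\S+) \s+ )?    # modifier, if any
    (?<date>\d+/\d+/\d+) \s+       # FROM date, used as the service date
    \d+/\d+/\d+ \s+                # THRU date, ignored
    (?<units>\d+\.\d+) \s+
    (?<amount>\d+\.\d+) $
  }x

  m = pattern.match(line.strip)
  return nil unless m              # not an invoice row (header, totals, ...)

  {
    invoice:  m[:invoice],
    service:  m[:service],
    modifier: m[:modifier],        # nil when the row has no modifier
    date:     m[:date],
    units:    m[:units].to_f,
    amount:   m[:amount].to_f
  }
end

# Invented rows, already stripped of the leading "(" and trailing ")'".
rows = [
  "123456       1    T1019  TT                    01/02/13  01/02/13     4.00     65.60",
  "123456       2    T1019                        01/03/13  01/03/13     2.50     41.00"
]

rows.each { |row| p parse_invoice_line(row) }
```

Because the helper returns nil for rows that do not look like invoice lines, it could be mapped over a whole block and the nils dropped with compact. If exact amounts matter, keeping the strings or converting with BigDecimal may be a better choice than to_f.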

Answer 2

If you want to test your regex, go to http://rubular.com/

It is a very useful tool, and the bottom of the page lists most of the regex basics.
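The same kind of quick check can also be run locally, for example in irb. The sketch below applies the pattern from the question to a single sample line; the values in the sample line are invented for illustration and only mimic the layout of the report.

```ruby
# Pattern copied from the question.
pattern = /^\(\s(\d{6})\s+\d\s+(\w\d{4})\s+(0580|TT|1C|1C\s+1F)?\s+(\d+\/\d+\/\d+)\s+\d+\/\d+\/\d+\s+(\d+\.\d+)\s+(\d+\.\d+)/

# Invented sample line in the same layout as the report's invoice rows.
sample = "( 123456       1    T1019  TT                    01/02/13  01/02/13     4.00     65.60)'"

match = pattern.match(sample)
if match
  p match.captures  # ["123456", "T1019", "TT", "01/02/13", "4.00", "65.60"]
else
  puts "no match"
end
```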

2 answers

Answer 1
1, accepted, 2013-08-05 21:18:37

Answer 2
0, 2013-08-05 19:50:15
