Removing uneven spaces from input file in ruby

I have a text file which contains balance sheet information for the company. The problem is that the spacing is uneven and I get data like this

28/07/15                 2.85                                104,689.13
30/07/15                                 31,862.00           136,551.13

The reason is that the 2.85 on the first line is a debit and the 31,862.00 on the second line is a credit.

How can I get the data in ruby so that I get 4 elements from the line with credit being empty on first and debit on second.

I can split the data based on multiple spaces and then compare the balance between successive lines to get credit vs debit information but I want to know if there is a better way(maybe regex) to do this.

Thank you.

Here's a way that will work even if the lines are really messed up. It relies on the fact that debits (credits) reduce (increase) the balance by the amount of the debit (credit). Let's first write some data to file:

data =<<_
28/07/15                 2.85  104,689.13
30/07/15        31,862.00                                    136,551.13
                                 28/07/15 1.13 136,550.00
30/07/15                                 10,000.01           146,550.01
_

FName = 'temp'
IO.write(FName, data)
  #=> 288

The method for extracting the fields follows. It requires the file name and starting balance. Alternatively, the second argument could be a boolean indicating whether the first line contains a debit or a credit.

require 'bigdecimal'

def extract_transactions(fname, starting_balance)
  transactions = []
  # reduce threads the running balance through the lines; the block's
  # return value becomes start_bal for the next line.
  IO.readlines(fname).reduce(BigDecimal(starting_balance)) do |start_bal, s|
    date, debit_or_credit, bal = s.strip.delete(',').split(/\s+/)
    h = { date: date, debit: '', credit: '', balance: bal }
    # A debit lowers the balance by exactly its amount; anything else is a credit.
    if BigDecimal(bal) == start_bal - BigDecimal(debit_or_credit)
      h[:debit] = debit_or_credit
    else
      h[:credit] = debit_or_credit
    end
    transactions << h
    BigDecimal(bal)
  end
  transactions
end

Let's try it:

extract_transactions(FName, "104691.98")
  #=> [{:date=>"28/07/15", :debit=>"2.85", :credit=>"", :balance=>"104689.13"},
  #    {:date=>"30/07/15", :debit=>"", :credit=>"31862.00", :balance=>"136551.13"},
  #    {:date=>"28/07/15", :debit=>"1.13", :credit=>"", :balance=>"136550.00"},
  #    {:date=>"30/07/15", :debit=>"", :credit=>"10000.01", :balance=>"146550.01"}]

I used BigDecimal to avoid problems with round-off errors.
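To make the round-off concern concrete, here is a small sketch (separate from the method above) comparing Float and BigDecimal arithmetic. The amounts 0.1, 0.2 and 0.3 are illustrative only and do not come from the file.

```ruby
require 'bigdecimal'

# Binary floats cannot represent most decimal fractions exactly,
# so an equality test on a computed balance can fail unexpectedly.
float_equal = (0.1 + 0.2 == 0.3)

# BigDecimal values built from strings keep the decimal digits exact.
decimal_equal = (BigDecimal('0.1') + BigDecimal('0.2') == BigDecimal('0.3'))

puts float_equal    # false
puts decimal_equal  # true
```

This is why the method compares BigDecimal values rather than floats when it checks whether the balance dropped by exactly the transaction amount.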

Enumerable#reduce (aka inject) updates the balance (start_bal, initially starting_balance) after each transaction (row).
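As a minimal sketch of how reduce carries a running balance from one row to the next, consider the following. The amounts are made up for illustration and expressed in integer cents (negative for a debit, positive for a credit); the variable names are hypothetical.

```ruby
# Illustrative changes in cents: a 2.85 debit, then a 31,862.00 credit.
changes = [-285, 3_186_200]
balances = []

# The initial value (104,691.98 in cents) plays the role of starting_balance.
final = changes.reduce(10_469_198) do |bal, change|
  new_bal = bal + change
  balances << new_bal
  new_bal # the block's value becomes `bal` on the next iteration
end

p balances # [10468913, 13655113]
p final    # 13655113
```

The key point is that whatever the block returns is handed back in as the accumulator, which is how the method above always knows the previous line's balance.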

Edit: here's a non-BigDecimal variant (that's better). Its second argument is a boolean saying whether the first line contains a debit:

def extract_transactions(fname, debit_first)
  curr_bal = (debit_first ? Float::INFINITY : -Float::INFINITY)
  IO.readlines(fname).each_with_object([]) do |s, transact|
    date, debit, bal = s.strip.split(/\s+/)
    credit = ''
    bal_float = bal.delete(',').to_f
    (debit, credit = credit, debit) if bal_float > curr_bal
    transact << { date: date, debit: debit, credit: credit, balance: bal }
    curr_bal = bal_float
  end
end

extract_transactions(FName, true)
  #=> [{:date=>"28/07/15", :debit=>"2.85", :credit=>"", :balance=>"104,689.13"},
  #    {:date=>"30/07/15", :debit=>"", :credit=>"31,862.00", :balance=>"136,551.13"},
  #    {:date=>"28/07/15", :debit=>"1.13", :credit=>"", :balance=>"136,550.00"},
  #    {:date=>"30/07/15", :debit=>"", :credit=>"10,000.01", :balance=>"146,550.01"}]

The only constant you have is the line length: 71 characters. Rounding that up to 72 gives a multiple of 4, that is, 4 fields of 18 characters each. We might try to use it:

▶ data = %q|28/07/15                 2.85                                104,689.13
▷ 30/07/15                                 31,862.00           136,551.13|
▶ data.split($/).map do |line|
▷   #                         ⇓⇓ ≡ (line length + 1) / number of fields
▷   line.split(//).each_slice(18).map(&:join).map(&:strip)
▷ end
#⇒ [
#  [0] [
#    [0] "28/07/15",
#    [1] "2.85",
#    [2] "",
#    [3] "104,689.13"
#  ],
#  [1] [
#    [0] "30/07/15",
#    [1] "",
#    [2] "31,862.00",
#    [3] "136,551.13"
#  ]
# ]
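A variation on the same fixed-width idea, sketched under the assumption that every field is exactly 18 characters wide: padding each line first means a line whose trailing spaces were trimmed still yields 4 fields. The helper name split_fixed and its keyword arguments are hypothetical, and the sample lines are rebuilt with explicit spacing so the column positions are unambiguous.

```ruby
# Hypothetical helper: pad the line to fields * width characters, cut it
# into equal-width chunks, and strip the padding from each chunk.
def split_fixed(line, width: 18, fields: 4)
  line.chomp.ljust(width * fields).scan(/.{#{width}}/).first(fields).map(&:strip)
end

# Sample lines, 71 characters each (assumed spacing, matching the widths above).
line1 = '28/07/15' + ' ' * 17 + '2.85' + ' ' * 32 + '104,689.13'
line2 = '30/07/15' + ' ' * 33 + '31,862.00' + ' ' * 11 + '136,551.13'

p split_fixed(line1) # ["28/07/15", "2.85", "", "104,689.13"]
p split_fixed(line2) # ["30/07/15", "", "31,862.00", "136,551.13"]
```

Like the each_slice version, this only works while the columns really do fall on 18-character boundaries.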

How can I get the data in ruby so that I get 4 elements from the line with credit being empty on first and debit on second.
I want to know if there is a better way(maybe regex)

I'll take these to be the last 4 columns of each line.


I will assume the following column widths:

+--------------+--------+----------------------------+------------------+-----------------+
|Previous col  | DATE   | DEBIT                      | CREDIT           | BALANCE         |
| (any width)  | width=8|  (width=28)                |  (width=18)      |  (width=17)     |
+--------------+--------+----------------------------+------------------+-----------------+
|      ...     |28/07/15|                 2.85       |                  |       104,689.13|
|      ...     |30/07/15|                            |     31,862.00    |       136,551.13|
+--------------+--------+----------------------------+------------------+-----------------+


If you think about it, we can match the whole width of the last column with /.{17}$/. The trick here is to use a lookahead to capture the value of the field, starting 17 characters to the left of the end of the line and moving forward:

/(?=[ ]{0,16}([\d,.]+)).{17}$/

Credit is the previous column, and its width is 18 characters (/.{18}/), but since it's an optional field, we need to enclose the lookahead in an optional group. If we prefix this pattern to the last regex, we now have:

/(?:(?=[ ]{0,17}([\d,.]+)))?.{18}(?=[ ]{0,16}([\d,.]+)).{17}$/
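To see this intermediate two-column pattern in action, here is a short sketch that applies it to both sample lines. The lines are rebuilt with explicit spacing (71 characters each), which is an assumption about the input that matches the column widths in the table; the variable names are illustrative.

```ruby
# Two-column pattern: optional credit (width 18), then balance (width 17).
tail = /(?:(?=[ ]{0,17}([\d,.]+)))?.{18}(?=[ ]{0,16}([\d,.]+)).{17}$/

# Sample lines with assumed spacing: date, debit column, credit column, balance.
line1 = '28/07/15' + ' ' * 17 + '2.85' + ' ' * 32 + '104,689.13'
line2 = '30/07/15' + ' ' * 33 + '31,862.00' + ' ' * 11 + '136,551.13'

credit1, balance1 = line1.match(tail).captures
credit2, balance2 = line2.match(tail).captures

p [credit1, balance1] # [nil, "104,689.13"]
p [credit2, balance2] # ["31,862.00", "136,551.13"]
```

The credit group comes back as nil when the column is blank, because the optional group around the lookahead is simply skipped while .{18} still consumes the column.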

And we use the same logic to complete all 4 fields into this one-liner regex (break-down in the code below):

/(?=[ ]{0,7}(?<date>[\d\/]+)).{8}(?:(?=[ ]{0,27}(?<debit>[\d,.]+)))?.{28}(?:(?=[ ]{0,17}(?<credit>[\d,.]+)))?.{18}(?=[ ]{0,16}(?<balance>[\d,.]+)).{17}[ ]*$/
  • Notice I'm using named groups for both readability and practicality in the code.


Code:

data = %q|28/07/15                 2.85                                104,689.13
30/07/15                                 31,862.00           136,551.13|

regex = /
    (?=[ ]{0,7}(?<date>[\d\/]+))          # Field 1: date
    .{8}                                  #  column 1 (width=8)
                                          #
    (?:(?=[ ]{0,27}(?<debit>[\d,.]+)))?   # Field 2: debit (optional)
    .{28}                                 #  column 2 (width=28)
                                          #
    (?:(?=[ ]{0,17}(?<credit>[\d,.]+)))?  # Field 3: credit (optional)
    .{18}                                 #  column 3 (width=18)
                                          #
    (?=[ ]{0,16}(?<balance>[\d,.]+))      # Field 4: balance
    .{17}                                 #  column 4 (width=17)
                                          #
    [ ]*$                                 # optional spaces -> EoL
/x


# hash from all named captures from all matches
result = data.scan(regex).map { |match| Hash[regex.names.zip(match)] }

p result
#=> [{"date"=>"28/07/15", "debit"=>"2.85", "credit"=>nil, "balance"=>"104,689.13"}, 
#    {"date"=>"30/07/15", "debit"=>nil, "credit"=>"31,862.00", "balance"=>"136,551.13"}]
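The question asks for empty fields rather than nil. One possible follow-up step, sketched here with the result copied in literally so the snippet runs on its own (the variable name rows is illustrative):

```ruby
# Result as printed above.
result = [
  { 'date' => '28/07/15', 'debit' => '2.85', 'credit' => nil, 'balance' => '104,689.13' },
  { 'date' => '30/07/15', 'debit' => nil, 'credit' => '31,862.00', 'balance' => '136,551.13' }
]

# nil.to_s is "", so every missing field becomes an empty string.
rows = result.map { |h| h.transform_values(&:to_s) }

p rows.first['credit'] # ""
p rows.last['debit']   # ""
```

Hash#transform_values needs Ruby 2.4 or later.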
