
Ruby extract data from string using regex

I'm doing some web scraping, and this is the format of the data:

Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade

The actual string that I receive has the following form:

1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M

The things I am interested in are the Course_Code, the Course_Name and the Grade. In this example the values would be:

Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A

Is there some way to use a regular expression, or some other technique, to extract this information easily instead of manually parsing the string? I'm using JRuby in 1.9 mode.

Let's use Ruby's named captures and a self-describing regex!

course_line = /
    ^                  # Starting at the front of the string
    (?<SrNo>\d+)       # Capture one or more digits; call the result "SrNo"
    \s+                # Eat some whitespace
    (?<Code>\S+)       # Capture all the non-whitespace you can; call it "Code"
    \s+                # Eat some whitespace
    (?<Name>.+\S)      # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
    \s+                # Eat some whitespace
    (?<Credit>\S+)     # Capture all the non-whitespace you can; call it "Credit"
    \s+                # Eat some whitespace
    (?<Grade>\S+)      # Capture all the non-whitespace you can; call it "Grade"
    \s+                # Eat some whitespace
    (?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
    $                  # Make sure that we're at the end of the line now
/x

str = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
parts = str.match(course_line)

puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
      Grade: #{parts['Grade']}".strip

#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=>       Grade: A
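One thing the example above does not cover is a line that fails to match: `String#match` returns `nil` in that case, so `parts['Code']` would raise a `NoMethodError`. Below is a minimal sketch of a guard. The helper name `parse_course` is hypothetical, and the pattern is a compact, uncommented copy of the one above. `Hash[names.zip(captures)]` is used because it should work on 1.9.

```ruby
# parse_course is a hypothetical helper name, not part of the answer above.
# Returns a Hash of group name => captured text, or nil if the line does not match.
def parse_course(line)
  # Same named groups as the commented pattern above, written on one line.
  pattern = /^(?<SrNo>\d+)\s+(?<Code>\S+)\s+(?<Name>.+\S)\s+(?<Credit>\S+)\s+(?<Grade>\S+)\s+(?<Attendance>\S+)$/
  m = pattern.match(line.strip)
  return nil unless m
  Hash[m.names.zip(m.captures)]
end

course = parse_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
puts course['Code'], course['Name'], course['Grade'] if course
```

Returning `nil` rather than raising lets the caller decide whether an unmatched line (a header row, say) should be skipped or reported.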

Just for fun:

str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
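tok = str.split(/\s+/)
# Peel the fixed fields off both ends; whatever is left in the middle is the name.
# The pops run right to left: Attendance_Grade, then Grade, then Credit.
data = {
  'Sr.No.'           => tok.shift,
  'Course_Code'      => tok.shift,
  'Attendance_Grade' => tok.pop,
  'Grade'            => tok.pop,
  'Credit'           => tok.pop,
  'Course_Name'      => tok.join(' ')
}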
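The same idea can be wrapped in a method that refuses lines with too few tokens. This is only a sketch: `split_course` is a hypothetical name, and the minimum of six tokens assumes the five fixed columns plus at least one word of course name.

```ruby
# split_course is a hypothetical helper name used for illustration.
def split_course(line)
  tok = line.split              # no argument: split on runs of whitespace
  return nil if tok.size < 6    # five fixed fields plus at least one name word

  sr_no, code = tok.shift(2)                # first two tokens
  credit, grade, attendance = tok.pop(3)    # last three tokens, in original order

  {
    'Sr.No.'           => sr_no,
    'Course_Code'      => code,
    'Course_Name'      => tok.join(' '),    # everything left in the middle
    'Credit'           => credit,
    'Grade'            => grade,
    'Attendance_Grade' => attendance
  }
end

p split_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
```

Using `shift(2)` and `pop(3)` keeps the fields in reading order, so the result does not depend on the order in which hash values are evaluated.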

Am I right that the delimiter is always three spaces? Then just:

serial_number, course_code, course_name, credit, grade, attendance_grade = 
  the_string.split('   ')
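A quick way to check that assumption, as a sketch using the two input strings that appear on this page: a three-space string passed to `split` is treated as a literal delimiter, so a line separated by single spaces is not split at all.

```ruby
spaced = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
fields = spaced.split('   ')
puts fields.length    # 6 fields when every delimiter is exactly three spaces

single = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
unsplit = single.split('   ')
puts unsplit.length   # 1: there is no three-space run, so nothing is split
```

In the second case the multiple assignment would put the whole line into `serial_number` and leave the other five variables `nil`.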

Assuming everything except for the course description consists of single words and there are no leading or trailing spaces:

/^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/

Your example string will yield the following match groups:

1.  1
2.  CA727
3.  PRINCIPLES OF COMPILER DESIGN
4.  3
5.  A
6.  M
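In Ruby, those groups can be read from `MatchData#captures`. A minimal sketch, with variable names chosen for illustration only:

```ruby
pattern = /^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/
line = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"

if (m = pattern.match(line))
  # Underscore-prefixed names mark the groups we do not use.
  _sr_no, code, name, _credit, grade, _attendance = m.captures
  name = name.strip   # [\w\s]+ can pick up trailing spaces when fields are widely spaced
  puts "Course_Code : #{code}"
  puts "Course_Name : #{name}"
  puts "Grade : #{grade}"
end
```

Note that `\w` matches only letters, digits and underscore, so this pattern assumes that no field contains other characters.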

This answer isn't very idiomatic Ruby, because in this case I think clarity is better than being clever. All you really need to do to solve the problem you described is to split each line on a tab or on a run of two or more whitespace characters:

line = '1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M'
array = line.split(/\t|\s{2,}/)
puts array[1], array[2], array[4]

This assumes your data is regular. If not, you will need to work harder at tuning your regular expression and possibly handling edge cases where you don't have the required number of fields.
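One way to handle the wrong-field-count case is to check the length of the split result before indexing into it. This is a sketch only: `extract_fields` is a hypothetical helper, the expected count of 6 comes from the header row in the question, and the second sample line is made up to show the failure path.

```ruby
# extract_fields is a hypothetical helper name used for illustration.
# Returns [code, name, grade], or nil when the line has the wrong number of fields.
def extract_fields(line, expected = 6)
  fields = line.strip.split(/\t|\s{2,}/)
  return nil unless fields.length == expected
  [fields[1], fields[2], fields[4]]
end

lines = [
  '1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M',
  'Total credits: 3'   # made-up irregular line
]

lines.each do |l|
  result = extract_fields(l)
  puts(result ? result.join(' | ') : "skipped: #{l.inspect}")
end
```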

A Note for Posterity

The OP changed the input string, and modified the delimiter to a single space between fields. I'll leave my answer to the original question as-is (including the original input string for reference) as it may help others besides the OP in a less-specific case.

The technical post webpages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 