Ruby extract data from string using regex

I'm doing some web scraping, and this is the format of the data:

Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade

The actual string that I receive has the following form:

1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M

The things that I am interested in are the Course_Code, the Course_Name and the Grade. In this example the values would be:

Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A

Is there some way to use a regular expression, or some other technique, to extract this information easily instead of manually parsing the string? I'm using JRuby in 1.9 mode.

Let's use Ruby's named captures and a self-describing regex!

course_line = /
    ^                  # Starting at the front of the string
    (?<SrNo>\d+)       # Capture one or more digits; call the result "SrNo"
    \s+                # Eat some whitespace
    (?<Code>\S+)       # Capture all the non-whitespace you can; call it "Code"
    \s+                # Eat some whitespace
    (?<Name>.+\S)      # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
    \s+                # Eat some whitespace
    (?<Credit>\S+)     # Capture all the non-whitespace you can; call it "Credit"
    \s+                # Eat some whitespace
    (?<Grade>\S+)      # Capture all the non-whitespace you can; call it "Grade"
    \s+                # Eat some whitespace
    (?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
    $                  # Make sure that we're at the end of the line now
/x

str = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
parts = str.match(course_line)

puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
      Grade: #{parts['Grade']}".strip

#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=>       Grade: A
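
When many scraped rows are processed, some of them (a header row, for instance) may not fit the pattern, and `match` returns `nil` for those, so a guard is needed before indexing into the result. A minimal sketch, assuming the rows arrive as an array of strings; `COURSE_LINE`, `parse_course` and the sample rows are hypothetical names used only for illustration:

```ruby
# Compact form of the pattern above, with the same named groups.
COURSE_LINE = /^(?<SrNo>\d+)\s+(?<Code>\S+)\s+(?<Name>.+\S)\s+(?<Credit>\S+)\s+(?<Grade>\S+)\s+(?<Attendance>\S+)$/

# Returns a hash with the three fields of interest,
# or nil when the line does not match the pattern.
def parse_course(line)
  parts = COURSE_LINE.match(line.strip)
  return nil unless parts
  { :code => parts['Code'], :name => parts['Name'], :grade => parts['Grade'] }
end

# Hypothetical input: a header row followed by one data row.
rows = [
  "Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade",
  "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
]

# compact drops the nil produced by the header row.
courses = rows.map { |row| parse_course(row) }.compact
courses.each { |c| puts "#{c[:code]}: #{c[:name]} (#{c[:grade]})" }
```

The header row is skipped because it does not start with a digit, so only the data row ends up in `courses`.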

Just for fun:

str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
tok = str.split(/\s+/)
# Peel fields off both ends; whatever is left in the middle is the course name.
data = { 'Sr.No.' => tok.shift, 'Course_Code' => tok.shift,
         'Attendance_Grade' => tok.pop, 'Grade' => tok.pop, 'Credit' => tok.pop,
         'Course_Name' => tok.join(' ') }
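
The same idea of taking fields from both ends can also be written with a splat in a multiple assignment, which Ruby 1.9 allows in the middle of the left-hand side. A sketch, assuming the fields are separated by whitespace and only the course name contains spaces; the variable names are illustrative:

```ruby
str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"

# The splat collects everything between the first two and the last three fields.
sr_no, code, *name_words, credit, grade, attendance = str.split
name = name_words.join(' ')

puts code, name, grade
```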

Am I right that the delimiter is always three spaces? If so, just:

serial_number, course_code, course_name, credit, grade, attendance_grade = 
  the_string.split('   ')

Assuming everything except the course name consists of single words and there are no leading or trailing spaces:

/^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/

Your example string will yield the following match groups:

1.  1
2.  CA727
3.  PRINCIPLES OF COMPILER DESIGN
4.  3
5.  A
6.  M
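
To try this in Ruby, `MatchData#captures` returns the groups as an array. A small sketch using the single-space form of the sample line (with wider gaps between fields, group 3 would keep some trailing whitespace because `[\w\s]+` is greedy); the variable names are illustrative:

```ruby
pattern = /^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/
line = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"

m = pattern.match(line)
if m
  # captures returns the six groups in order; keep only the ones we need.
  _sr_no, code, name, _credit, grade, _attendance = m.captures
  puts "#{code} | #{name} | #{grade}"
else
  puts "line did not match"
end
```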

This answer isn't very idiomatic Ruby, because in this case I think clarity is better than cleverness. All you really need to do to solve the problem you described is to split your lines on whitespace:

line = '1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M'
array = line.split(/\t|\s{2,}/)   # split on a tab or on two or more whitespace characters
puts array[1], array[2], array[4] # Course_Code, Course_Name, Grade

This assumes your data is regular. If it is not, you will need to work harder at tuning your regular expression, and possibly handle edge cases where you don't have the required number of fields.
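
One way to handle the irregular case is to check the field count before indexing into the array. A minimal sketch, assuming six fields per row as in the header shown in the question; `EXPECTED_FIELDS` and `split_course_line` are hypothetical names:

```ruby
EXPECTED_FIELDS = 6

# Splits on a tab or on runs of two or more whitespace characters.
# Returns the array of fields, or nil when the count is not the expected one.
def split_course_line(line)
  fields = line.strip.split(/\t|\s{2,}/)
  fields.length == EXPECTED_FIELDS ? fields : nil
end

good = split_course_line('1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M')
bad  = split_course_line('1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M')

puts good.values_at(1, 2, 4).inspect if good
puts "unexpected field count" unless bad
```

The second call returns `nil` because single spaces are not treated as delimiters by this pattern, so the whole line comes back as one field.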

A Note for Posterity

The OP changed the input string, and modified the delimiter to a single space between fields. I'll leave my answer to the original question as-is (including the original input string for reference), as it may help others besides the OP in a less specific case.
