Ruby: extracting data from a string using regular expressions

Question

I am doing some web scraping, and this is the format of the data:

Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade

The actual string I receive has the following form:

1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M

I am interested in Course_Code, Course_Name, and Grade. In this example the values would be:

Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A

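Is there a way to extract this information easily with a regular expression or some other technique, rather than parsing the string by hand? I am using JRuby in 1.9 mode.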

Answer 1

Let's use Ruby's named captures and a self-documenting regex!

course_line = /
    ^                  # Starting at the front of the string
    (?<SrNo>\d+)       # Capture one or more digits; call the result "SrNo"
    \s+                # Eat some whitespace
    (?<Code>\S+)       # Capture all the non-whitespace you can; call it "Code"
    \s+                # Eat some whitespace
    (?<Name>.+\S)      # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
    \s+                # Eat some whitespace
    (?<Credit>\S+)     # Capture all the non-whitespace you can; call it "Credit"
    \s+                # Eat some whitespace
    (?<Grade>\S+)      # Capture all the non-whitespace you can; call it "Grade"
    \s+                # Eat some whitespace
    (?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
    $                  # Make sure that we're at the end of the line now
/x

str = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
parts = str.match(course_line)

puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
      Grade: #{parts['Grade']}".strip

#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=>       Grade: A
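
As a follow-on sketch, the same pattern can be wrapped in a helper that returns every named capture as a Hash, or nil when a line does not match. The method name `parse_course_line` is hypothetical and not part of the answer above; `MatchData#names` and `MatchData#captures` are available in Ruby 1.9. Because the fields are separated by `\s+`, the pattern should also accept the single-space form shown in the question.

```ruby
# Hypothetical helper (name not from the original answer): returns a Hash
# keyed by capture name, or nil if the line does not fit the pattern.
def parse_course_line(line)
  pattern = /
    ^(?<SrNo>\d+)\s+       # serial number
    (?<Code>\S+)\s+        # course code
    (?<Name>.+\S)\s+       # course name, which may contain spaces
    (?<Credit>\S+)\s+      # credit
    (?<Grade>\S+)\s+       # grade
    (?<Attendance>\S+)$    # attendance grade
  /x
  m = pattern.match(line)
  # Pair each capture name with its captured text.
  m && Hash[m.names.zip(m.captures)]
end

row = parse_course_line("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
puts row['Code'], row['Name'], row['Grade']
```

Returning nil for non-matching lines lets the caller skip header rows or malformed input without rescuing an exception.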

Answer 2

Just for fun:

str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
tok = str.split(/\s+/)
# Shift off the leading fields and pop the trailing ones; whatever remains is the course name.
data = {'Sr.No.' => tok.shift, 'Course_Code' => tok.shift, 'Attendance_Grade' => tok.pop,
        'Grade' => tok.pop, 'Credit' => tok.pop, 'Course_Name' => tok.join(' ')}
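
A minimal sketch of the same shift/pop idea as a reusable method, assuming every field other than the course name is a single token. The method name `tokenize_course` and the minimum-token check are additions for illustration, not part of the answer above.

```ruby
# Hypothetical wrapper around the shift/pop approach. The two leading and
# three trailing single-word fields are peeled off; the rest is the name.
def tokenize_course(str)
  tok = str.strip.split(/\s+/)
  return nil if tok.size < 6   # assumption: 5 fixed fields plus at least one name word
  sr_no, code = tok.shift(2)
  credit, grade, attendance = tok.pop(3)
  { 'Sr.No.' => sr_no, 'Course_Code' => code, 'Course_Name' => tok.join(' '),
    'Credit' => credit, 'Grade' => grade, 'Attendance_Grade' => attendance }
end

p tokenize_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
```

Using `shift(2)` and `pop(3)` avoids relying on the evaluation order of the hash literal's values.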

Answer 3

Am I right in seeing that the delimiter is always 3 spaces? If so:

serial_number, course_code, course_name, credit, grade, attendance_grade = 
  the_string.split('   ')
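
A quick sketch of how much this depends on the three-space assumption. The method name `split_on_three_spaces` is hypothetical; it only wraps the `split('   ')` call above so both input forms can be compared.

```ruby
# Hypothetical wrapper: split strictly on a run of exactly three spaces.
def split_on_three_spaces(line)
  line.split('   ')
end

wide   = split_on_three_spaces("1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M")
narrow = split_on_three_spaces("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")

puts wide.size    # six fields when the delimiter really is three spaces
puts narrow.size  # the whole line comes back as a single element otherwise
```

With single-space input the destructuring assignment would put the entire line in `serial_number` and leave the other variables nil, so this approach only fits the three-space form.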

Answer 4

Assuming that everything except the course description consists of a single word, and that there is no leading or trailing whitespace:

/^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/

Your example string yields the following match groups:

1.  1
2.  CA727
3.  PRINCIPLES OF COMPILER DESIGN
4.  3
5.  A
6.  M
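
The answer gives only the pattern, so here is a hedged sketch of applying it in Ruby and picking fields by position. The method name `match_course` is hypothetical. The `strip` on the name is an addition: when fields are separated by more than one space, the greedy `[\w\s]+` group keeps all but one of the spaces that follow the name.

```ruby
# Hypothetical helper: apply the positional-group pattern and return
# [course_code, course_name, grade], or nil if the line does not match.
def match_course(line)
  m = /^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/.match(line)
  return nil unless m
  _sr_no, code, name, _credit, grade, _attendance = m.captures
  # Group 3 can carry trailing spaces with multi-space delimiters.
  [code, name.strip, grade]
end

p match_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
```

Compared with named captures, positional groups are shorter to write but tie the calling code to the order of the columns.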

Answer 5

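This answer isn't very idiomatic Ruby, because in this case I think clarity is better than cleverness. All you need to do to solve the problem you described is split the line on its whitespace delimiters (a tab, or a run of two or more whitespace characters):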

line = '1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M'
array = line.split(/\t|\s{2,}/)  # parentheses avoid Ruby's "ambiguous first argument" warning
puts array[1], array[2], array[4]

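This assumes your data is regular. If it isn't, you will need to work harder at tuning your regex, and possibly handle edge cases where a line doesn't have the required number of fields.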
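
One possible way to handle the field-count edge case mentioned above, as a sketch only. The method name `split_course_fields`, the expected count of six, and the choice to raise `ArgumentError` are assumptions for illustration, not part of the original answer.

```ruby
# Hypothetical helper: split on a tab or a run of 2+ whitespace characters
# and reject lines that do not yield exactly six fields.
def split_course_fields(line)
  fields = line.strip.split(/\t|\s{2,}/)
  unless fields.size == 6
    raise ArgumentError, "expected 6 fields, got #{fields.size}: #{line.inspect}"
  end
  fields
end

fields = split_course_fields('1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M')
puts fields[1], fields[2], fields[4]
```

A single-space line such as the one in the question would produce one field here and raise, which makes the mismatch visible instead of silently printing nil values.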

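Note for Posterity

The OP changed the input string and modified the delimiter to a single space between fields. I'll leave my answer to the original question as it is (including the original input string for reference), since it may help people other than the OP in less specific cases.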

5 answers

Answer 1: score 40, accepted, 2012-06-05 21:35:37
Answer 2: score 6, 2012-06-06 01:19:47
Answer 3: score 3, 2012-06-05 21:34:14
Answer 4: score 3, 2012-06-05 21:36:21
Answer 5: score 1, 2012-06-05 21:36:38
