I have a text file to parse. Each record's content is spread over several lines, and the number of lines per record is not fixed. The content of the file looks like this:
ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent
I want to slice it wherever there is a value in the first tab-separated column (the ID column is empty on the continuation lines, so this should be a reliable way to detect a new record).
My current code splits it into chunks of five lines and then merges each chunk:
f = File.read(file).each_line
f.each_slice(5) do | slice_to_handle |
merged_row = slice_to_handle.join.delete("\n").split("\t").collect(&:strip)
# Dealing with the data here..
end
I need to modify this to slice it as soon as there is an ID set in the first column.
File.read(file)
.split(/^(?!\t)/)
.map{|record| record.split("\t").map(&:strip)}
Result
[
[
"ID",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content"
],
[
"ID",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content",
"content"
],
[
"ID",
"content",
"content",
"content",
"content"
]
]
Ruby's Array includes Enumerable, which provides slice_before, which is your friend:
text_file = "ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent".split("\n")
text_file.slice_before(/^ID/).map(&:join)
Which looks like:
[
"ID\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent",
"ID\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent",
"ID\tcontent\tcontent\tcontent\tcontent"
]
text_file is an array of lines, similar to what you'd get if you slurped a file using readlines.
slice_before iterates over the array looking for matches to the /^ID/ pattern, and creates a new sub-array each time it's found.
map(&:join) walks over the sub-arrays and joins their contents into a single string.
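Note that /^ID/ only matches when every identifier literally begins with ID. As a minimal sketch, assuming real IDs can be arbitrary text and that continuation lines always start with a tab (as the question describes), slice_before can take a block instead of a pattern. The lines array and its values are made up for illustration:

```ruby
# Hypothetical sample data: arbitrary IDs, continuation lines start with a tab.
lines = [
  "A17\tfoo\tbar",
  "\tbaz\tqux",
  "B42\tone\ttwo",
  "\tthree\tfour",
  "\tfive\tsix",
  "C99\tlast\tline"
]

# Start a new slice at every line that does not begin with a tab,
# then merge each slice into one flat array of fields.
records = lines
  .slice_before { |line| !line.start_with?("\t") }
  .map { |chunk| chunk.join.split("\t") }
```

Here records holds three arrays, one per ID, each starting with the ID followed by all of its content fields.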
This is not very scalable though. Using it, you'd be relying on being able to slurp the entire file into memory, which can stop a machine in its tracks when the file is large. Instead, it's better to read the content line-by-line, break it into blocks, and process each block as soon as possible.
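One possible way to do that, sketched under the assumption that continuation lines always start with a tab: File.foreach without a block returns an enumerator that yields one line at a time, and calling slice_before on it hands over each record once the next ID line is seen. The helper name each_record and the file name data.txt are made up for this example:

```ruby
# Hypothetical helper: yields one record at a time as an array of fields.
# `lines` can be any enumerator of lines, e.g. File.foreach(path).
def each_record(lines)
  return enum_for(:each_record, lines) unless block_given?

  lines
    .slice_before { |line| !line.start_with?("\t") } # new record at each non-tab line
    .each { |chunk| yield chunk.map(&:chomp).join.split("\t") }
end

# Usage sketch: only the lines of the current record are held in memory.
# each_record(File.foreach("data.txt")) do |fields|
#   id, *content = fields
#   # Dealing with the data here..
# end
```

Because the helper accepts any enumerator of lines, it can be tried out on a string with each_line before pointing it at a real file.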