Note that the CSV file may or may not have multiple line breaks in each cell, and each split file must also be a valid CSV file.
I have tried using split. If I split by number of lines, it doesn't take into account that the CSV can have line breaks inside fields. If I split by file size, it sometimes cuts the last line of a file in half, so the result is no longer a valid CSV file.
You can find a test file here: https://pastebin.com/raw/pw9PF9U1
It looks like this:
post_title,tax:wcpv_product_vendors,post_content
Product title 1,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 2,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 3,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 4,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 5,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 6,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 7,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 8,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 9,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Also note that each row of the CSV has a ^M symbol (a carriage return) at the end of it when I open it in vim. That may be useful for splitting correctly.
Split file.csv into 5 files:
split -n 5 file.csv
If you need to support embedded newline characters, then there is no easy way to do this right using Bash alone. Otherwise, split could have been a good choice.
You could implement a CSV parser (of your desired dialect) in Bash, but it would be a lot of work for a fragile solution.
It's better not to use Bash for this, but some other language with good library support for proper CSV parsing, such as Python, which comes with a csv module included.
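As a rough illustration of the Python suggestion above, here is a minimal sketch built on the standard csv module. The function name split_csv, the rows_per_file parameter and the split-N.csv output names are made up for this example. It also assumes the input is UTF-8 and that the first record is a header that should be repeated in every part.

```python
import csv
from itertools import islice
from pathlib import Path


def split_csv(src, out_dir, rows_per_file):
    """Split src into CSV files holding at most rows_per_file records each.

    The header record is repeated at the top of every output file.
    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # newline="" leaves line-ending handling to the csv module, so CRLF row
    # terminators and newlines inside quoted fields are parsed as intended.
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return written  # empty input: nothing to write
        while True:
            # Take the next batch of whole records (not physical lines).
            chunk = list(islice(reader, rows_per_file))
            if not chunk:
                break
            path = out_dir / f"split-{len(written)}.csv"
            with open(path, "w", newline="", encoding="utf-8") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
            written.append(path)
    return written
```

Calling split_csv("file.csv", "parts", 3) should leave files such as parts/split-0.csv, each with the header plus up to three complete records. Because the parser counts records rather than lines, multi-line cells stay intact. The csv writer ends rows with \r\n by default, which matches the ^M line endings mentioned in the question.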
Here's one in awk. You provide it with the filename and the maximum number of "lines" (for example -v m=3) you want in one file, and it splits the file (based on your data) on lines that do not start with <, so basically the header and the product title lines:
$ awk -v m=3 'NR==1{j=0}{if($0!~/^</){i++;if(i>m){i=1;j++}};print > "split-" j}' file
$ ls -1rt
split-3
split-2
split-1
split-0
$ cat split-3
Product title 9,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Explained:
awk -v m=3 '                # provide m, the maximum number of "lines" per file
NR==1 {                     # on the first record
    j=0                     # set j to 0
}
{
    if($0!~/^</) {          # when a line not starting with a < is met
        i++                 # increase line counter
        if(i>m) {           # if line counter exceeds max
            i=1             # reset it back to 1
            j++             # increase the split file name index
        }
    }
    print > "split-" j      # output to the current split file
}' file
Here is a method that still makes split usable.
The idea here is to use a null byte character \0 instead of a newline character \n as the record separator for splitting.
First, we can use sed to add a \0 to the beginning of each line that does not begin with <:
sed 's/^[^<]/\x0&/' file.csv > file_tmp.csv
Next, we can use split as usual:
split -n l/5 -t '\0' --filter='sed "s/\x0//g" > $FILE.csv' file_tmp.csv split_
-n l/5 splits the file into roughly 5 equal parts without splitting records
-t '\0' uses the null byte character as the record separator
--filter='sed "s/\x0//g" > $FILE.csv' removes all null byte characters from the split files