How can I split a large CSV file into multiple files of roughly equal size using bash tools alone?

Note that the CSV file may or may not have multiple line breaks in each cell, and each split file must also be a valid CSV file.

I have tried using split. However, if I split by number of lines, it does not take into account that the CSV can have line breaks inside fields, and if I split by file size, it sometimes cuts the last line of a file in half, so that the file is no longer valid CSV.

You can find a test file here: https://pastebin.com/raw/pw9PF9U1

It looks like this:

post_title,tax:wcpv_product_vendors,post_content
Product title 1,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 2,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 3,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 4,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 5,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 6,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 7,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 8,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"
Product title 9,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"

Also note that each row of the CSV has a ^M symbol (a carriage return) at the end of it when I open it in vim. That may be useful for splitting correctly.

Split file.csv into 5 files:

split -n 5 file.csv

If you need to support embedded newline characters, then there is no easy way to do this right using Bash alone. Otherwise split could have been a good choice.

You could implement a CSV parser (of your desired dialect) in Bash, but it would be a lot of work for a fragile solution.

It's better not to use Bash for this, but another language with good library support for proper CSV parsing, such as Python, which ships with a csv module in its standard library.
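As a rough illustration of that suggestion, here is a minimal sketch using Python's standard csv module. The function name split_csv_text and the choice to balance the pieces by record count (rather than by byte size) are assumptions made for this example, as is repeating the header row in every piece so that each one stands alone as a CSV file.

```python
import csv
import io


def split_csv_text(text, parts):
    """Split CSV text into at most `parts` pieces with a similar record count.

    Every piece starts with the header row, so each is a valid CSV by itself.
    """
    # newline="" leaves line endings untouched so the csv module can tell
    # record terminators apart from line breaks inside quoted cells.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    header, records = rows[0], rows[1:]
    # Ceiling division; at least 1 so range() below gets a valid step.
    size = max(1, -(-len(records) // parts))
    pieces = []
    for start in range(0, len(records), size):
        out = io.StringIO(newline="")
        writer = csv.writer(out)  # default dialect writes \r\n row endings
        writer.writerow(header)
        writer.writerows(records[start:start + size])
        pieces.append(out.getvalue())
    return pieces
```

To use it on a real file, read the text with open("file.csv", newline="") and write each returned piece with open(name, "w", newline=""). Since records can differ in length, equal record counts only approximate equal file sizes.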

Here's one in awk. You give it the file name and the maximum number of records (for example -v m=3) you want in one file, and it starts a new record on every line that does not begin with <, which for your data means the header and the product title lines. Note that the header counts as one record and only ends up in the first file:

$ awk -v m=3 'NR==1{j=0}{if($0!~/^</){i++;if(i>m){i=1;j++}};print > ("split-" j)}' file
$ ls -1rt
split-3
split-2
split-1
split-0
$ cat split-3
Product title 9,Sample,"<div class=""productdetails"">
<h2 style=""margin: 0px 0px 15px; line-height: 1.2; text-align: center;"">Title</h2>
<p style=""color: #333333; margin: 0px; font-size: 13px; line-height: 23.1111px; padding: 0px; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS';""><strong>Features:</strong></p>
<ul style=""padding: 0px 40px; margin: 0px; color: #333333; font-family: sans-serif, Arial, Verdana, 'Trebuchet MS'; font-size: 13px; line-height: 20.8px;"">
<li style=""list-style: none;"">Testing testing</li>
<li style=""list-style: none;"">One two three</li>
</ul>
</div>"

Explained:

awk -v m=3 '             # provide m, the maximum number of records per file
NR==1 {                  # on the first line
    j=0                  # set the file index j to 0
}
{
    if($0!~/^</) {       # a line not starting with < begins a new record
        i++              # increase the record counter
        if(i>m) {        # if the record counter exceeds the maximum
            i=1          # reset it to 1
            j++          # and move on to the next output file
        }
    }
    print > ("split-" j) # write the line to the current output file
}' file
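
The awk program recognises the start of a record by the absence of a leading <, which is specific to this data. A more general heuristic, sketched below in Python as an assumption rather than as part of the answer above, is to track whether the double quotes seen so far are balanced: a record can only end on a line where they are. The names logical_records and chunk_records are made up for illustration, and the sketch assumes that fields are quoted with " and that embedded quotes are doubled, as in the sample file.

```python
def logical_records(lines):
    """Group physical lines into CSV records by tracking quote parity."""
    record, in_quotes = [], False
    for line in lines:
        record.append(line)
        # A doubled quote ("") adds two to the count, so only a quote that
        # opens or closes a field changes the parity.
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes
        if not in_quotes:
            yield "".join(record)
            record = []
    if record:  # input ended inside an unterminated quoted field
        yield "".join(record)


def chunk_records(lines, m):
    """Return lists holding at most m records each, in input order."""
    chunks, current = [], []
    for rec in logical_records(lines):
        if len(current) == m:
            chunks.append(current)
            current = []
        current.append(rec)
    if current:
        chunks.append(current)
    return chunks
```

As with the awk version, the header is counted as an ordinary record here and is not repeated in later chunks.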

Here is a method that keeps split usable.

The idea is to use a null byte (\0) instead of a newline (\n) as the record separator for splitting.

First, we can use sed to add a \0 to the beginning of each line that does not begin with <:

sed 's/^[^<]/\x0&/' file.csv > file_tmp.csv

Next, we can use split as usual:

split -n l/5 -t '\0' --filter='sed "s/\x0//g" > $FILE.csv' file_tmp.csv split_
  • -n l/5 splits the file into 5 parts of roughly equal size without splitting records
  • -t '\0' uses the null byte as the record separator
  • --filter='sed "s/\x0//g" > $FILE.csv' removes all null bytes from the split files
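
Because this approach relies on the "line does not begin with <" heuristic, it may be worth checking the pieces afterwards. The following is a minimal Python sketch of such a check; the function names are hypothetical, and it assumes the pieces are given in order and that the header appears only in the first one, as it would with the split command above.

```python
import csv
import io


def parse(text):
    """Parse CSV text into a list of rows, keeping embedded line breaks."""
    return list(csv.reader(io.StringIO(text, newline="")))


def pieces_are_consistent(original, pieces):
    """Check that the pieces, parsed separately, reproduce the original rows."""
    expected = parse(original)
    width = len(expected[0])  # column count taken from the header row
    combined = []
    for piece in pieces:
        rows = parse(piece)
        # A piece cut in the middle of a record yields rows of the wrong width.
        if any(len(row) != width for row in rows):
            return False
        combined.extend(rows)
    return combined == expected
```

Each piece is parsed on its own, so a cut inside a quoted cell shows up either as a row with the wrong number of columns or as a mismatch against the original rows.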
