
How to delete a row in a CSV file with PowerShell from R?

Good morning,

I'm new to PowerShell and would appreciate some help.

I have a large CSV file (around 3.5 GB) and my goal is to load it into R with fread (a data.table function), but the call stops early with a warning.

> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)

The error is:

Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
  Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line

I tried different ways to solve the problem (for example, adding fill=TRUE to the call, which did not help), but without success.

After some research I found this kind of workaround (again run from R):

>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")

The point of calling PowerShell from R is to delete a specific row so that fread can then load the file.

My question is:

How can I delete a specific row from a CSV file in PowerShell without having to specify the length of the file (its number of rows)?

Thank you in advance for any help.

Francesco

I'd hazard a guess that the location of the invalid row (or rows) is not known in advance. In that case, it is sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit from manipulation, that can be done in the same pass, before reading it into R.

A 3.5 GB file is a bit on the large side to read into memory in one go. Sure, it can be done on 64-bit systems, but for simple row processing it's unwieldy. A scalable solution uses .NET methods and a row-by-row approach.

To process the file row by row, use .NET methods for efficient line reading. A StringBuilder collects the rows that contain valid data; the others are discarded. The StringBuilder is flushed to disk every so often: even in the days of SSDs, a write operation for each row is slow compared to writing in batches of, say, 10 000 rows at a time.

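$sb = New-Object Text.StringBuilder
# .NET resolves relative paths against the process working directory,
# which can differ from PowerShell's current location; a full path is safer
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
# Number of fields a valid row must have (fread expected 29 in the question;
# adjust this to your data)
$fieldCount = 29
while($null -ne ($line = $reader.ReadLine())) {
    # Split the line on semicolons (quoted fields that contain a semicolon
    # are not handled by this simple split)
    $elements = $line -split ';'
    # If there were $fieldCount elements, add the line to the builder
    if($elements.Count -eq $fieldCount) {
        # If $line's contents need modifications, do it here
        # before adding it into the builder
        [void]$sb.AppendLine($line)
        ++$i
    }
    # Write builder contents into file every now and then
    if($i -ge $MaxRows) {
        # -NoNewline: the builder's text already ends with a line break
        Add-Content "MyCleanCsvFile.csv" $sb.ToString() -NoNewline
        [void]$sb.Clear()
        $i = 0
    }
}
# Release the file handle on the source file
$reader.Close()
# Flush the builder after the loop if there's data
if($sb.Length -gt 0) {
    Add-Content "MyCleanCsvFile.csv" $sb.ToString() -NoNewline
}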
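
The loop above hard-codes the expected field count. As a rough sketch of a variant that needs neither the field count nor the row count, the function below takes the field count from the first line of the file and streams straight to a StreamWriter instead of batching in a StringBuilder. The function name Remove-MalformedCsvRow and its parameters are hypothetical, and it assumes that the first line is a well-formed header, that no quoted field contains the delimiter, and that PowerShell 5 or later is available (for the ::new() syntax).

```powershell
function Remove-MalformedCsvRow {
    param(
        [Parameter(Mandatory)] [string] $Path,         # source CSV (full path recommended)
        [Parameter(Mandatory)] [string] $Destination,  # cleaned copy to create
        [char] $Delimiter = ';'
    )
    $reader = [System.IO.StreamReader]::new($Path)
    $writer = [System.IO.StreamWriter]::new($Destination)
    $dropped = 0
    try {
        # Assumption: the first line is a valid header that defines the field count
        $header = $reader.ReadLine()
        if ($null -eq $header) { return 0 }
        $expected = $header.Split($Delimiter).Count
        $writer.WriteLine($header)
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.Split($Delimiter).Count -eq $expected) {
                $writer.WriteLine($line)
            } else {
                # Rows with a different field count (including blank lines) are skipped
                $dropped++
            }
        }
    } finally {
        # Always release both file handles, even if an error occurs
        $reader.Dispose()
        $writer.Dispose()
    }
    return $dropped
}
```

Calling it as `Remove-MalformedCsvRow -Path C:\a\b\c\file.csv -Destination C:\a\b\c\clean.csv` returns the number of rows that were dropped, which is a cheap sanity check before loading the cleaned file with fread.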

This is easily done in PowerShell: read the CSV into a generic list, remove the line, and write the list back:

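# Note: this approach holds the whole file in memory
$csvFile = 'C:\test\myfile.csv'

# Read every line into a generic list so that a single line can be removed
$csvList = New-Object 'System.Collections.Generic.List[string]'
$csvList.AddRange( [System.IO.File]::ReadAllLines( $csvFile ) )

# 1-based number of the line to delete
$lineToDelete = 2

$csvList.RemoveAt( $lineToDelete - 1 )

# Overwrite the original file with the remaining lines
[System.IO.File]::WriteAllLines( $csvFile, $csvList )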
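
For a file of several gigabytes, the same idea can be sketched without holding the list in memory: enumerate the lines lazily and copy all but one of them to a new file. The function name Remove-CsvLine and its parameters are hypothetical, and the sketch assumes PowerShell 5 or later and writes to a separate destination file rather than overwriting the source.

```powershell
function Remove-CsvLine {
    param(
        [Parameter(Mandatory)] [string] $Path,         # source file (full path recommended)
        [Parameter(Mandatory)] [string] $Destination,  # copy without the unwanted line
        [Parameter(Mandatory)] [long]   $LineNumber    # 1-based number of the line to drop
    )
    $writer = [System.IO.StreamWriter]::new($Destination)
    try {
        [long] $current = 0
        # ReadLines enumerates lazily, so only one line is held in memory at a time
        foreach ($line in [System.IO.File]::ReadLines($Path)) {
            $current++
            if ($current -ne $LineNumber) {
                $writer.WriteLine($line)
            }
        }
    } finally {
        $writer.Dispose()
    }
}
```

With the numbers from the question, the call would look like `Remove-CsvLine -Path C:\a\b\c\file.csv -Destination C:\a\b\c\output.csv -LineNumber 458945`; no total row count is needed.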

vonPryz's helpful answer offers the best solution, given the size of your input file.

The following works too, but will be slow: partly because of the general overhead of using a pipeline, and partly because Get-Content itself is slow, since it decorates each line it reads with additional properties (see the green-lighted, but not yet implemented GitHub suggestion #7537):

# Exclude line number 458945 (0-based index 458944)
Get-Content C:/a/b/c/file.csv | Select-Object -SkipIndex 458944 > output.csv

The beneficial flip side of using the pipeline is that it acts as a memory throttle, so the above command can be used to process arbitrarily large files (though it may take a long time).
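
If the position of the bad line is not known, the same pipeline style can filter on field count instead of index. The following is only a sketch: the function name Select-WellFormedRow is hypothetical, the default of 29 fields is taken from the fread warning in the question, and quoted fields that contain the delimiter are not handled.

```powershell
function Select-WellFormedRow {
    param(
        [Parameter(ValueFromPipeline)] [string] $Line,
        [int] $FieldCount = 29,      # assumption: taken from fread's "Expected 29 fields"
        [string] $Delimiter = ';'
    )
    process {
        # -split takes a regex, so the delimiter is escaped before use
        $fields = $Line -split [regex]::Escape($Delimiter)
        # Pass the line through only if it has exactly the expected number of fields
        if ($fields.Count -eq $FieldCount) { $Line }
    }
}

# Usage (not run here):
# Get-Content C:/a/b/c/file.csv | Select-WellFormedRow | Set-Content output.csv
```

Like the command above, this streams line by line and will be slow on a 3.5 GB file. The default output encoding of Set-Content and of `>` differs between Windows PowerShell and PowerShell 6+, so it may be worth passing -Encoding explicitly and checking that fread reads the result as intended.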
