
Convert multiple .txt files to CSV without data loss

Disclaimer: I am a Python novice and would very much appreciate detailed answers.

Update: Removed non-relevant code.

Update: The problem was Excel's limit on the number of characters per cell. I provided my own solution, based on a proposed solution below.

I want to merge multiple .txt files into a single .csv file by row. Here is some replication data. The attempted output file is data_replication.csv. As you can see, only two of the five .txt files were successfully integrated into the .csv file. There, you'll also be able to find the input files in .pdf form. They are unstructured, random papers I found on Google Scholar.

The function I used for the conversion from .pdf to .txt was proposed by hkr in the similar question 'Convert a PDF files to TXT files'.

The function for combining the text files into a CSV was proposed by Bill Bell in 'Combine a folder of text files into a CSV with each content in a cell':

import csv
import os
from pathlib import Path

def txt_to_csv(x):
    os.chdir('/content/drive/MyDrive/ThesisAllocationSystem/' + x)
    with open(x + '.csv', 'w', encoding='Latin-1') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        # one row per text file: file name in the first cell, full text in the second
        for fileName in Path('.').glob('*.txt'):
            lines = []
            with open(str(fileName.absolute()), 'rb') as one_text:
                for line in one_text.readlines():
                    lines.append(line.decode(encoding='Latin-1', errors='ignore').strip())
            csv_out.writerow([str(fileName), ' '.join(lines)])

txt_to_csv('data_replication')
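
hkr's .pdf-to-.txt function is not reproduced in this post. For reference, here is a minimal sketch of one way to do that step, using pdfminer.six's extract_text; this is an assumption, not necessarily the approach from hkr's answer:

from pathlib import Path
from pdfminer.high_level import extract_text  # pip install pdfminer.six

def pdf_dir_to_txt(pdf_dir):
    # hypothetical helper: write a .txt next to every .pdf in the directory
    for pdf_path in Path(pdf_dir).glob('*.pdf'):
        text = extract_text(str(pdf_path))  # extract the plain text of one PDF
        pdf_path.with_suffix('.txt').write_text(text, encoding='Latin-1', errors='ignore')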

I'm guessing that the data type might be the problem here, and I'd appreciate any attempt to help me.

You can use pandas for this:

from glob import glob
import pandas as pd

files = glob('/content/drive/MyDrive/ThesisAllocationSystem/*.txt')  # create list of text files
# read each file and decode the bytes, so the CSV stores text rather than a bytes literal
data = [[i, open(i, 'rb').read().decode('Latin-1', errors='ignore')] for i in files]
df = pd.DataFrame(data, columns=['FileName', 'Content'])  # load the data into a pandas dataframe
df.to_csv('data_replication.csv')  # save to csv
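
Note that to_csv writes the DataFrame index as an extra first column by default; passing index=False, as in the adapted function below, omits it.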

Using RJ Adriaansen's proposed function as a blueprint, I created the following function for people suffering under the same constraint: Excel's hard limit of 32,767 characters per cell.

One approach would be to forgo the documents whose content exceeds the limit. However, that would have led to considerable data loss in my case.

Instead, I sliced each document down to the first 32,767 characters.

from glob import glob
import pandas as pd

def txt_to_csv(input_dir, output_dir, new_filename):
    files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
    # decode the bytes so slicing counts characters, not raw bytes
    data = [[i, open(i, 'rb').read().decode('Latin-1', errors='ignore')] for i in files]
    df = pd.DataFrame(data, columns=['FileName', 'Content'])
    df['Content'] = df['Content'].str.slice(start=0, stop=32767)  # Excel's upper limit of characters per cell
    df.to_csv(output_dir + '/' + new_filename + '.csv', index=False)

txt_to_csv('data_replication', 'data_replication', 'trial')
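
Slicing keeps the CSV readable in Excel, but it still discards everything past the limit, so it is not fully lossless. As a sketch of an alternative (an adaptation with a hypothetical name, txt_to_csv_chunked, assuming the same directory layout), you can split each document into cell-sized chunks and write one row per chunk so no content is dropped:

from glob import glob
import pandas as pd

EXCEL_CELL_LIMIT = 32767  # Excel's maximum number of characters per cell

def txt_to_csv_chunked(input_dir, output_dir, new_filename):
    files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
    rows = []
    for path in files:
        with open(path, 'rb') as fh:
            text = fh.read().decode('Latin-1', errors='ignore')
        # one row per chunk, so documents longer than the limit keep all their text
        for n, start in enumerate(range(0, len(text), EXCEL_CELL_LIMIT)):
            rows.append([path, n, text[start:start + EXCEL_CELL_LIMIT]])
    df = pd.DataFrame(rows, columns=['FileName', 'Chunk', 'Content'])
    df.to_csv(output_dir + '/' + new_filename + '.csv', index=False)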
