Disclaimer: I am a Python novice and would very much appreciate detailed answers.
Update: Removed non-relevant code.
Update: The problem was Excel's limit on characters per cell. I provided my own solution below, based on one of the proposed answers.
I want to merge multiple .txt files into a single .csv file, one file per row. Here is some replication data. The intended output file is data_replication.csv. As you can see, only two of the five .txt files were successfully integrated into the .csv file. There you will also find the input files in .pdf form; they are unstructured, random papers I found on Google Scholar.
The function I was using was proposed by Bill Bell in 'Combine a folder of text files into a CSV with each content in a cell'.
The function I used for the conversion from .pdf to .txt was proposed by hkr in the similar question 'Convert a PDF files to TXT files':
import csv
import os
from pathlib import Path

def txt_to_csv(x):
    os.chdir('/content/drive/MyDrive/ThesisAllocationSystem/' + x)
    with open(x + '.csv', 'w', encoding='Latin-1') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        for fileName in Path('.').glob('*.txt'):
            lines = []
            with open(str(fileName.absolute()), 'rb') as one_text:
                for line in one_text.readlines():
                    lines.append(line.decode(encoding='Latin-1', errors='ignore').strip())
            csv_out.writerow([str(fileName), ' '.join(lines)])
txt_to_csv('data_replication')
I'm guessing that the data type might be the problem here, and I'd appreciate any attempt to help me.
You can use pandas for this:
from glob import glob
import pandas as pd
files = glob('/content/drive/MyDrive/ThesisAllocationSystem/*.txt') # create list of text files
data = [[i, open(i, 'rb').read()] for i in files] # create a list of lists with file names and texts
df = pd.DataFrame(data, columns=['FileName', 'Content']) # load the data in a pandas dataframe
df.to_csv('data_replication.csv') # save to csv
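One caveat with the snippet above: `open(i, 'rb').read()` returns bytes, so the CSV cells end up holding a `b'...'` representation rather than plain text. A variant that decodes each file to a string first (using Latin-1 with errors ignored, as in the question; the local `data_replication` directory here is a stand-in for the Drive path):

```python
import os
from glob import glob
import pandas as pd

# Create a small sample directory standing in for the Drive folder in the question
os.makedirs('data_replication', exist_ok=True)
with open('data_replication/paper1.txt', 'w', encoding='latin-1') as f:
    f.write('Some extracted PDF text.')

files = glob('data_replication/*.txt')
# Decode to str instead of keeping raw bytes, so cells contain text, not b'...'
data = [[f, open(f, encoding='latin-1', errors='ignore').read()] for f in files]
df = pd.DataFrame(data, columns=['FileName', 'Content'])
df.to_csv('data_replication.csv', index=False)
```

`index=False` also drops the extra unnamed index column that `to_csv` writes by default.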
Using RJ Adriaansen's proposed function as a blueprint, I created the following function for people running into the same constraint: Excel's hard limit of 32,767 characters per cell.
One approach would be to discard documents whose content exceeds the limit. However, that would have led to considerable data loss in my case.
Instead, I truncated each document to exactly 32,767 characters.
from glob import glob
import pandas as pd
def txt_to_csv(input_dir, output_dir, new_filename):
    files = glob('/content/drive/MyDrive/ThesisAllocationSystem/' + input_dir + '/*.txt')
    data = [[i, open(i, 'rb').read()] for i in files]
    df = pd.DataFrame(data, columns=['FileName', 'Content'])
    df['Content'] = df['Content'].str.slice(start=0, stop=32767)  # Excel's upper limit of characters per cell
    df.to_csv(output_dir + '/' + new_filename + '.csv', index=False)
txt_to_csv('data_replication', 'data_replication', 'trial')
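A quick way to confirm the truncation behaves as intended, using synthetic data rather than the actual files (the over-long string and file names below are purely illustrative):

```python
import pandas as pd

EXCEL_CELL_LIMIT = 32767  # Excel's hard cap on characters per cell

# Synthetic example: one document longer than the limit, one shorter
df = pd.DataFrame({'FileName': ['long.txt', 'short.txt'],
                   'Content': ['x' * 40000, 'short text']})
df['Content'] = df['Content'].str.slice(start=0, stop=EXCEL_CELL_LIMIT)

print(df['Content'].str.len().tolist())  # -> [32767, 10]
```

Documents already under the limit pass through unchanged; only the over-long ones are clipped.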