First of all, thank you for taking the time to help me!
I am currently working on a machine learning problem in Python where I have to extract several specific sections from a large text file to train a classification algorithm. The texts then have to be saved in CSV format, each with its corresponding ID-num and label/category from an Excel sheet.
The CSV file should look like this: https://imgur.com/a/3cntJlL
The Excel sheet contains many columns, of which only the ID-number and label columns should be used.
Here you can see part of the Excel sheet: https://imgur.com/a/AZlWdeE
The IDNUM column is the ID-number, which connects a row of the Excel sheet to a specific text. The AType1 column is the corresponding label, which also has to be saved.
Here you can see part of one of the text files: https://imgur.com/a/Yns8HAC
The text which should be extracted goes from the word "Text:" to the point where two "*" (stars) appear directly after each other on two consecutive lines. The ID-num is placed above the section, as the picture shows.
I have been trying to split the document, but I can't seem to figure out how to make the CSV file contain information from both the Excel sheet and the text file. Ideally, a single script would do this in one run and could then loop through several large text files.
So, my problem is to create a script which can:
I hope there is someone who can help me. I am a beginner with Python, so making this kind of script is pretty challenging.
Looking forward to hearing your ideas!
// Rasmus
It would be good for you to familiarize yourself with the pandas library.
Pandas ( https://pandas.pydata.org/docs/ ) will allow you to read a CSV file (or an Excel sheet, via read_excel) into what is called a DataFrame and manipulate the data by column name and by row. You can also put your results into a pandas DataFrame and write them to a CSV file.
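As a rough starting point, here is a minimal sketch of how the pieces could fit together. I cannot see the exact file layout from the screenshots, so several things are assumptions: that each section is preceded by a line of the form "IDNUM: <digits>", that the text ends at two consecutive lines consisting of stars, and that the output column is called Text. The sample string and the labels DataFrame are made-up stand-ins for your text file and Excel sheet; only the column names IDNUM and AType1 come from your question.

```python
import re
import pandas as pd

# Assumed layout (adjust the pattern to the real files):
#   IDNUM: 1001
#   ...other header lines...
#   Text: body of the section
#   *
#   *
SECTION_RE = re.compile(
    r"IDNUM:\s*(?P<idnum>\d+)"   # assumed ID line above the section
    r".*?"                       # anything between the ID and "Text:"
    r"Text:\s*(?P<text>.*?)"     # the body, matched lazily
    r"\n\*+[ \t]*\n\*+",         # two consecutive lines of stars
    re.DOTALL,
)

def extract_sections(raw):
    """Return one dict per section found in raw, with keys IDNUM and Text."""
    return [
        {"IDNUM": int(m.group("idnum")), "Text": m.group("text").strip()}
        for m in SECTION_RE.finditer(raw)
    ]

# Made-up stand-in for the contents of one text file.
sample = (
    "IDNUM: 1001\n"
    "Date: 2020-01-01\n"
    "Text: First report body,\n"
    "second line.\n"
    "*\n"
    "*\n"
    "IDNUM: 1002\n"
    "Text: Second report body.\n"
    "*\n"
    "*\n"
)

# Made-up stand-in for the Excel sheet. With a real file this would be
# something like pd.read_excel("labels.xlsx") (hypothetical file name).
labels = pd.DataFrame({
    "IDNUM": [1001, 1002, 1003],
    "AType1": ["A", "B", "C"],
    "Other": ["x", "y", "z"],    # extra column that should be ignored
})

texts = pd.DataFrame(extract_sections(sample))

# Keep only the two needed columns and join on the shared ID.
# how="inner" drops IDs that have no extracted text (1003 here).
merged = labels[["IDNUM", "AType1"]].merge(texts, on="IDNUM", how="inner")

# Returns the CSV as a string; pass a file path instead to write a file.
csv_text = merged.to_csv(index=False)
print(csv_text)
```

To process several files, you could call extract_sections on the contents of each file, collect all the resulting dicts in one list, and build the texts DataFrame from that list before merging. Note that if a section in the real files has an ID line but no "Text:" part, this pattern would run on into the next section, so it is worth checking the results against a few known IDs.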