How to match an Excel sheet cell (ID number) with a number in a text file, then extract the text and save it with its ID and label as CSV

First of all, thank you for taking the time to help me!

I am currently working on a machine learning problem in Python where I have to extract several specific sections from a large text file to train a classification algorithm. Each extracted text then has to be saved in a CSV file together with its corresponding ID number and label/category from an Excel sheet.

The CSV file should look like this: https://imgur.com/a/3cntJlL

The Excel sheet contains many columns, of which only the ID-number and label columns should be used.

Here you can see part of the Excel sheet: https://imgur.com/a/AZlWdeE

The IDNUM column holds the ID number, which links a row of the Excel sheet to a specific text. The AType1 column holds the corresponding label, which also has to be saved.

Here you can see part of one of the text files: https://imgur.com/a/Yns8HAC

The text to be extracted runs from the word "Text:" to the point where two "*" (stars) appear on two consecutive lines. The ID number is placed above the section, as the picture shows.
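
A minimal sketch of how such a section could be pulled out with a regular expression. The exact layout of the text file is only visible in the screenshot, so the `IDNUM: <digits>` line used here is an assumption and would need to be adapted to the real file; `extract_sections` is a hypothetical helper name.

```python
import re

# Assumed layout of one section (hypothetical, adapt to the real file):
#   IDNUM: 1001
#   ...other lines...
#   Text: first line of the text
#   more text
#   *
#   *
SECTION_RE = re.compile(
    r"IDNUM:[ \t]*(?P<idnum>\d+)"        # assumed ID line above the section
    r".*?"                               # skip anything up to the text marker
    r"Text:\s*(?P<text>.*?)"             # the body, matched lazily
    r"\n[ \t]*\*[ \t]*\n[ \t]*\*",       # two consecutive lines holding a star
    re.DOTALL,
)


def extract_sections(raw):
    """Return a list of (idnum, text) tuples found in the raw file content."""
    sections = []
    for match in SECTION_RE.finditer(raw):
        # Collapse line breaks and repeated spaces so each text fits one CSV cell.
        text = " ".join(match.group("text").split())
        sections.append((match.group("idnum"), text))
    return sections
```

The IDs are kept as strings so that leading zeros survive. If a section is missing its "Text:" marker, the lazy `.*?` would run on into the next section, so real data may need a stricter pattern.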

I have been trying to split the document, but I can't seem to figure out how to build a CSV file that contains information from both the Excel sheet and the text file. Ideally, a single script would do this in one run and could then loop through several large text files.

So, my problem is to create a script that can:

  1. Match excel cell content (ID-number) with text
  2. Extract a section of the text between two delimiters ("Text:" and "* \n *")
  3. Save the text, ID-number and label in a CSV file.

I hope someone can help me. I am at a beginner level with Python, so writing this kind of script is pretty challenging.

Looking forward to hearing your ideas!

// Rasmus

It would be good for you to familiarize yourself with the pandas library.

Pandas ( https://pandas.pydata.org/docs/ ) will allow you to read a CSV or Excel file into what is called a dataframe and manipulate the data by column name and by row. You can also put your results into a pandas dataframe and write them to a CSV file.
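
As a rough sketch of that idea, the extracted texts can be put into one dataframe and joined to the Excel labels on the ID column. The column names `IDNUM` and `AType1` come from the question; the function name `build_dataset`, the `text` column name, and the file names in the comments are assumptions for illustration.

```python
import pandas as pd


def build_dataset(labels, sections):
    """Join (idnum, text) pairs to their labels and return one dataframe.

    labels   -- dataframe holding at least the IDNUM and AType1 columns
    sections -- iterable of (idnum, text) pairs extracted from the text file
    """
    texts = pd.DataFrame(list(sections), columns=["IDNUM", "text"])
    labels = labels[["IDNUM", "AType1"]].copy()
    # Compare IDs as strings so that 1001 (Excel) matches "1001" (text file).
    labels["IDNUM"] = labels["IDNUM"].astype(str)
    texts["IDNUM"] = texts["IDNUM"].astype(str)
    # An inner join keeps only the texts whose ID also appears in the sheet.
    return texts.merge(labels, on="IDNUM", how="inner")


# Possible usage (file names are hypothetical):
#   labels = pd.read_excel("labels.xlsx", usecols=["IDNUM", "AType1"])
#   dataset = build_dataset(labels, [("1001", "some extracted text")])
#   dataset.to_csv("dataset.csv", index=False)
```

If the ID column in the sheet contains empty cells, pandas may load it as floats, and the string form would then be "1001.0"; in that case the column would need cleaning before the merge. To process several text files, the extracted pairs from each file can be collected into one list before calling the function.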
