I have a dataframe with three columns: 'word', 'pos-tag' and 'label'. The words originally come from a text file. Now I would like to add another column, 'sentence#', giving the index of the sentence each word came from.
Current state:-
WORD POS-Tag Label
my PRP$ IR
name NN IR
is VBZ IR
ron VBN PERSON
. .
my PRP$ IR
name NN IR
is VBZ IR
harry VBN Person
. . IR
Desired state:-
Sentence# WORD Pos-Tag Label
1 My PRP IR
1 name NN IR
1 is VBZ IR
1 ron VBN Person
1 . . IR
2 My PRP IR
2 name NN IR
2 is VBZ IR
2 harry VBN Person
2 . . IR
Code I have used so far:
#necessary libraries
import pandas as pd
import numpy as np
import nltk
import string
document=open(r'C:\Users\xyz\newfile.txt',encoding='utf8')
content=document.read()
sentences = nltk.sent_tokenize(content)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
flat_list=[]
# flattening a nested list
for x in sentences:
    for y in x:
        flat_list.append(y)
df = pd.DataFrame(flat_list, columns=['word','pos_tag'])
#importing data to create the 'Label' column
data=pd.read_excel(r'C:\Users\xyz\pname.xlsx')
pname=list(set(data['Product']))
# 'fl' was undefined; the product list built above is 'pname'
df['Label'] = ['drug' if x in pname else 'IR' for x in df['word']]
Split your content into sentences beforehand, for example using split() on the appropriate punctuation marks, and store them in a list. Then loop with for index, line in enumerate(lines): and do what you normally do for each line, adding index to every row that goes into your df.
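A rough sketch of that idea, in case it helps. It uses only the standard library so it runs without any NLTK data: the regex split stands in for nltk.sent_tokenize and the regex tokenizer for nltk.word_tokenize, and number_sentences is a made-up helper name, not something from NLTK or pandas.

```python
import re

def number_sentences(text):
    """Return (sentence_no, token) rows, numbering sentences from 1."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    rows = []
    for index, sentence in enumerate(sentences, start=1):
        # Crude tokenizer: runs of word characters, or single punctuation marks.
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            rows.append((index, token))
    return rows

rows = number_sentences("my name is ron. my name is harry.")
for row in rows:
    print(row)
```

With the code from the question, the same pattern would apply to the flattening loop: iterate with enumerate(sentences, start=1) over the POS-tagged sentences, append (index, word, tag) instead of just (word, tag), and build the frame with pd.DataFrame(flat_list, columns=['sentence#', 'word', 'pos_tag']).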