
I have 200 text files in Hindi. I want to remove whitespace and special characters, then find the bigrams and trigrams, in Python

import os

dir=os.getcwd()
print(dir)
dir1=os.path.join(dir,"test")
filename=os.listdir(dir1)
bad_chars = [';', ':', '!', "*","#","%"]
for i in filename:
    filepath=os.path.join(dir1,i)  #  the path
    file=open(filepath,"r",encoding="utf8") #open first text file
    read_=file.read()
    fields = read_.split(" ")
    print(fields)
    file1=open(filepath,"w",encoding="utf8")
    file2=open(filepath,"a",encoding="utf8")
    for j in range(len(fields)):        
        for p in bad_chars :
            fields[j].replace(i,' ')
            file2.write(fields[j])
            print ("Resultant list is : " , fields[j])
file.close()
file1.close()
file2.close()

I am trying to remove the special characters from all 200 text files.

This is the code for bigrams that I found.

Example: for the input "my name is eshan", the output should be: "my, name" occurs 1; "name, is" occurs 1; "is, eshan" occurs 1. A count can be more than 1, depending on the text.

Try this way:

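import os

# same folder and character list as in the question
dir1 = os.path.join(os.getcwd(), "test")
bad_chars = [';', ':', '!', '*', '#', '%']

for name in os.listdir(dir1):
    filepath = os.path.join(dir1, name)

    # read the whole file as UTF-8 so the Hindi text is decoded correctly
    with open(filepath, 'r', encoding='utf8') as f:
        texts = f.read()

    # str.replace returns a new string, so assign the result back
    for c in bad_chars:
        texts = texts.replace(c, ' ')

    # write the cleaned text back to the same file
    with open(filepath, 'w', encoding='utf8') as f:
        f.write(texts)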
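
The cleaning step does not yet count bigrams or trigrams. A minimal sketch of that part, using only the standard library, could look like the following. The helper names ngrams and count_ngrams are made up for this example, and it assumes that words are separated by whitespace, which is a simplification for Hindi text (for instance, the danda "।" is not in bad_chars and would stay attached to the preceding word unless you add it).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))

def count_ngrams(text, n, bad_chars=(';', ':', '!', '*', '#', '%')):
    """Clean the text, split it into words and count its n-grams (hypothetical helper)."""
    # replace every unwanted character with a space
    for c in bad_chars:
        text = text.replace(c, ' ')
    # split() with no argument also collapses runs of spaces, tabs and newlines
    tokens = text.split()
    return Counter(ngrams(tokens, n))

sample = "my name is eshan"
bigram_counts = count_ngrams(sample, 2)
trigram_counts = count_ngrams(sample, 3)

for gram, count in bigram_counts.items():
    print(", ".join(gram), "occurs", count)
```

To get totals over all 200 files, one option is to keep a single Counter and call its update() method with the result of count_ngrams for each file's text.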
