
Python: Extract non-English words from each row of a dataframe

I have a table of about 30,000 rows and need to extract the non-English words from a text column (named outcome in the dummy_df dataframe below). I need to put the non-English words in an adjacent column named non_english.

Some dummy data:

import pandas
dummy_df = pandas.DataFrame({'outcome': ["I want to go to church", "I love Matauranga", "Take me to  Oranga Tamariki"]})

My idea is to extract non-English words from a sentence, and then iterate the process over a dataframe. I was able to accurately extract non-English words from a sentence with this code:

import nltk
nltk.download('words')
from nltk.corpus import words

words = set(nltk.corpus.words.words())

sent = "I love Matauranga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
         if not w.lower() in words or not w.isalpha())

The result of the above code is 'Matauranga', which is correct.
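
To see what the filter condition keeps, here is a minimal sketch that applies the same test against a tiny stand-in vocabulary instead of the NLTK corpus. The names tokenize, extract_non_english and sample_vocab are illustrative only, and the regular expression is assumed to split text the same way wordpunct_tokenize does.

```python
import re

def tokenize(text):
    # Split into runs of word characters and runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

def extract_non_english(text, vocabulary):
    # Keep a token if it is missing from the vocabulary
    # or if it is not purely alphabetic (punctuation, digits).
    return " ".join(
        w for w in tokenize(text)
        if w.lower() not in vocabulary or not w.isalpha()
    )

# Hypothetical stand-in for set(nltk.corpus.words.words()).
sample_vocab = {"i", "love", "want", "to", "go", "church", "take", "me"}

print(extract_non_english("I love Matauranga", sample_vocab))
print(extract_non_english("Take me to  Oranga Tamariki", sample_vocab))
print(extract_non_english("I want to go to church!", sample_vocab))
```

The last call returns "!" because the `or not w.isalpha()` clause also keeps punctuation and numbers. If only words are wanted, a condition such as `w.isalpha() and w.lower() not in vocabulary` would drop them.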

But when I try to apply the same code to every row of the dataframe using this code:

import nltk
nltk.download('words')
from nltk.corpus import words

def no_english(text):
  words = set(nltk.corpus.words.words())
  " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
         if not w.lower() in words or not w.isalpha())

dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)

I got an undesirable result: the non_english column contains None instead of the non-English words (see below):

                       outcome non_english
0       I want to go to church        None
1            I love Matauranga        None
2  Take me to  Oranga Tamariki        None
3                                     None

Instead, the desired result should be:

                       outcome        non_english
0       I want to go to church        
1            I love Matauranga        Matauranga
2  Take me to  Oranga Tamariki        Oranga Tamariki

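Your function is missing a return statement. It builds the joined string but never returns it, so it returns None for every row. Add return before the join expression: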

import nltk
nltk.download('words')
from nltk.corpus import words

def no_english(text):
    words = set(nltk.corpus.words.words())
    return " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
           if not w.lower() in words or not w.isalpha())

dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)

output:

                       outcome      non_english
0       I want to go to church                 
1            I love Matauranga       Matauranga
2  Take me to  Oranga Tamariki  Oranga Tamariki
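
The None values come from how Python functions work: a function that finishes without a return statement returns None, and apply stores whatever the function returns for each row. A minimal sketch in plain Python, with hypothetical function names and a simple whitespace join standing in for the NLTK filter:

```python
def without_return(text):
    # The expression is evaluated, but its value is discarded.
    " ".join(text.split())

def with_return(text):
    # The same expression, handed back to the caller.
    return " ".join(text.split())

missing = without_return("I love Matauranga")
fixed = with_return("I love Matauranga")

print(missing)  # None
print(fixed)    # I love Matauranga
```

The corrected function above also rebuilds the word set on every row. Creating the set once outside the function and reusing it should give the same result with less work on a 30,000-row table.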

