简体   繁体   中英

Preprocessing corpus stored in DataFrame with NLTK

I'm learning NLP and I'm trying to understand how to perform pre-processing on a corpus stored in a pandas DataFrame. So let's say I have this:

import pandas as pd

doc1 = """"Whitey on the Moon" is a 1970 spoken word poem by Gil Scott-Heron. It was released as the ninth track on Scott-Heron's debut album Small Talk at 125th and Lenox. It tells of medical debt and poverty experienced during the Apollo Moon landings. The poem critiques the resources spent on the space program while Black Americans were experiencing marginalization. "Whitey on the Moon" was prominently featured in the 2018 biographical film about Neil Armstrong, First Man."""
doc2 = """St Anselm's Church is a Roman Catholic church which is part of the Personal Ordinariate of Our Lady of Walsingham in Pembury, Kent, England. It was originally founded in the 1960s as a chapel-of-ease before becoming its own quasi-parish within the personal ordinariate in 2011, following a conversion of a large number of disaffected Anglicans in Royal Tunbridge Wells."""
doc3 = """Nymphargus grandisonae (common name: giant glass frog, red-spotted glassfrog) is a species of frog in the family Centrolenidae. It is found in Andes of Colombia and Ecuador. Its natural habitats are tropical moist montane forests (cloud forests); larvae develop in streams and still-water pools. Its habitat is threatened by habitat loss, introduced fish, and agricultural pollution, but it is still a common species not considered threatened by the IUCN."""

df = pd.DataFrame({'text': [doc1, doc2, doc3]})

Which results in:

+---+---------------------------------------------------+
|   |                                              text |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+

Now, I load what I need and tokenize the text:

import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

df['tokenized_text'] = df['text'].apply(word_tokenize)
df

Which gives the following output:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, (, common, name, :, ... |
+---+---------------------------------------------------+---------------------------------------------------+

Now, my problem occurs when removing stop words:

df['tokenized_text'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not  in [stop_words] + list(string.punctuation)])

Which looks like nothing happened:

+---+---------------------------------------------------+---------------------------------------------------+
|   |                                              text |                                    tokenized_text |
+---+---------------------------------------------------+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... | [``, Whitey, on, the, Moon, '', is, a, 1970, s... |
+---+---------------------------------------------------+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... | [St, Anselm, 's, Church, is, a, Roman, Catholi... |
+---+---------------------------------------------------+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... | [Nymphargus, grandisonae, common, name, giant,... |
+---+---------------------------------------------------+---------------------------------------------------+

Can someone help me understand what happens and what I should do instead?

After that, I'd like to apply lemmatization, but that doesn't work in the current state:

lemmatizer = WordNetLemmatizer
df['tokenized_text'] = df['tokenized_text'].apply(lemmatizer.lemmatize)

yields:

TypeError: lemmatize() missing 1 required positional argument: 'word'

Thanks!

First Issue

stop_words = set(stopwords.words('english')) and ... if word not in [stop_words] : you created a set with just one element - the list of stopwords. No word equals this whole list, therefore stopwords are not removed. So it must be:
stop_words = stopwords.words('english')
df['tokenized_text'].apply(lambda words: [word for word in words if word not in stop_words + list(string.punctuation)])

Second Issue

lemmatizer = WordNetLemmatizer here you assign the class but you need to create an object of this class: lemmatizer = WordNetLemmatizer()

Third Issue

You can't lemmatize a whole list in one take, instead you need to lemmatize word by word: df['tokenized_text'].apply(lambda words: [lemmatizer.lemmatize(word) for word in words])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM