
Python - How to remove a sentence if it contains Spanish words

As the title states, I've got a dataset of strings that are each either English or Spanish. Prior to preprocessing, I want to remove any row that includes Spanish words.

Should I just take a Spanish corpus and loop through the entire dataset to see if any Spanish words exist in each sentence?

Any help would be much appreciated.

I think the library you'll want to use is langdetect. Here's some example code I just whipped up, along with its output.

from langdetect import detect

sentences = ["hello, how are you",
             "Hola cómo estás",
             "I've had a great day"]

# detect() returns a language code such as 'en' or 'es' for each string
for sentence in sentences:
    print(detect(sentence))  # prints 'en', 'es', 'en' (one per line)
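
To actually drop the Spanish rows rather than just print the codes, something like the sketch below might work. The helper name keep_non_spanish and its detect_language parameter are made up for illustration and are not part of langdetect. It assumes that a row counts as Spanish when the detector returns 'es', and that rows the detector cannot classify (langdetect raises an exception for text with no usable features, such as an empty string) should be kept. Note that langdetect is probabilistic, so very short or mixed-language sentences may be misclassified; its README suggests setting DetectorFactory.seed = 0 for reproducible results.

```python
def keep_non_spanish(sentences, detect_language=None):
    """Return the sentences whose detected language is not Spanish ('es').

    detect_language is a hypothetical hook: any callable that maps a string
    to a language code. It defaults to langdetect.detect.
    """
    if detect_language is None:
        # Imported lazily so a different detector can be plugged in instead.
        from langdetect import detect as detect_language

    kept = []
    for sentence in sentences:
        try:
            language = detect_language(sentence)
        except Exception:
            # Assumption: rows the detector cannot classify are kept.
            language = None
        if language != "es":
            kept.append(sentence)
    return kept
```

Calling keep_non_spanish(sentences) on the list above should leave only the rows detected as something other than Spanish. This filters on the language of the whole sentence, so a mostly English sentence containing one or two Spanish words will probably be kept; catching those would need a word-level check against a Spanish word list instead.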

Hope this helps. Happy to answer any follow-up questions.
