Lemmatization with spaCy takes forever

Question

I'm trying to lemmatize chat logs in a dataframe using spaCy. My code is:

nlp = spacy.load("es_core_news_sm")
df["text_lemma"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))

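I have about 600,000 rows, and the apply takes more than two hours to run. Is there a faster package or approach for lemmatization? (I need a solution that works for Spanish.)

I have only tried the spaCy package.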

Answer 1

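The slowdown comes from calling the spaCy pipeline once per row via nlp(). A faster way to process a large number of texts is to stream them through nlp.pipe(). When I tested this on 5,000 rows of dummy text, it ran about 3.874 times faster than the original method (~9.759 seconds vs. ~2.519 seconds). If needed, there are ways to improve on this further; see the spaCy optimization checklist I made.

Solution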

# Assume dataframe (df) already contains column "text" with text

# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")

# Process large text as a stream via `nlp.pipe()` and iterate over the results, extracting lemmas
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma"] = lemma_text_list
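As a rough sketch of the kind of further tuning mentioned above, nlp.pipe() also accepts batch_size, n_process, and disable arguments. The helper below is hypothetical (its name and defaults are not from the answer), and it assumes the lemmatizer in es_core_news_sm does not depend on the parser or ner components; it is worth comparing the output against the plain loop on a small sample before relying on it.

```python
def lemmatize_texts(nlp, texts, batch_size=1000, n_process=1, skip=("parser", "ner")):
    """Return one space-joined lemma string per input text, in input order.

    Hypothetical helper: `nlp` is an already loaded spaCy pipeline.
    `skip` names components to leave out while streaming (assumed to be
    unnecessary for lemmatization in this model).
    """
    docs = nlp.pipe(
        texts,
        batch_size=batch_size,  # number of texts buffered per batch
        n_process=n_process,    # values above 1 spread batches over worker processes
        disable=list(skip),     # components not run for this call
    )
    return [" ".join(token.lemma_ for token in doc) for doc in docs]


# Possible usage with the dataframe from the question:
#   nlp = spacy.load("es_core_news_sm")
#   df["text_lemma"] = lemmatize_texts(nlp, df["text"], n_process=2)
```

Whether n_process above 1 helps depends on the machine and on how long the texts are, since each worker process has its own startup cost.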

Full code for the timing test

import spacy
import pandas as pd
import time

# Random Spanish sentences
rand_es_sentences = [
    "Tus drafts influirán en la puntuación de las cartas según tu número de puntos DCI.",
    "Información facilitada por la División de Conferencias de la OMI en los cuestionarios enviados por la DCI.",
    "Oleg me ha dicho que tenías que decirme algo.",
    "Era como tú, muy buena con los ordenadores.",
    "Mas David tomó la fortaleza de Sion, que es la ciudad de David."]

# Duplicate sentences specified number of times
es_text = [sent for i in range(1000) for sent in rand_es_sentences]
# Create data-frame
df = pd.DataFrame({"text": es_text})
# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")


# Original method (very slow due to multiple calls to `nlp()`)
t0 = time.time()
df["text_lemma_1"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
t1 = time.time()
print("Total time: {}".format(t1-t0))  # ~9.759 seconds on 5000 rows


# Faster method processing rows as stream via `nlp.pipe()`
t0 = time.time()
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma_2"] = lemma_text_list
t1 = time.time()
print("Total time: {}".format(t1-t0))  # ~2.519 seconds on 5000 rows
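To put the two timings in perspective, the short calculation below recomputes the speedup and extrapolates both methods to the 600,000 rows from the question. It assumes run time scales linearly with the number of rows, which is only an approximation; the variable names are illustrative.

```python
# Timings reported above for 5,000 rows (seconds)
rows_tested = 5_000
t_apply = 9.759  # one nlp() call per row
t_pipe = 2.519   # streaming via nlp.pipe()

speedup = t_apply / t_pipe

# Assumption: run time grows linearly with the number of rows
target_rows = 600_000
scale = target_rows / rows_tested
projected_apply_s = t_apply * scale
projected_pipe_s = t_pipe * scale

print(f"speedup: {speedup:.3f}x")
print(f"projected apply(): {projected_apply_s / 60:.1f} min")
print(f"projected nlp.pipe(): {projected_pipe_s / 60:.1f} min")
```

Both projections come out well under the two hours reported in the question, so real figures clearly depend on text length and hardware; the ratio between the two methods is the more useful number.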

Answer 2

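The apply method can be slow on a dataset as large as yours because it applies the function to each row of the dataframe sequentially.

You can try parallelizing the lemmatization with the concurrent.futures module, which can reduce the run time. Here is an example of how you might use it: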

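import spacy
from concurrent.futures import ProcessPoolExecutor

# Loaded at module level so that every worker process has its own pipeline
nlp = spacy.load("es_core_news_sm")

def lemmatize_text(text):
    doc = nlp(text)
    return " ".join(w.lemma_ for w in doc)

# The guard is needed on platforms that spawn worker processes (e.g. Windows)
if __name__ == "__main__":
    # df is the dataframe from the question, with a "text" column
    with ProcessPoolExecutor() as executor:
        # map() yields results in input order, so no per-row lookup is needed;
        # chunksize sends rows to the workers in batches to cut overhead
        df["text_lemma"] = list(executor.map(lemmatize_text, df["text"], chunksize=1000))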

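This lemmatizes the texts in parallel across multiple processes, which can speed things up significantly.

Another option is to use a different package, such as NLTK or Pattern, which can be faster than spaCy for this task, especially if lemmatization is all you need.

Finally, you could consider using a pretrained model such as Flair or polyglot-neural for lemmatization; these models can be both fast and accurate on Spanish text.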

Answer 1
1 (accepted) 2023-01-23 22:00:52

Answer 2
0 2023-01-23 23:29:26
