
Python: extract keywords row by row from csv

I am trying to extract keywords row by row from a csv file and create a keywords field. Right now I am only able to extract keywords from the whole text at once. How do I get the keywords for each row/field?

Data:

id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"

Code: this searches the whole text, but not row by row. Do I need to add something else besides replace(r'\|', ' ')?

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

df = pd.read_csv('test-data.csv')
# print(df.head(5))

text_context = df['some_text'].str.lower().str.replace(r'\|', ' ').str.cat(sep=' ')  # joins all rows into one string -- should lower case go here?
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
print(keywords)

Desired final output:

id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"

Here is a concise way to add a new keyword column to your dataframe using pandas apply. To use apply, you first define a function (in our case get_keywords) that can then be applied to each row or each column.

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# I define the stop_words here so I don't do it every time in the function below
stop_words = stopwords.words('english')
# I've added the index_col='id' here to set your 'id' column as the index. This assumes that the 'id' is unique.
df = pd.read_csv('test-data.csv', index_col='id')  
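One aside, in case it applies to your setup: word_tokenize and stopwords rely on NLTK data packages. If they are not already present in your environment (that is an assumption on my part), a one-time download is needed:

import nltk

# One-time downloads of the NLTK data used below (skip if already installed)
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # the English stop word list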

Here we define our function, which will be applied to each row with df.apply in the next cell. You can see that this function, get_keywords, takes a row as its argument and returns a comma-separated string of keywords like the desired output you show above ("meaning,word,himalaya"). Within this function we lowercase, tokenize, filter out punctuation with isalpha(), filter out our stop_words, and join the keywords together to form the desired output.

# This function will be applied to each row in our Pandas Dataframe
# See the docs for df.apply at: 
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
def get_keywords(row):
    some_text = row['some_text']
    lowered = some_text.lower()
    tokens = nltk.tokenize.word_tokenize(lowered)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string
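If you want a quick sanity check before applying the function to the whole dataframe, and assuming the sample data above so that id 1 is in the index, you can call it on a single row first:

# Quick check on a single row (id 1 from the sample data above)
print(get_keywords(df.loc[1]))  # expected: meaning,word,himalaya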

Now that we have defined the function we are going to apply, we call df.apply(get_keywords, axis=1). This returns a Pandas Series (similar to a list). Since we want this Series to be part of our dataframe, we add it as a new column with df['keywords'] = df.apply(get_keywords, axis=1).

# applying the get_keywords function to our dataframe and saving the results
# as a new column in our dataframe called 'keywords'
# axis=1 means that we will apply get_keywords to each row and not each column
df['keywords'] = df.apply(get_keywords, axis=1)

Output: the dataframe after adding the 'keywords' column
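Since the original answer showed this result as a screenshot, here is a small sketch of how you might inspect it and, if you want a csv shaped like your desired final output above, write it back to disk (the output filename is just an example):

# Inspect the dataframe with the new 'keywords' column
print(df)

# Optionally write the result back out, including the 'id' index and the new column
df.to_csv('test-data-with-keywords.csv')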
