
Extract words from a sentence in Python

I have a dataset in a text/csv file format. It has 2 columns, like this:

ID - TEXT
1 - this probability is 10-15% 
2 - approximately 20% probablity 
3 - 15% probability 

I am trying to use NLTK to extract the number from each row of the data where the keyword 'probability' is present.

This is what my code looks like:

import pandas as pd
import nltk
from nltk import sent_tokenize, word_tokenize

data_file = pd.read_excel(r'data_excel.xlsx',sheet_name = 'data')

df = pd.DataFrame(data_file, columns = ['ID','TEXT'])
keywords = ["probability"]

id_text = nltk.Text(str(df.ID).splitlines()) 
text_value = nltk.Text(str(df.TEXT).splitlines())

I want the output to look like this:

ID - Value 
1 - 10
2 - 20
3 - 15

If someone can nudge me in the right direction, it will be very helpful.

This code should work, or at least point you toward solving it. Here is the full code:

import re
import pandas as pd

data_file = pd.read_excel(r'data_excel.xlsx', sheet_name='data')
df = pd.DataFrame(data_file, columns=['ID', 'TEXT'])

# This should provide you with arrays of the IDs and the texts
ID = df.iloc[:, 0].values
TEXT = df.iloc[:, 1].values

# Loop through every row (starting at index 0) and collect
# all whole numbers that occur in the text of that row
rows = []
for i in range(len(ID)):
    number_list = re.findall(r'\b\d+\b', str(TEXT[i]))
    rows.append({'ID': ID[i], 'Num': number_list})

# Build the result frame in one go; assigning into an empty
# DataFrame with .iloc[i] would raise an IndexError
newDataFrame = pd.DataFrame(rows, columns=['ID', 'Num'])

print(newDataFrame)
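
The loop above collects every number found in a row, so a text such as "10-15%" yields two numbers rather than the single value asked for in the question. The following is only a minimal sketch of one way to get closer to the desired ID/Value output. It assumes that the first number in each text is the one wanted, that rows without the keyword should be dropped, and that the misspelling "probablity" from the sample data should still count as a match. The sample rows are copied from the question, and the names KEYWORD, NUMBER and first_number are made up for illustration.

```python
import re
import pandas as pd

# Sample rows copied from the question; in practice df would come
# from pd.read_excel or pd.read_csv as in the code above.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "TEXT": [
        "this probability is 10-15%",
        "approximately 20% probablity",
        "15% probability",
    ],
})

# Assumption: "i?" tolerates the "probablity" typo seen in the sample data.
KEYWORD = re.compile(r"probabi?lity", re.IGNORECASE)
NUMBER = re.compile(r"\d+")


def first_number(text):
    """Return the first integer in text if the keyword is present, else None."""
    text = str(text)
    if not KEYWORD.search(text):
        return None
    match = NUMBER.search(text)
    return int(match.group()) if match else None


result = pd.DataFrame({"ID": df["ID"], "Value": df["TEXT"].map(first_number)})

# Drop rows that had no keyword or no number, then store the values as integers.
result = result.dropna(subset=["Value"]).astype({"Value": int})

print(result)
```

If the upper bound of a range such as "10-15%" is wanted instead, NUMBER.findall(text) returns all matches and the last one could be taken. NLTK is not needed for this kind of pattern matching.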
