
Do I need to remove brackets for tokenization? RegexpTokenizer

This is my first attempt at tokenization using nltk's RegexpTokenizer, which is required for an assignment. I'm not sure whether I should remove the brackets.

You are required to extract the tokens and append them to the list 'token'

...not sure if I even did this right.

import re
import codecs
from itertools import chain

import nltk
import pandas as pd  # needed for pd.read_csv below
from nltk.tokenize import RegexpTokenizer
from nltk.probability import *
from tqdm import tqdm
from nltk.corpus import stopwords
nltk.download('stopwords')

df_text = pd.read_csv(r"C:\Users\User\Downloads\JobPostings.csv")

lower = []
for item in df_text['job_description']:
    lower = [item]
    lower.append(item.lower())

tokenizer_test = RegexpTokenizer(r"\s+", gaps=True)
tokens_test = tokenizer_test.tokenize(item)

token = [tokens_test]
print(token)

Output is:

[['Data', 'Scientist,', '(Staff', 'or', 'Principal)', 'at', 'realtor.com', '(View', 'all', 'jobs)', 'Santa', 'Clara,', 'CA', 'At', 'realtor.com,', 'we', 'process', 'terabytes', 'of', 'data', 'every', 'day', 'and', 'transform', 'that', 'data', 'into', 'information', 'that', 'powers', 'decisions', 'for', 'millions', 'of', 'homebuyers,', 'renters,', 'dreamers,', 'and', 'real', 'estate', 'professionals.', 'We', 'aim', 'to', 'radically', 'simplify', 'home', 'buying/selling', 'and', 'help', 'more', 'people', 'achieve', 'the', 'American', 'dream', 'on', 'our', 'realtor.com', 'website', 'and', 'mobile', 'apps.', 'We', 'seek', 'a', 'highly', 'seasoned', 'Data', 'Scientist', 'to', 'join', 'our', 'data', 'science', 'program', 'and', 'help', 'develop', 'it', 'to', 'its', 'full', 'potential.', 'As', 'a', 'key', 'member', 'of', 'the', 'data', 'science', 'team,', 'you', 'will', 'be', 'responsible', 'for', 'the', 'development', 'of', 'innovative', 'concepts,', 'research,', 'predictive', 'modeling,', 'and', 'machine', 'learning', 'algorithms.', 'Responsibilities:', 'Perform', 'exploratory', 'analysis', 'on', "realtor.com's", 'wealth', 'of', 'data',
'including', 'consumer', 'web', 'and', 'mobile', 'behavior', 'and', 'North', "America's", 'most', 'comprehensive', 'and', 'up-to-date', 'listings', 'and', 'properties', 'data', 'set.', 'Effectively', 'partner', 'with', 'product', 'and', 'engineering', 'teams', 'to', 'build', 'new', 'data-driven', 'and', 'machine', 'learning-based', 'features', 'in', 'our', 'professional', 'software', 'and', 'lead', 'monetization', 'products', 'to', 'enable', 'real', 'state', 'professionals', 'to', 'be', 'more', 'productive', 'and', 'effective', 'in', 'serving', 'the', 'needs', 'of', 'home', 'shoppers.', 'Help', 'improve', 'the', 'scope', 'our', 'data', 'sets', 'by', 'identifying', 'new', 'data', 'collection', 'and', 'procurement', 'opportunities', 'on', 'an', 'ongoing', 'basis', 'Drive', 'A/B,', 'multivariate', 'tests', 'and', 'design', 'of', 'experiments', 'to', 'facilitate', 'testing', 'of', 'new', 'product', 'and', 'design', 'features,', 'with', 'a', 'focus', 'on', 'improving', 'engagement,', 'retention,', 'and', 'conversion.', 'Select,', 'apply,', 'and', 'tune', 'a', 'diverse', 'set', 'of', 'tools', 'to', 'coherently', 'solve', 'challenging', 'business', 'goals', 'Create', 'automated',
'learning', 'systems', 'that', 'gracefully', 'scale', 'to', 'increasing', 'complexity', 'and', 'expectation', 'Develop', 'predictive,', 'explanatory', 'models', 'and', 'machine', 'learning', 'algorithms', 'Generate', 'descriptive', 'visualizations', 'and', 'presentations', 'to', 'communicate', 'insights', 'Mentor', 'a', 'team', 'of', 'data', 'scientists', 'on', 'data', 'exploration,', 'machine', 'learning', 'and', 'developing', 'data-based', 'products', 'Work', 'with', 'a', 'sense', 'of', 'ownership', 'and', 'urgency,', 'advocate', 'for', 'experimentation', 'based,', 'agile', 'culture.', 'Requirements:', 'MS', 'or', 'Ph.D.', 'in', 'statistics,', 'mathematics,', 'operations', 'research,', 'computer', 'science,', 'quantitative', 'analysis,', 'economics', 'or', 'related', 'field', 'is', 'required.', '7+', 'years', 'of', 'relevant', 'experience', 'in', 'data', 'science,', 'data', 'analytics,', 'or', 'applied', 'statistics,', 'Experience', 'with', 'machine', 'learning,', 'NLP,', 'data', 'mining,', 'statistical', 'modeling', 'tools,', 'and', 'underlying', 'algorithms', 'Experienced', 'in', 'R,', 'Perl,', 'Python,', 'Spark,', 'or', 'other', 'languages', 'and', 'frameworks', 'appropriate', 'for', 'large', 'scale', 'analysis', 'of',
'numerical,', 'textual,', 'image,', 'and', 'video', 'data', 'Strong', 'skills', 'in', 'data', 'gathering,', 'massaging', 'and', 'featurization', 'Working', 'experience', 'with', 'relational', 'databases', 'and', 'SQL', 'Experience', 'with', 'experiment', 'design', 'and', 'A/B', 'and', 'multivariate', 'tests', 'Experience', 'and', 'proven', 'track', 'record', 'developing', 'online', 'data', 'products', 'Strong', 'creative', 'thinking', 'and', 'problem-solving', 'skills', 'Excellent', 'oral', 'and', 'written', 'communication', 'and', 'presentation', 'skills']]

Edit: I tried this instead... thoughts?

df_text_jd = df_text.job_description

lower = []
for item in df_text_jd:
    lower.append(item.lower().replace('(','').replace(')',''))

l = []  
for token in item:
    tokenizer_test = RegexpTokenizer(r'\s+', gaps=True)
    token = tokenizer_test.tokenize(item)

    l.append(token)

l

You can remove the brackets by modifying the line where you append the lowercased item to the lower list:

lower.append(item.lower().replace('(','').replace(')',''))
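To show how that line could fit into the whole loop, here is a minimal sketch rather than a definitive solution. It assumes the goal is one token list per job description, lowercased and with the round brackets stripped before splitting on whitespace. The `job_descriptions` list is a made-up stand-in for `df_text['job_description']`, and the list name `token` follows the assignment wording.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for df_text['job_description']; the real values
# would come from the CSV file in the question.
job_descriptions = [
    "Data Scientist, (Staff or Principal) at realtor.com (View all jobs)",
    "Experience with A/B and multivariate tests",
]

# Build the tokenizer once. With gaps=True the pattern describes the
# separators (runs of whitespace), not the tokens themselves.
tokenizer = RegexpTokenizer(r"\s+", gaps=True)

token = []
for item in job_descriptions:
    # Lowercase and drop round brackets before tokenizing.
    cleaned = item.lower().replace("(", "").replace(")", "")
    # Tokenize inside the loop so every description is processed,
    # not only the last one.
    token.append(tokenizer.tokenize(cleaned))

print(token[0])
# ['data', 'scientist,', 'staff', 'or', 'principal', 'at', 'realtor.com', 'view', 'all', 'jobs']
```

Note that splitting on whitespace leaves other punctuation attached to the tokens (for example 'scientist,'). Whether that is acceptable depends on what the assignment expects.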
