How to edit a .csv in Python to do NLP
Hello, I am not very familiar with programming and found Stack Overflow while researching my task. I want to do natural language processing on a .csv file that looks like this and has about 15,000 rows:
ID | Title | Body
----------------------------------------
1 | Who is Jack? | Jack is a teacher...
2 | Who is Sam? | Sam is a dog....
3 | Who is Sarah?| Sarah is a doctor...
4 | Who is Amy? | Amy is a wrestler...
I want to read the .csv file, do some basic NLP operations, and write the results back into a new file or the same one. After some research, Python and NLTK seem to be the technologies I need (I hope that's right). After tokenizing, I want my .csv file to look like this:
ID | Title | Body
-----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Sam" "is" "a" "dog"....
3 | "Who" "is" "Sarah" "?"| "Sarah" "is" "a" "doctor"...
4 | "Who" "is" "Amy" "?" | "Amy" "is" "a" "wrestler"...
What I have achieved after a day of research and putting pieces together looks like this:
ID | Title | Body
----------------------------------------------------------
1 | "Who" "is" "Jack" "?" | "Jack" "is" "a" "teacher"...
2 | "Who" "is" "Sam" "?" | "Jack" "is" "a" "teacher"...
3 | "Who" "is" "Sarah" "?"| "Jack" "is" "a" "teacher"...
4 | "Who" "is" "Amy" "?" | "Jack" "is" "a" "teacher"...
My first idea was to read a specific cell in the .csv, do an operation, and write it back to the same cell, and then somehow do that automatically on all rows. Obviously I managed to read a cell and tokenize it. But I could not manage to write it back into that specific cell, and I am far away from "do that automatically on all rows". I would appreciate some help if possible.

My code:
import csv
from nltk.tokenize import word_tokenize

############ Read CSV file ####################
########## ID , Title, Body ###################
line_number = 1    # line to read (need some kind of loop here)
column_number = 2  # column to read (need some kind of loop here)

with open('test10in.csv', 'rb') as f:
    reader = csv.reader(f)
    reader = list(reader)
    text = reader[line_number][column_number]
    stringtext = ''.join(text)  # tokenizing just works on strings
    tokenizedtext = (word_tokenize(stringtext))
    print(tokenizedtext)

############ Write back into same cell in new CSV file ######
with open('test11out.csv', 'wb') as g:
    writer = csv.writer(g)
    for row in reader:
        row[2] = tokenizedtext
        writer.writerow(row)
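To show where I think the missing loop belongs: in my code above, word_tokenize() runs once on a single cell, and then the same result is written into every row. Here is a sketch of what I imagine the loop should do, with str.split() standing in for word_tokenize() so it runs without NLTK (word_tokenize() would also split off the "?" as its own token), and io.StringIO standing in for the output file:

```python
import csv
import io

# A miniature of my file, already parsed into rows
rows = [
    ["ID", "Title", "Body"],
    ["1", "Who is Jack?", "Jack is a teacher"],
    ["2", "Who is Sam?", "Sam is a dog"],
]

tokenized = [rows[0]]             # keep the header row unchanged
for row in rows[1:]:              # the loop over all rows I was missing
    # tokenize Title and Body for THIS row, keep the ID as-is;
    # replace .split() with nltk.word_tokenize() for real tokenizing
    tokenized.append([row[0], row[1].split(), row[2].split()])

buf = io.StringIO()               # stands in for open('test11out.csv', 'w', newline='')
writer = csv.writer(buf)
writer.writerows(tokenized)
print(tokenized[1])
```

This way each row gets its own tokens instead of the first row's tokens repeated everywhere.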
I hope I asked the question correctly and someone can help me out.
The pandas library will make all of this much easier. pd.read_csv() will handle the input much more easily, and you can apply the same function to a column using pd.DataFrame.apply().

Here's a quick example of how the key parts you'll want work. In the .applymap() method, you can replace my lambda function with word_tokenize() to apply that across all elements instead.
In [58]: import pandas as pd
In [59]: pd.read_csv("test.csv")
Out[59]:
0 1
0 wrestler Amy dog is teacher dog dog is
1 is wrestler ? ? Sarah doctor teacher Jack
2 a ? Sam Sarah is dog Sam Sarah
3 Amy a a doctor Amy a Amy Jack
In [60]: df = pd.read_csv("test.csv")
In [61]: df.applymap(lambda x: x.split())
Out[61]:
0 1
0 [wrestler, Amy, dog, is] [teacher, dog, dog, is]
1 [is, wrestler, ?, ?] [Sarah, doctor, teacher, Jack]
2 [a, ?, Sam, Sarah] [is, dog, Sam, Sarah]
3 [Amy, a, a, doctor] [Amy, a, Amy, Jack]
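Putting that together for your actual columns and writing the result back out might look like this; a sketch that assumes your real file has ID/Title/Body headers, and that uses str.split() in place of word_tokenize() so it runs without NLTK:

```python
import io

import pandas as pd

# Hypothetical miniature of the question's file; io.StringIO stands in
# for a real path passed to pd.read_csv("test10in.csv")
csv_text = (
    "ID,Title,Body\n"
    "1,Who is Jack?,Jack is a teacher\n"
    "2,Who is Sam?,Sam is a dog\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Tokenize just the text columns; swap .split() for nltk.word_tokenize
for col in ["Title", "Body"]:
    df[col] = df[col].apply(lambda s: s.split())

out = df.to_csv(index=False)  # or df.to_csv("test11out.csv", index=False)
print(df["Title"].tolist())
```

Using .apply() on the two text columns leaves the ID column untouched, which is usually what you want here.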
Also see: http://pandas.pydata.org/pandas-docs/stable/basics.html#row-or-column-wise-function-application
You first need to parse your file and then process (tokenize, etc.) each field separately.

If your file really looks like your sample, I wouldn't call it a CSV. You could parse it with the csv module, which is specifically for reading all sorts of CSV files: add delimiter="|" to the arguments of csv.reader() to separate your rows into cells. (And don't open the file in binary mode.) But your file is easy enough to parse directly:
with open('test10in.csv', encoding="utf-8") as fp:  # Or whatever encoding is right
    content = fp.read()
lines = content.splitlines()
allrows = [ [ fld.strip() for fld in line.split("|") ] for line in lines ]
# Headers and data:
headers = allrows[0]
rows = allrows[2:]
You can then use nltk.word_tokenize() to tokenize each field of rows, and go on from there.
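For example, that last step over rows could look like this; a sketch with the sample pasted in as a string, and str.split() standing in for nltk.word_tokenize() so it runs without NLTK installed:

```python
# Miniature of the question's pipe-separated sample
content = """ID | Title | Body
----------------------------------------
1 | Who is Jack? | Jack is a teacher
2 | Who is Sam? | Sam is a dog"""

lines = content.splitlines()
allrows = [[fld.strip() for fld in line.split("|")] for line in lines]
headers = allrows[0]
rows = allrows[2:]   # skip the header and the dashed separator line

# Tokenize every field of every row; replace .split() with
# nltk.word_tokenize() for proper punctuation handling
tokenized = [[fld.split() for fld in row] for row in rows]
print(tokenized[0])
```

From here you can write tokenized back out with csv.writer, or feed it into whatever NLP step comes next.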