csv read columns corresponding to other columns values
I need to parse a CSV file.
Input: file + name
Index | writer | year | words
0 | Philip | 1994 | this is first row
1 | Heinz | 2000 | python is wonderful (new line) second line
2 | Thomas | 1993 | i don't like this
3 | Heinz | 1898 | this is another row
. | . | . | .
. | . | . | .
N | Fritz | 2014 | i hate man united
Output: a list of all the words corresponding to the name
l = ['python is wonderful second line', 'this is another row']
What have I tried?
import csv
import sys

class artist:
    def __init__(self, name, file):
        self.file = file
        self.name = name
        self.list = []

    def extractText(self):
        with open(self.file, 'rb') as f:
            reader = csv.reader(f)
            temp = list(reader)
        k = len(temp)
        for i in range(1, k):
            s = temp[i]
            if s[1] == self.name:
                self.list.append(str(s[3]))

if __name__ == '__main__':
    # arguments
    inputFile = str(sys.argv[1])
    Heinz = artist('Heinz', inputFile)
    Heinz.extractText()
    print(Heinz.list)
The output is:
["python is wonderful\r\nsecond line", 'this is another row']
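How do I get rid of the \r\n in cells whose words span multiple lines, and can the loop be improved, since it is extremely slow?
You can simply use pandas to get the list: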
import pandas
df = pandas.read_csv('test1.csv')
index = df[df['writer'] == "Heinz"].index.tolist() # get the specific name's index
l = list()
for i in index:
    l.append(df.iloc[i, 3].replace('\n','')) # get the cell and strip new line '\n', append to list.
l
Output:
['python is wonderful second line', 'this is another row']
Getting rid of the newlines in s[3]: I suggest using ' '.join(s[3].splitlines()). See the documentation for "".splitlines, and also "".translate.
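As a minimal sketch of the suggestion above, the snippet below applies both approaches to a string shaped like the multi-line cell from the question. The variable name `cell` and its sample value are illustrative assumptions, and the `translate` variant uses the Python 3 form, where the mapping is built with `str.maketrans`.

```python
# Sample cell value containing a Windows-style line break (illustrative).
cell = "python is wonderful\r\nsecond line"

# splitlines() splits on \r\n, \n and \r alike, so joining the pieces
# with a space replaces each line break with exactly one space.
joined = ' '.join(cell.splitlines())

# translate() maps single characters: here \r is deleted and \n becomes a space.
table = str.maketrans({'\r': None, '\n': ' '})
translated = cell.translate(table)

print(joined)
print(translated)
```

Both calls give the same result for this input. They differ on a lone `\r`: `splitlines()` treats it as a line break, while the translation table above simply deletes it.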
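Improving the loop:
def extractText(self):
    # 'rb' is the Python 2 idiom; on Python 3 use open(self.file, newline='')
    with open(self.file, 'rb') as f:
        # iterate over the reader directly instead of building a list first
        for s in csv.reader(f):
            if s[1] == self.name:
                self.list.append(str(s[3]))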
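This saves one pass over the data.
However, consider @Tiny.D's suggestion and give pandas a try.
This should be at least somewhat faster, because the rows are filtered while the file is being read, and any extra carriage returns and newlines are stripped along the way.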
with open(self.file) as csv_fh:
    for n in csv.reader(csv_fh):
        if n[1] == self.name:
            self.list.append(n[3].replace('\r\n', ' '))
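On Python 3, the `csv` module expects the file to be opened in text mode with `newline=''`, so that line breaks embedded in quoted fields reach the reader intact. The following is a sketch of the same streaming idea under that assumption. The function name `extract_words` and the temporary sample file are hypothetical and exist only to make the example self-contained.

```python
import csv
import os
import tempfile


def extract_words(path, name):
    """Return column 3 (words) of every row whose column 1 (writer) equals name."""
    words = []
    # newline='' lets the csv module handle line endings itself, so a
    # \r\n inside a quoted cell is passed through unchanged.
    with open(path, newline='') as csv_fh:
        for row in csv.reader(csv_fh):
            if row[1] == name:
                words.append(row[3].replace('\r\n', ' '))
    return words


# Build a small sample file shaped like the table in the question.
sample = (
    'Index,writer,year,words\r\n'
    '0,Philip,1994,this is first row\r\n'
    '1,Heinz,2000,"python is wonderful\r\nsecond line"\r\n'
    '3,Heinz,1898,this is another row\r\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='', delete=False) as tmp:
    tmp.write(sample)

result = extract_words(tmp.name, 'Heinz')
os.remove(tmp.name)
print(result)
```

The header row needs no special handling here, because its second column is `writer` and never matches the requested name.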
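To collapse runs of whitespace you can use a regular expression, and to speed things up, try a list comprehension in place of the loop:
import re

def extractText(self):
    RE_WHITESPACE = re.compile(r'[ \t\r\n]+')
    # 'rU' is the Python 2 universal-newlines mode; on Python 3 use open(self.file, newline='')
    with open(self.file, 'rU') as f:
        reader = csv.reader(f)
        # skip the first line
        next(reader)
        # put all of the words into a list if the artist matches
        self.list = [RE_WHITESPACE.sub(' ', s[3])
                     for s in reader if s[1] == self.name]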
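To illustrate what the regular expression does, here is a small sketch run on a made-up string. The helper names `collapse_regex` and `collapse_split` and the sample value `raw` are hypothetical. The second helper is a common alternative that relies on `str.split()` called without arguments, which splits on any run of whitespace and, unlike the regex, also drops leading and trailing whitespace.

```python
import re

RE_WHITESPACE = re.compile(r'[ \t\r\n]+')


def collapse_regex(text):
    # Replace every run of spaces, tabs, and line breaks with one space.
    return RE_WHITESPACE.sub(' ', text)


def collapse_split(text):
    # split() without arguments splits on runs of whitespace and discards
    # empty pieces, so joining with ' ' also trims both ends.
    return ' '.join(text.split())


raw = "python is wonderful \r\n\tsecond   line "
print(repr(collapse_regex(raw)))
print(repr(collapse_split(raw)))
```

Which one fits depends on whether a trailing space in the cell matters. The regex version keeps one, the split version removes it.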