Reading a CSV column based on the value in another column

Question

I need to parse a CSV file.

Input: a file and a name

Index   |   writer   |  year  |  words
  0     |   Philip   |  1994  | this is first row 
  1     |   Heinz    |  2000  | python is wonderful (new line) second line
  2     |   Thomas   |  1993  | i don't like this
  3     |   Heinz    |  1898  | this is another row
  .     |     .      |    .   |      .
  .     |     .      |    .   |      .
  N     |   Fritz    |  2014  | i hate man united

Output: a list of all the words entries that correspond to the name

l = ['python is wonderful second line', 'this is another row']

What have I tried?

import csv
import sys

class artist:
    def __init__(self, name, file):
        self.file = file 
        self.name = name
        self.list = [] 

    def extractText(self):
        with open(self.file, 'rb') as f:
            reader = csv.reader(f)
            temp = list(reader)
        k = len(temp)
        for i in range(1, k):
            s = temp[i]
            if s[1] == self.name:
                self.list.append(str(s[3]))


if __name__ == '__main__':
    # arguments
    inputFile = str(sys.argv[1])
    Heinz = artist('Heinz', inputFile)
    Heinz.extractText()
    print(Heinz.list)

The output is:

["python is wonderful\r\nsecond line", 'this is another row']

For cells whose words span multiple lines, how can I get rid of the \r\n? Also, the loop is extremely slow; can it be improved?

Answer 1

You can simply use pandas to get the list:

import pandas
df = pandas.read_csv('test1.csv')
index = df[df['writer'] == "Heinz"].index.tolist() # get the specific name's index
l = list()
for i in index:
    l.append(' '.join(df.iloc[i, 3].splitlines())) # get the cell, join its lines with a space, append to list.
l

Output:

['python is wonderful second line', 'this is another row']

Answer 2

Getting rid of the line breaks in s[3]: I suggest ' '.join(s[3].splitlines()). See the documentation for "".splitlines, and also "".translate.
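As a quick illustration of that suggestion, here is a minimal sketch. The sample string is the cell value from the question's output; the variable names are only illustrative.

```python
# Cell value as it appears in the question's output.
cell = "python is wonderful\r\nsecond line"

# splitlines() breaks on \r\n, \n and \r alike, so the exact
# line separator does not have to be spelled out.
flattened = ' '.join(cell.splitlines())
print(flattened)
```

Joining with a space rather than an empty string keeps the words on either side of the line break from running together.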

Improving the loop:

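def extractText(self):
    # 'rb' matches the question's Python 2 code; on Python 3 use open(self.file, newline='')
    with open(self.file, 'rb') as f:
        # iterate over the reader directly instead of building a list first
        for s in csv.reader(f):
            if s[1] == self.name:
                self.list.append(str(s[3]))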

This saves one pass over the data.

That said, do consider @Tiny.D's suggestion and give pandas a try.

Answer 3

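This should be at least somewhat faster, because the rows are processed while the file is being read, and it strips the extra carriage return and newline characters (if any).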

with open(self.file) as csv_fh:
    for n in csv.reader(csv_fh):
        if n[1] == self.name:
            self.list.append(n[3].replace('\r\n', ' '))
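For readers on Python 3, the same streaming idea might look like the sketch below. It is not the answerer's code: the function name words_by_writer is hypothetical, the sample data assumes a comma-separated file with the column order shown in the question, and it uses splitlines() instead of replace('\r\n', ' ') so that either line-ending style is handled.

```python
import csv
import io

def words_by_writer(csv_file, name):
    """Collect the 'words' cells of rows whose writer equals name."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    # Rows are consumed one at a time; no intermediate list of all rows.
    return [' '.join(row[3].splitlines())
            for row in reader if row[1] == name]

# In-memory stand-in for the file (assumed comma-separated layout).
sample = io.StringIO(
    'Index,writer,year,words\r\n'
    '0,Philip,1994,this is first row\r\n'
    '1,Heinz,2000,"python is wonderful\r\nsecond line"\r\n'
    '2,Thomas,1993,i don\'t like this\r\n'
    '3,Heinz,1898,this is another row\r\n',
    newline='')
print(words_by_writer(sample, 'Heinz'))
```

With a real file, the csv module documentation recommends opening it with open(path, newline='') so that line breaks inside quoted cells reach the reader unchanged.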

Answer 4

To collapse runs of whitespace you can use a regular expression, and to speed things up, try a list comprehension:

import re

def extractText(self):
    RE_WHITESPACE = re.compile(r'[ \t\r\n]+')
    with open(self.file, 'rU') as f:
        reader = csv.reader(f)

        # skip the first line
        next(reader)

        # put all of the words into a list if the artist matches
        self.list = [RE_WHITESPACE.sub(' ', s[3])
                     for s in reader if s[1] == self.name]
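To see what the regular expression does in isolation, here is a small sketch on a made-up string (the sample text is an assumption, chosen to contain doubled spaces and a line break surrounded by spaces):

```python
import re

# Same pattern as above: one or more spaces, tabs, CRs or LFs.
RE_WHITESPACE = re.compile(r'[ \t\r\n]+')

cell = "python  is wonderful \r\n second line"

# Each run of whitespace, however long, becomes a single space.
collapsed = RE_WHITESPACE.sub(' ', cell)
print(collapsed)
```

Unlike a plain replace of '\r\n', this also tidies up any spaces that surrounded the line break.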

4 answers

Answer 1
1 2017-05-07 23:27:13

Answer 2
1 2017-05-07 23:33:47

Answer 3
1 accepted 2017-05-07 23:37:33

Answer 4
0 2017-05-07 23:39:28
