Reformatting a CSV based on a specific field using Python

Question

http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123
http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987
http://example.com/item/quad-bike-zenith.html,Zenith,"UP",+123456789123

I have this test.csv containing a few items scraped from a site, but the "number" field (the last column) contains duplicates. I need to remove every row whose number has already appeared in an earlier row. This is just an example file; in the real file some numbers are repeated more than 50 times.

import csv

with open('test.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')

    for column in csvreader:

        "Some logic here"

        if (column[3] == "+123456789123"):
            print (column[0])

            "or here"

I need the reformatted CSV to look like this:

http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123
http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987

Answer 1

#!/usr/bin/env python
# -*- coding: utf-8 -*-


import pandas as pd


def direct():
    seen = set()
    with open("test.csv") as infile, open("formatted.csv", 'w') as outfile:
        for line in infile:
            parts = line.rstrip().split(',')
            number = parts[-1]
            if number not in seen:
                seen.add(number)
                outfile.write(line)


def using_pandas():
    """Alternatively, use Pandas"""
    # dtype=str keeps the numbers as text, so the leading "+" is not lost
    df = pd.read_csv("test.csv", header=None, dtype=str)
    df = df.drop_duplicates(subset=[3])
    df.to_csv("formatted_pandas.csv", index=False, header=False)


def main():
    direct()
    using_pandas()


if __name__ == "__main__":
    main()
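
Both functions above keep the first row seen for each number. As a rough illustration of that keep-first behaviour, here is a minimal sketch that runs on the sample rows from the question without touching any files. The names SAMPLE and first_row_per_number are made up for this example, and it assumes the number is always in column index 3.

```python
import csv
import io

# Sample rows from the question, held in memory so no file is needed.
SAMPLE = '''http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123
http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987
http://example.com/item/quad-bike-zenith.html,Zenith,"UP",+123456789123
'''


def first_row_per_number(rows, key_index=3):
    """Return the first row seen for each value in column key_index."""
    first = {}
    for row in rows:
        # setdefault stores the row only if the key is not present yet.
        first.setdefault(row[key_index], row)
    # dicts preserve insertion order (Python 3.7+), so file order is kept.
    return list(first.values())


rows = list(csv.reader(io.StringIO(SAMPLE)))
kept = first_row_per_number(rows)
for row in kept:
    print(row[1], row[3])
```

The Zenith row is dropped because its number already appeared in the David row. A dict is used here instead of a set only so that the kept rows can be returned in one step; the set-based loop in direct() does the same job.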

Answer 2

This would filter out duplicates:

seen = set()
for line in csvreader:
    if line[3] in seen:
        continue
    seen.add(line[3])
    # write line to output file

And with the CSV read and write logic:

import csv

with open('test.csv', newline='') as fobj_in, open('test_clean.csv', 'w', newline='') as fobj_out:
    csv_reader = csv.reader(fobj_in, delimiter=',')
    csv_writer = csv.writer(fobj_out, delimiter=',')
    seen = set()  # numbers already written
    for line in csv_reader:
        if line[3] in seen:
            continue  # skip rows whose number was seen before
        seen.add(line[3])
        csv_writer.writerow(line)
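
One thing to be aware of: csv.writer with its default quoting writes Punjab without the surrounding quotes, while the desired output in the question keeps them. If that matters, one possible approach is to parse each line only to find the number and then write the original text back unchanged. This is a minimal sketch, not a drop-in replacement: dedupe_lines is a hypothetical helper, and it assumes every record sits on a single line (no newlines inside quoted fields).

```python
import csv


def dedupe_lines(lines, key_index=3):
    """Yield raw lines whose key_index field has not been seen before."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Parse only to find the key; the raw line is what gets yielded.
        fields = next(csv.reader([line]))
        key = fields[key_index]
        if key not in seen:
            seen.add(key)
            yield line


sample = [
    'http://example.com/item/all-atv-quad.html,David,"Punjab",+123456789123\n',
    'http://example.com/item/70cc-2014.html,Qubee,"Capital",+987654321987\n',
    'http://example.com/item/quad-bike-zenith.html,Zenith,"UP",+123456789123\n',
]
cleaned = list(dedupe_lines(sample))
print("".join(cleaned), end="")
```

With real files, the list could be replaced by an open file object and each yielded line written to the output with outfile.write(line).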

2 answers

Answer 1: score 2, accepted, 2015-12-05 09:44:50

Answer 2: score 1, 2015-12-05 09:43:54
