How to copy every row of an Excel (.csv) file that contains specific words into another csv file using Python?
I have to copy all the rows which contain a specific word into another csv file.
My file is in .csv format, and I want to copy all rows which contain the word "Canada" in one of the cells. I have tried the various methods given on the internet, but I am unable to copy my rows. My data contains more than 15,000 lines.
An example of my dataset:
tweets    date     area
dbcjhbc   12:4:19  us
cbhjc     3:3:18   germany
cwecewc   5:6:19   canada
cwec      23:4:19  us
wncwjwk   9:8:18   canada
The code is:
import csv

with open('twitter-1.csv', "r", encoding="utf8") as f:
    reader = csv.DictReader(f, delimiter=',')
    with open('output.csv', "w") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter=",")
        writer.writeheader()
        for row in reader:
            if row == 'Canada':
                writer.writerow(row)
But this code is not working, and I am getting the error:

Error: field larger than field limit (131072)
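For reference, the csv-module approach from the question can be made to work with two fixes: raise the field size limit (the default is 131072 characters), and compare a single column of the row rather than the whole row (`row` is a dict, so `row == 'Canada'` is never true). A minimal sketch, assuming the word lives in the `area` column; the helper name is illustrative:

```python
import csv

# allow oversized fields (the default limit of 131072 triggers the error above)
csv.field_size_limit(10 ** 7)

def copy_matching_rows(in_path, out_path, column, word):
    """Copy every row whose `column` value equals `word` into another csv file."""
    with open(in_path, 'r', encoding='utf8', newline='') as f:
        reader = csv.DictReader(f, delimiter=',')
        with open(out_path, 'w', encoding='utf8', newline='') as f_out:
            writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter=',')
            writer.writeheader()
            for row in reader:
                # row is a dict, so compare one column's value,
                # not the whole row, against the search word
                if row[column].strip().lower() == word.lower():
                    writer.writerow(row)

# e.g. copy_matching_rows('twitter-1.csv', 'output.csv', 'area', 'Canada')
```

Because the reader and writer are iterated row by row, this also only keeps one row in memory at a time.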
I know the question asks for a solution in Python, but I believe this task can be solved more easily with command-line tools.

One-liner using Bash:
grep 'canada' myFile.csv > outputfile.csv
You can do this even without the csv module.
# read file and split by newlines (get list of rows)
with open('input.csv', 'r') as f:
    rows = f.read().split('\n')

# loop over rows and keep those that contain 'canada'
rows_containing_keyword = [row for row in rows if 'canada' in row]

# create and write lines to output file
with open('output.csv', 'w+') as f:
    f.write('\n'.join(rows_containing_keyword))
Assuming your .csv data (twitter-1.csv) looks like this:
tweets,date,area
dbcjhbc,12:4:19,us
cbhjc,3:3:18,germany
cwecewc,5:6:19,canada
cwec,23:4:19,us
wncwjwk,9:8:18,canada
Using numpy:
import numpy as np

# import .csv data (skipping the header row)
data = np.genfromtxt('twitter-1.csv', delimiter=',', dtype=str, skip_header=1)
# select only rows where the 'area' column is 'canada'
data_canada = data[np.where(data[:, 2] == 'canada')]
# export the resulting data
np.savetxt("foo.csv", data_canada, delimiter=',', fmt='%s')
foo.csv will contain:
cwecewc,5:6:19,canada
wncwjwk,9:8:18,canada
If you want to search every entry (every column) for canada, you can use a list comprehension. Assume twitter-1.csv contains an occurrence of canada in the tweets column:
tweets,date,area
dbcjhbc,12:4:19,us
cbhjc,3:3:18,germany
cwecewc,5:6:19,canada
canada,23:4:19,us
wncwjwk,9:8:18,canada
This will return all rows with any occurrence of canada:
out = [i for i, v in enumerate(data) if 'canada' in v]
data_canada = data[out]
np.savetxt("foo.csv", data_canada, delimiter=',', fmt='%s')
Now, foo.csv will contain:
cwecewc,5:6:19,canada
canada,23:4:19,us
wncwjwk,9:8:18,canada
All solutions except the grep one (which is probably the fastest, if grep is available) load the entire .csv file into memory. Don't do that! You can stream the file and keep only one line in memory at a time.
with open('input.csv', 'r') as f_in, open('output.csv', 'w') as f_out:
    for line in f_in:
        if 'canada' in line:
            f_out.write(line)
NOTE: I don't actually have Python 3 on this computer, so there might be a typo in this code. But I'm confident it's more efficient on sufficiently large files than loading the entire file into memory before manipulating it. It would be interesting to see benchmarks.
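A rough benchmark sketch with the standard `timeit` module, comparing the streaming approach against the read-everything approach (file names are placeholders; real numbers will depend on file size and disk speed):

```python
import timeit

def stream_filter(in_path, out_path, word):
    # keeps only one line in memory at a time
    with open(in_path, 'r', encoding='utf8') as f_in, \
         open(out_path, 'w', encoding='utf8') as f_out:
        for line in f_in:
            if word in line:
                f_out.write(line)

def slurp_filter(in_path, out_path, word):
    # loads the whole file into memory first
    with open(in_path, 'r', encoding='utf8') as f_in:
        rows = f_in.read().split('\n')
    with open(out_path, 'w', encoding='utf8') as f_out:
        f_out.write('\n'.join(r for r in rows if word in r))

# e.g.:
# print(timeit.timeit(lambda: stream_filter('twitter-1.csv', 'out.csv', 'canada'), number=10))
# print(timeit.timeit(lambda: slurp_filter('twitter-1.csv', 'out.csv', 'canada'), number=10))
```

Note that the two variants differ slightly at the end of the file: the streaming version preserves each line's trailing newline, while the join-based version emits no final newline.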