[英]How to use square brackets as a quote character in Pandas.read_csv
Let's say I have a text file that looks like this: 假设我有一个看起来像这样的文本文件:
Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
What I'd like to be able to do is read that in with pandas.read_csv
, but the second row will throw an error. 我希望能够做的是用
pandas.read_csv
读取,但第二行将抛出错误。 Here is the code I'm currently using: 这是我目前使用的代码:
import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)
I've tried to set quotechar
to "[", but that obviously just eats up the lines until the next open bracket and adding a closing bracket results in a "string of length 2 found" error. 我试图将
quotechar
设置为“[”,但是这显然只是占用了行,直到下一个打开括号并添加一个quotechar
括号会导致“找到长度为2的字符串”错误。 Any insight would be greatly appreciated. 任何见解将不胜感激。 Thanks!
谢谢!
There were three primary solutions that were offered: 1) Give a long range of names to the data frame to allow all data to be read in and then post-process the data, 2) Find values in square brackets and put quotes around it, or 3) replace the first n number of commas with semicolons. 提供了三种主要解决方案:1)为数据框提供大量名称,以允许读入所有数据,然后对数据进行后处理,2)在方括号中查找值并在其周围加上引号,或者3)用分号替换前n个逗号。
Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data) because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? 总的来说,我认为选项3通常不是一个可行的解决方案(虽然对我的数据来说很好),因为a)如果我在一个包含逗号的列中引用了值,b)如果我的方括号列是不是最后一栏? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds, compared to solution 2, which ran in 3.02 seconds.
这留下了解决方案1和2.我认为解决方案2更具可读性,但解决方案1更有效,仅运行1.38秒,而解决方案2则运行3.02秒。 The tests were run on a text file containing 18 columns and more than 208,000 rows.
测试在包含18列和超过208,000行的文本文件上运行。
I think you can replace
first 3 occurence of ,
in each line of file to ;
我认为你可以
replace
前3 occurence的,
在文件中的每一行;
and then use parameter sep=";"
然后使用参数
sep=";"
in read_csv
: 在
read_csv
:
import pandas as pd
import io
with open('file2.csv', 'r') as f:
lines = f.readlines()
fo = io.StringIO()
fo.writelines(u"" + line.replace(',',';', 3) for line in lines)
fo.seek(0)
df = pd.read_csv(fo, sep=';')
print df
Item Date Time Location
0 1 01/01/2016 13:41 [45.2344:-78.25453]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242]
2 3 01/10/2016 01:27 [51.2344:-86.24432]
Or can try this complicated approach, because main problem is, separator ,
between values in lists
is same as separator of other column values. 或者可以尝试这种复杂的方法,因为主要问题是,分隔符
,
lists
中的值与其他列值的分隔符相同。
So you need post - processing: 所以你需要后期处理:
import pandas as pd
import io
temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
#after testing replace io.StringIO(temp) to filename
#estimated max number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print df
0 1 2 3 4 \
0 Item Date Time Location NaN
1 1 01/01/2016 13:41 [45.2344:-78.25453] NaN
2 2 01/03/2016 19:11 [43.3423:-79.23423 41.2342:-81242
3 3 01/10/2016 01:27 [51.2344:-86.24432] NaN
5 6 7 8 9
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 41.2342:-81242] NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
#remove column with all NaN
df = df.dropna(how='all', axis=1)
#first row get as columns names
df.columns = df.iloc[0,:]
#remove first row
df = df[1:]
#remove columns name
df.columns.name = None
#get position of column Location
print df.columns.get_loc('Location')
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print df1
Location NaN NaN
1 [45.2344:-78.25453] NaN NaN
2 [43.3423:-79.23423 41.2342:-81242 41.2342:-81242]
3 [51.2344:-86.24432] NaN NaN
#combine values to one column
df['Location'] = df1.apply( lambda x : ', '.join([e for e in x if isinstance(e, basestring)]), axis=1)
#subset of desired columns
print df[['Item','Date','Time','Location']]
Item Date Time Location
1 1 01/01/2016 13:41 [45.2344:-78.25453]
2 2 01/03/2016 19:11 [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3 3 01/10/2016 01:27 [51.2344:-86.24432]
I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a pretty simple preprocessing step: 我想不出一种方法来欺骗CSV解析器接受不同的打开/关闭引号字符,但你可以通过一个非常简单的预处理步骤逃脱:
import pandas as pd
import io
import re
# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]')
with open('path/to/file.txt', 'r') as fi:
# replaced brackets with quotes, pipe into file-like object
fo = io.StringIO()
fo.writelines(unicode(re.sub(location_regex, r'"\1"', line)) for line in fi)
# rewind file to the beginning
fo.seek(0)
# read transformed CSV into data frame
df = pd.read_csv(fo)
print df
This gives you a result like 这会给你一个结果
Date_Time Item Location
0 2016-01-01 13:41:00 1 [45.2344:-78.25453]
1 2016-01-03 19:11:00 2 [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00 3 [51.2344:-86.24432]
Edit If memory is not an issue, then you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer . 编辑如果内存不是问题,那么最好是批量预处理数据而不是逐行预处理,就像Max的答案中所做的那样。
# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]', flags=re.M)
with open('path/to/file.csv', 'r') as fi:
data = unicode(re.sub(location_regex, r'"\1"', fi.read()))
df = pd.read_csv(io.StringIO(data))
If you know ahead of time that the only brackets in the document are those surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify it even further (Max suggests a line-by-line version of this, but I think the iteration is unnecessary): 如果您提前知道文档中的唯一括号是围绕位置坐标的那些,并且保证它们是平衡的,那么您可以进一步简化它(Max建议逐行版本,但是我认为迭代是不必要的):
with open('/path/to/file.csv', 'r') as fi:
data = unicode(fi.read().replace('[', '"').replace(']', '"')
df = pd.read_csv(io.StringIO(data))
Below are the timing results I got with a 200k-row by 3-column dataset. 下面是我用200k-row 3列数据集得到的时序结果。 Each time is averaged over 10 trials.
每次平均超过10次试验。
We can use simple trick - quote balanced square brackets with double quotes: 我们可以使用简单的技巧 - 使用双引号引用平衡方括号:
import re
import six
import pandas as pd
data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""
print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))
pd.set_option('display.expand_frame_repr', False)
print(df)
Output: 输出:
----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
Item Date Time Location junk
0 1 01/01/2016 13:41 [45.2344:-78.25453] [aaaa,bbb]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242] [0,1,2,3]
2 3 01/10/2016 01:27 [51.2344:-86.24432] [12,13]
3 4 01/30/2016 05:55 [51.2344:-86.24432,41.2342:-81242,55.5555:-81242] [45,55,65]
UPDATE : if you are sure that all square brackets are balances, we don't have to use RegEx's: 更新 :如果您确定所有方括号都是余额,我们不必使用RegEx:
import io
import pandas as pd
with open('35948417.csv', 'r') as f:
fo = io.StringIO()
data = f.readlines()
fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
fo.seek(0)
df = pd.read_csv(fo)
print(df)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.