How to use Pandas to read a CSV file with the delimiter existing in the last field?
I have raw data in the following format:
JobID,Publish,Expire,TitleAndDetail
7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical Looking,.....,....
7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher The job is,.....,.....
As you can see, the delimiter is a comma, but the last column is a chunk of text that itself contains commas. I am using pandas' read_csv function to read this CSV file. However, in the resulting DataFrame the text after the 4th comma in each line is lost.
raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       names=['JobID', 'Publish', 'Expire', 'TitleAndDetail'],
                       header=None
                       )
If I use the string.split() function, I can specify a maxsplit parameter, which lets me keep all the content in the last column even if it contains many commas. Is there similar functionality in Pandas?
So here's a bit of a hack you could try:
# squeeze=True was removed from read_csv in pandas 2.0; squeeze the result instead.
raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       sep="\a"
                       ).squeeze("columns")
This should give you a Series with one whole line per element, because the commas are no longer treated as separators.
Then you can do:
df = raw_data.str.split(",", n=3, expand=True)  # n=3 splits at most 3 times, giving 4 columns
df.columns = ['JobID', 'Publish', 'Expire', 'TitleAndDetail']
That should split each line into 4 columns and rename them.
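As a rough end-to-end check of this approach, here is a minimal sketch that runs the same steps on the two sample rows from the question. The in-memory buffer `sample` is only a stand-in for the real file, and it assumes the bell character "\a" never appears in the data.

```python
import io

import pandas as pd

# Sample rows copied from the question; the buffer stands in for the real file.
sample = io.StringIO(
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,"
    "Supply Teacher The job is,.....,.....\n"
)

# "\a" is assumed not to occur in the data, so every line is read as one field.
# The first line is consumed as the (single) column header.
lines = pd.read_csv(sample, sep="\a").squeeze("columns")

# n=3 allows at most 3 splits, so everything after the 3rd comma stays together.
df = lines.str.split(",", n=3, expand=True)
df.columns = ["JobID", "Publish", "Expire", "TitleAndDetail"]

print(df["TitleAndDetail"].tolist())
```

Note that every column comes back as a string with this method, so dates and IDs would still need to be converted afterwards.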
You can do it this way:
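import csv
import pandas as pd

with open("file.csv", "r") as fp:
    reader = csv.reader(fp, delimiter=",")
    # Keep the first three fields and glue the rest back into one field.
    rows = [x[:3] + [','.join(x[3:])] for x in reader]
df = pd.DataFrame(rows)
df.columns = df.iloc[0]            # the first row holds the header
df = df.reindex(df.index.drop(0))  # drop the header row from the data
print(df)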
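To see what the rejoining step produces without touching pandas, here is a small sketch using only the standard library on the sample rows from the question. The `sample` buffer and the `header`/`records` names are illustrative, not part of the answer above.

```python
import csv
import io

# Sample rows copied from the question; the buffer stands in for the real file.
sample = io.StringIO(
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,"
    "Supply Teacher The job is,.....,.....\n"
)

reader = csv.reader(sample, delimiter=",")
# csv.reader splits on every comma; fields 4 and beyond are joined back together.
rows = [fields[:3] + [",".join(fields[3:])] for fields in reader]

header, records = rows[0], rows[1:]
print(header)
print(records[0][3])
```

One caveat: because csv.reader also interprets quote characters, joining the pieces with "," reproduces the original text exactly only when the last field contains no quoting.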
Read the file manually and then create the dataframe:
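import pandas as pd

rows = []
with open('somefile.csv') as f:
    # Strip the newline so the last key is not 'TitleAndDetail\n'.
    keys = next(f).rstrip('\n').split(',')
    for line in f:
        rows.append(dict(zip(keys, line.rstrip('\n').split(',', 3))))
df = pd.DataFrame(rows)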
.split takes an optional parameter that limits the number of times it splits on the delimiter. Passing 3 means it splits on the first three commas only, so the commas in your last field are left alone:
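>>> s = '7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical Looking,.....,....'
>>> s.split(',', 3)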
['7428',
'17/12/2006 2:00:00 PM',
'28/01/2007 2:00:00 PM',
'Project Engineer - Mechanical Looking,.....,....']
Next, we create a dictionary with the keys from the header row and the values from the data rows:
>>> f = 'JobID,Publish,Expire,TitleAndDetail'.split(',')
>>> dict(zip(f, s.split(',', 3)))
{'JobID': '7428',
'Publish': '17/12/2006 2:00:00 PM',
'Expire': '28/01/2007 2:00:00 PM',
'TitleAndDetail': 'Project Engineer - Mechanical Looking,.....,....'}
Finally, we make a list of these dictionaries (in rows), and pass it as an argument to create our data frame object.
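Putting the three steps together, the parsing part can be sketched as a small helper that works on any open file-like object. The name `read_records` and its `n_columns` parameter are hypothetical, introduced only for this illustration; the sample rows come from the question.

```python
import io


def read_records(handle, n_columns=4):
    """Return one dict per data line, splitting on at most n_columns - 1 commas.

    Hypothetical helper for illustration; assumes the first line is the header.
    """
    keys = next(handle).rstrip("\r\n").split(",")
    records = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        values = line.split(",", n_columns - 1)
        records.append(dict(zip(keys, values)))
    return records


# Sample rows copied from the question; the buffer stands in for the real file.
sample = io.StringIO(
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,"
    "Supply Teacher The job is,.....,.....\n"
)

records = read_records(sample)
print(records[1]["TitleAndDetail"])
```

The resulting list can be passed to pd.DataFrame as in the answer above. This plain split does not understand CSV quoting, so it suits files like this one where only the last field contains extra commas.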