Ignore multiple commas when reading a CSV in pandas

Question

I'm trying to read multiple files whose names start with 'site_%', for example site_1 and site_a. Each file has data like:

Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php

I need two columns in my pandas df: Login_id and Web.

I get an error when I try to read records like record 2, where the Web value itself contains a comma.

df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)

I get the following error: ValueError: Columns must be same length as key.

Please let me know what I am doing wrong and what a good approach to this problem would be. Thanks

Answer 1

Solution 1: use split with the arguments n=1 and expand=True. With n=1 the string is split only at the first comma, so any later commas stay in the second column.

result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
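As a minimal end-to-end sketch of Solution 1, the snippet below reads the sample data from the question and then joins the split result back onto the original frame. It assumes the data is held in memory through io.StringIO instead of the file 'site_1', and the names raw and combined are only illustrative.

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of a file on disk.
raw = "Login_id, Web\n1,http://www.x1.com\n2,http://www.x1.com,as.php\n"

# sep='|' never matches here, so each line arrives as one string in a single column.
df = pd.read_csv(io.StringIO(raw), sep='|')

# n=1 splits only at the first comma; later commas stay inside the second part.
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

# Join the new columns onto the original frame and drop the unsplit column.
combined = pd.concat([df, result], axis=1).drop(columns='Login_id, Web')
print(combined)
```

Note that Login_id comes out as a string column here; convert it with astype(int) if you need numbers.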

EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:

result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)

This splits the field and uses the names of the matching groups to create columns with their content. The output is:

  Login_id                       URL
0        1         http://www.x1.com
1        2  http://www.x1.com,as.php

Solution 3: conventional version with regex. You could do something customized, e.g. with a regex:

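import re
# Group 1: everything before the first comma; group 2: the rest of the line.
sp_re = re.compile(r'([^,]*),(.*)')

# Each value becomes a (Login_id, URL) tuple; .str[i] picks element i of each tuple.
aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]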

The result on your example data is:

                Login_id, Web Login_id                       URL
0         1,http://www.x1.com        1         http://www.x1.com
1  2,http://www.x1.com,as.php        2  http://www.x1.com,as.php

Now you could drop the column 'Login_id, Web'.
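Since the question mentions several files whose names start with 'site_', here is a rough sketch of applying the first-comma split to all of them. The helpers read_site_file and read_all_sites are hypothetical names, the second sample file and its URL are made up for the demo, and the sketch assumes that no line contains '|' and that every file has at least one row with a comma.

```python
import glob
import os
import tempfile
import pandas as pd

def read_site_file(path):
    """Hypothetical helper: load one file and split each line at its first comma."""
    raw = pd.read_csv(path, sep='|')   # one column holding the whole line
    col = raw.columns[0]               # e.g. 'Login_id, Web'
    parts = raw[col].str.split(',', n=1, expand=True)
    parts.columns = ['Login_id', 'Web']
    return parts

def read_all_sites(directory):
    """Hypothetical helper: read every 'site_*' file in a directory into one frame."""
    paths = sorted(glob.glob(os.path.join(directory, 'site_*')))
    return pd.concat([read_site_file(p) for p in paths], ignore_index=True)

# Demo with two made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'site_1': "Login_id, Web\n1,http://www.x1.com\n2,http://www.x1.com,as.php\n",
        'site_a': "Login_id, Web\n3,http://www.x2.com\n",
    }
    for name, text in samples.items():
        with open(os.path.join(tmp, name), 'w') as fh:
            fh.write(text)
    all_sites = read_all_sites(tmp)

print(all_sites)
```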

Answer 1 (accepted), 2019-07-22 15:33:30