Pandas read csv using regex separator

Question

I have been trying to read a custom csv file like this:

6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000

I tried a regex on https://regex101.com/ and the following code works:

This works:

# https://regex101.com/
import re

s = "6 Rotterdam NLD Zuid-Holland 593321 "
pat = r'(\d+)\s+([\D]+)\s(\d+)\s+'

m = re.match(pat, s)
m.groups() # ('6', 'Rotterdam NLD Zuid-Holland', '593321')

I got the correct answer, but when I applied the same pattern to pandas read_csv, it failed to work.

My attempt:

import numpy as np
import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  """;

sep = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
df = pd.read_csv(StringIO(s), sep=sep,engine='python')
df

I get a lot of NaNs. How can I get only 3 columns?

Column names are: ID CITY POPULATION

Similar questions:

How to read the custom table in pandas which has number string number number?

Answer 1

You used the pattern to match (extract) text, but in the pandas method the pattern is used to split each line.
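The difference can be seen with plain `re`, outside pandas. This is a minimal sketch on one row from the question; the simple `\s+` delimiter is only there for illustration.

```python
import re

line = "6 Rotterdam NLD Zuid-Holland 593321"

# Matching: the pattern describes the whole row, and the groups are the fields.
match_pat = r'(\d+)\s+(\D+)\s+(\d+)'
fields = re.match(match_pat, line).groups()
print(fields)  # ('6', 'Rotterdam NLD Zuid-Holland', '593321')

# Splitting: the pattern describes only the delimiter between fields.
pieces = re.split(r'\s+', line)
print(pieces)  # ['6', 'Rotterdam', 'NLD', 'Zuid-Holland', '593321']

# The separator from the question, used as a delimiter: re.split also emits
# every capture group for every match, so the row breaks into more than 3 pieces.
question_sep = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
question_pieces = re.split(question_sep, line)
print(len(question_pieces))
```

Assuming the python engine of read_csv splits each line with the compiled separator in a similar way, those extra pieces are what show up as the NaN columns.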

If there can only be 1, 2 or 3 digits at the start of each line, use

sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'

See the regex demo. You may extend it by adding more lookbehinds in the first non-capturing group.
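If IDs can have more digits, the lookbehind alternatives do not have to be typed by hand. Here is a small sketch that generates them; `build_sep` and `max_digits` are hypothetical names introduced for illustration, not part of pandas or of the pattern above.

```python
import re

def build_sep(max_digits):
    """Build the separator for IDs of 1..max_digits digits at line start."""
    # One fixed-width lookbehind per allowed ID length, since Python's re
    # does not accept variable-width lookbehinds such as (?<=^\d+).
    lookbehinds = "|".join(rf"(?<=^\d{{{n}}})" for n in range(1, max_digits + 1))
    return rf"(?:{lookbehinds})\s+|\s+(?=\S+\s*$)"

sep5 = build_sep(5)
print(re.split(sep5, "12345 Santiago de Chile CHL Santiago 4703954"))
# ['12345', 'Santiago de Chile CHL Santiago', '4703954']
```

The generated pattern writes the one-digit case as `\d{1}` rather than `\d`, which matches the same text.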

Details

(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+ - 1+ whitespaces ( \s+ ) that are preceded by 1 digit ( \d ), 2 digits ( \d{2} ), or 3 digits ( \d{3} ) at the start of the string ( ^ )
| - or
\s+(?=\S+\s*$) - 1+ whitespaces followed by 1+ non-whitespace chars and then any trailing 0+ whitespaces before the end of the string.
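Putting it together, here is a sketch of the full call on a few rows from the question. Passing `header=None` together with `names` is an assumption added here so that the first data row is not consumed as a header; the column names are the ones given in the question.

```python
import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
123 Chicago 10000 
222 New York 200000  """

# Non-capturing group and lookarounds only, so splitting adds no extra columns.
sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'

df = pd.read_csv(
    StringIO(s),
    sep=sep,
    engine='python',   # regex separators need the python engine
    header=None,       # assumption: the data has no header row
    names=['ID', 'CITY', 'POPULATION'],
)
print(df)
```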

Answer 2

Just to provide an alternative solution without regular expressions:

You could also parse the text file in plain Python. In some cases this may be easier to maintain than a rather complex regular expression.

For this specific format we know that the first and the last number in every row have a special meaning. So I would use split and rsplit to pick them out.

import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  """

data = []
for line in StringIO(s):
    line = line.strip()
    if not line:
        continue
    id_value, line = line.split(" ", 1)
    city, population = line.rsplit(" ", 1)

    data.append((id_value, city, population))

df = pd.DataFrame(data, columns=["id", "city", "population"])
df["id"] = pd.to_numeric(df["id"])
df["population"] = pd.to_numeric(df["population"])
print(df)

I did not do any speed measurements. Depending on the file size, however, speed might not be an issue at all. But even if it is, I would use this script to preprocess the data once, so that a regular pd.read_csv call works without additional arguments.
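The one-time preprocessing step could look like the sketch below. It reuses the split/rsplit logic and writes regular comma-separated text with the standard `csv` module. The function name `to_csv_text` is hypothetical, and an in-memory buffer stands in for a real output file.

```python
import csv
import pandas as pd
from io import StringIO

def to_csv_text(raw_text):
    """Convert 'id city... population' rows into regular CSV text."""
    out = StringIO()
    writer = csv.writer(out)  # quotes a field if a city ever contains a comma
    writer.writerow(["id", "city", "population"])
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        id_value, rest = line.split(" ", 1)      # first token is the id
        city, population = rest.rsplit(" ", 1)   # last token is the population
        writer.writerow([id_value, city.strip(), population])
    return out.getvalue()

raw = """6 Rotterdam NLD Zuid-Holland 593321 
123 Chicago 10000 
222 New York 200000  """

csv_text = to_csv_text(raw)
df = pd.read_csv(StringIO(csv_text))  # plain call, no extra arguments
print(df)
```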
