How to create multiple pandas dataframe columns based on multiple lines in a cell, across every rows?
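Hi, I'm trying to create multiple columns in my dataframe based on the multiple lines in each cell of the [comment] column. The source data is a .csv file.
Here is a sample of my dataset: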
+---------+-----------------------------------------+
| id | comment |
+---------+-----------------------------------------+
| 123ab12 | DATE: 2/3/21 10:23:42 AM CST |
| | STAGE: 1 |
| | SCORE: 2,321 |
| | NAME: Sally |
| | HOBBY: Swimming |
| | NOTES: But she doesn't like: sun, fish |
+---------+-----------------------------------------+
| 123ab12 | DATE: 4/3/21 8:15:20 AM CST |
| | STAGE: 1 |
| | SCORE: 500 |
| | NAME: Tom |
| | HOBBY: Running |
| | AGE: 26 |
| | NOTES: He needs new pair of sport shoes |
+---------+-----------------------------------------+
This is what I want to get:
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| id | date | stage | score | name | hobby | age | notes |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 2/3/21 10:23:42 AM CST | 1 | 2,321 | Sally | Swimming | | But she doesn't like: sun, fish |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 4/3/21 8:15:20 AM CST | 1 | 500 | Tom | Running | 26 | He needs new pair of sport shoes |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
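Notes:
- The AGE line is not always present.
- A colon (:) may appear twice in the NOTES part of the [comment] column, e.g. NOTES: bla bla bla: further sentence
- The ID can be repeated, and there are thousands of rows.
My initial idea was to use the newline \n before NOTES: as the column separator (but the AGE line sometimes seems to mess this up, or my brain isn't working properly...).
Any help is much appreciated. Thanks!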
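You can use str.extract with a regular expression containing named capturing groups to capture the extracted data directly into dataframe columns named after the groups (see the answer to the question "pandas split list into columns with regex").
You can use the fixed parts of the comment column (i.e. the labels and the newlines) as anchors and make the AGE: part optional.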
DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(?:\n)?NOTES:(?P<notes>[\s\S]+)
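Explanation:
Most fields follow the form ANCHOR: (?P<groupname>[\s\S]+)\n
- ANCHOR: - this is just your plain-text label, i.e. DATE:, STAGE:, etc.
- (?P<groupname> - this starts a named capturing group. <groupname> directly becomes the dataframe column name.
- [\s\S]+ - greedily matches any sequence of (at least one) characters, including newlines
For age, we need a few changes, because the AGE: anchor may or may not be present:
- [\s\S]+? - the last group before the AGE: anchor (hobby) is matched lazily, otherwise it would greedily swallow the whole AGE: part
- (?:AGE: )? - the AGE: anchor itself is wrapped in an optional non-capturing group, since it may or may not be present
- (?P<age>[\s\S]*?) - unlike the other groups, the named capturing group for age is allowed to be empty
- (?:\n)? - the trailing newline is of course optional as well and should not be captured
In summary, this finds a match in your strings regardless of whether the AGE: part is present or not (https://regex101.com/r/tn6ixo/2/, https://regex101.com/r/tn6ixo/1/).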
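To check the optional AGE: handling without pandas, here is a minimal sketch using Python's built-in re module. The pattern is the one described above; the names PATTERN, with_age and without_age are only illustrative, and the two strings are copied from the sample data.

```python
import re

# The pattern from the answer, as raw-string pieces that Python concatenates.
PATTERN = re.compile(
    r"DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)"
    r"\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n"
    r"(?:AGE: )?(?P<age>[\s\S]*?)(?:\n)?NOTES:(?P<notes>[\s\S]+)"
)

# Sample comment that contains an AGE: line.
with_age = (
    "DATE: 4/3/21 8:15:20 AM CST\nSTAGE: 1\nSCORE: 500\nNAME: Tom\n"
    "HOBBY: Running\nAGE: 26\nNOTES: He needs new pair of sport shoes"
)
# Sample comment without an AGE: line; NOTES contains a second colon.
without_age = (
    "DATE: 2/3/21 10:23:42 AM CST\nSTAGE: 1\nSCORE: 2,321\nNAME: Sally\n"
    "HOBBY: Swimming\nNOTES: But she doesn't like: sun, fish"
)

fields_with_age = PATTERN.search(with_age).groupdict()
fields_without_age = PATTERN.search(without_age).groupdict()

print(fields_with_age["age"])            # the captured age
print(repr(fields_without_age["age"]))   # empty string when AGE: is missing
print(repr(fields_without_age["notes"]))
```

Because the pattern has no space after NOTES:, the notes group keeps the leading space of the text; calling .str.strip() on that column afterwards is one way to drop it.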
Input CSV file (comments.csv):
id;comments
123ab12;"DATE: 2/3/21 10:23:42 AM CST
STAGE: 1
SCORE: 2,321
NAME: Sally
HOBBY: Swimming
NOTES: But she doesn't like: sun, fish"
123ab12;"DATE: 4/3/21 8:15:20 AM CST
STAGE: 1
SCORE: 500
NAME: Tom
HOBBY: Running
AGE: 26
NOTES: He needs new pair of sport shoes"
Python script:
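import pandas as pd

# Read the semicolon-separated file; the quoted multi-line cells are handled by the CSV parser
df = pd.read_csv('comments.csv', delimiter=';')
# Raw string, so that \s, \S and \n reach the regex engine unchanged; each named group becomes a column
ef = df['comments'].str.extract(r'DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(?:\n)?NOTES:(?P<notes>[\s\S]+)', expand=True)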
Result:
date stage score name hobby age notes
0 2/3/21 10:23:42 AM CST 1 2,321 Sally Swimming But she doesn't like: sun, fish
1 4/3/21 8:15:20 AM CST 1 500 Tom Running 26 He needs new pair of sport shoes
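The desired output in the question also contains the id column, which the extracted frame does not include. A minimal sketch of one way to attach it, relying on str.extract keeping the index of the original series; the inline DataFrame stands in for comments.csv and the name result is only illustrative:

```python
import pandas as pd

# Stand-in for the data read from comments.csv.
df = pd.DataFrame({
    "id": ["123ab12", "123ab12"],
    "comments": [
        "DATE: 2/3/21 10:23:42 AM CST\nSTAGE: 1\nSCORE: 2,321\nNAME: Sally\n"
        "HOBBY: Swimming\nNOTES: But she doesn't like: sun, fish",
        "DATE: 4/3/21 8:15:20 AM CST\nSTAGE: 1\nSCORE: 500\nNAME: Tom\n"
        "HOBBY: Running\nAGE: 26\nNOTES: He needs new pair of sport shoes",
    ],
})

pattern = (
    r"DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)"
    r"\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n"
    r"(?:AGE: )?(?P<age>[\s\S]*?)(?:\n)?NOTES:(?P<notes>[\s\S]+)"
)

ef = df["comments"].str.extract(pattern, expand=True)

# Join on the shared index so every id lines up with its extracted fields.
result = df[["id"]].join(ef)
print(result.columns.tolist())
```

Joining on the index rather than on id matters here, because the same id can occur in several rows.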
Note that this produces a dataframe in which all columns have dtype: object. You may want to convert some of the columns, e.g.
ef[['stage', 'age']] = ef[['stage', 'age']].apply(pd.to_numeric)
ef['score'] = ef['score'].str.replace(',', '').astype(int)
ef[['name', 'hobby', 'notes']] = ef[['name', 'hobby', 'notes']].astype('string')
ef['date'] = pd.to_datetime(ef['date'])
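Since age is an empty string for rows without an AGE: line, the converted column cannot be a plain integer column. The following is a small sketch of one possible variant, not part of the original answer, that turns empty ages into missing values and stores the rest in pandas' nullable Int64 dtype. The tiny ef frame is a stand-in for the extracted data.

```python
import pandas as pd

# Stand-in for the extracted frame; values are strings, age may be empty.
ef = pd.DataFrame({
    "stage": ["1", "1"],
    "score": ["2,321", "500"],
    "age": ["", "26"],
})

ef["stage"] = pd.to_numeric(ef["stage"])
# Remove the thousands separator before converting to int.
ef["score"] = ef["score"].str.replace(",", "", regex=False).astype(int)
# errors="coerce" turns unparsable values (here: empty strings) into NaN,
# and the nullable Int64 dtype keeps the remaining ages as integers.
ef["age"] = pd.to_numeric(ef["age"], errors="coerce").astype("Int64")

print(ef.dtypes)
```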
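Note that the latter command will not automatically recognize your time zone correctly, because CST is an ambiguous time zone abbreviation. Instead, you end up with naive timestamps.
To add your time zone information, you can use pytz: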
import pytz
ef['date'] = ef['date'].apply(lambda x: x.replace(tzinfo=pytz.timezone('America/Chicago')))
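As an alternative sketch, pandas can also attach a time zone to a whole column of naive timestamps at once with Series.dt.tz_localize. The example assumes the timestamps are wall-clock times in America/Chicago and that the dates are month-first; the dates series is a stand-in for the parsed date column.

```python
import pandas as pd

# Naive timestamps standing in for the parsed date column.
dates = pd.Series(pd.to_datetime(["2021-02-03 10:23:42", "2021-04-03 08:15:20"]))

# Interpret the naive values as local times in the given zone.
localized = dates.dt.tz_localize("America/Chicago")

print(localized)
```

tz_localize applies the zone's daylight saving rules, so under the month-first reading the April timestamp gets a UTC-5 offset rather than the UTC-6 of standard time.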