Copy a range of text from csv column in pandas and python

I have a CSV file that I have imported into pandas. It has almost 45 columns of data, and each column has more than 100 rows. I need to select only the range of text that starts with a date stamp and ends at the next date stamp.

Example:

<GMT2015-09-01 00:03:29GMT> Hi Rajiv<GMT2015-09-01 19:08:15GMT> Hi Ram <GMT2015-09-01 19:08:15GMT>

Given this structure, I need to extract only the first paragraph, from the first date stamp to the next one, into a new DataFrame.

I think you can split the data in the column Ticket Description on < and > and then select the wanted column of the resulting DataFrame with iloc. In the split result, column 0 is the empty string before the first <, column 1 is the first timestamp, and column 2 is the first message. Finally, you can strip the leading and trailing whitespace.

Note: this works well only if < and > appear solely at the start and end of each datetime.

import pandas as pd

df = pd.DataFrame({'Ticket Description':['<GMT2015-09-01 00:03:29GMT> Hi Rajiv<GMT2015-09-01 19:08:15GMT> Hi Ram <GMT2015-09-01 19:08:15GMT> ']})
print (df)
                                  Ticket Description
0  <GMT2015-09-01 00:03:29GMT> Hi Rajiv<GMT2015-0...

print (df['Ticket Description'].str.split(r'[<>]', expand=True).iloc[:, 2].str.strip())
0    Hi Rajiv
Name: 2, dtype: object
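
If the message text itself can contain < or >, the split above will cut in the wrong place. One possible alternative, shown here only as a sketch, is to match the whole timestamp with str.extract. The STAMP pattern is an assumption based on the single sample row in the question, and the second row of the frame is a made-up example with stray angle brackets.

```python
import pandas as pd

# Assumed stamp layout, inferred from the sample: <GMT2015-09-01 00:03:29GMT>
STAMP = r"<GMT\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}GMT>"

df = pd.DataFrame({
    "Ticket Description": [
        "<GMT2015-09-01 00:03:29GMT> Hi Rajiv<GMT2015-09-01 19:08:15GMT> Hi Ram <GMT2015-09-01 19:08:15GMT> ",
        # Hypothetical row with a stray '<' and '>' inside the message body
        "<GMT2015-09-02 10:00:00GMT> if a < b and b > c<GMT2015-09-02 11:00:00GMT> ok <GMT2015-09-02 11:00:00GMT>",
    ]
})

# Lazily capture everything between the first stamp and the next stamp.
pattern = STAMP + r"(.*?)" + STAMP
first_message = (
    df["Ticket Description"]
    .str.extract(pattern, expand=False)  # one capture group -> Series
    .str.strip()
)
print(first_message)
```

Rows that do not match the pattern come back as missing values. If a message can span several lines, passing flags=re.DOTALL to str.extract would probably be needed as well, since . does not match a newline by default.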

Regex and pandas apply should achieve what you want. I'm assuming you want only the text between the first and second timestamps. I have created a DataFrame with your message in two rows, identical except that the text of the second one starts with 2. In the regex, >(.+?)< matches one or more characters surrounded by > and <. The ? makes the match non-greedy, so it stops at the first < instead of running from the first timestamp all the way to the last one.

Sample code below:
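
To see what the ? changes, here is a small standalone comparison of the lazy and greedy forms of the pattern on the sample string from the question. It is only an illustration and uses nothing beyond the standard re module.

```python
import re

text = ("<GMT2015-09-01 00:03:29GMT> Hi Rajiv"
        "<GMT2015-09-01 19:08:15GMT> Hi Ram <GMT2015-09-01 19:08:15GMT>")

# Lazy: each match stops at the nearest following '<'.
lazy = re.findall(r">(.+?)<", text)

# Greedy: the single match runs on to the last '<' in the string.
greedy = re.findall(r">(.+)<", text)

print(lazy)    # [' Hi Rajiv', ' Hi Ram ']
print(greedy)  # [' Hi Rajiv<GMT2015-09-01 19:08:15GMT> Hi Ram ']
```

Taking element [0] of the lazy result gives the text between the first and second timestamps.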

import pandas as pd
import re

data = pd.DataFrame({"id":[1,2],"ticket_desc":[r"<GMT2015-09-01 00:03:29GMT> Hi Rajiv, As part of our job Request for your approval. Thanks <GMT2015-09-01 19:08:15GMT> Hi Ram, Request Approved Thanks <GMT2015-09-01 19:08:15GMT>.",r"<GMT2015-09-01 00:03:29GMT> 2Hi Rajiv, As part of our job Request for your approval. Thanks <GMT2015-09-01 19:08:15GMT> Hi Ram, Request Approved Thanks <GMT2015-09-01 19:08:15GMT>."]})
def finder(x):
    return re.findall(">(.+?)<",x)[0]
data["ticket_desc"] = data["ticket_desc"].apply(finder)
print(data["ticket_desc"][0])
print(data["ticket_desc"][1])

Output:

 Hi Rajiv, As part of our job Request for your approval. Thanks 
 2Hi Rajiv, As part of our job Request for your approval. Thanks 
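
One caveat with indexing the findall result by [0]: a cell with no > ... < pair raises IndexError, and an empty CSV cell (NaN) raises TypeError. A more defensive variant might look like the sketch below. The name safe_finder, the first_message column and the two extra rows are made up for illustration, and unlike the code above it also strips the surrounding whitespace.

```python
import re
import pandas as pd

FIRST_SEGMENT = re.compile(r">(.+?)<")

def safe_finder(value):
    """Return the text between the first '>' and the next '<', or None."""
    if not isinstance(value, str):  # e.g. NaN from an empty CSV cell
        return None
    match = FIRST_SEGMENT.search(value)
    return match.group(1).strip() if match else None

data = pd.DataFrame({
    "ticket_desc": [
        "<GMT2015-09-01 00:03:29GMT> Hi Rajiv <GMT2015-09-01 19:08:15GMT> Hi Ram <GMT2015-09-01 19:08:15GMT>",
        "no timestamps here",  # hypothetical row without any stamp
        None,                  # hypothetical empty cell
    ]
})

# Write to a new column so the original text is kept.
data["first_message"] = data["ticket_desc"].apply(safe_finder)
print(data["first_message"])
```

The first row yields Hi Rajiv, and the other two rows come back as missing values instead of raising.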
