How would I extract & organize data from a txt file using Python?

First-time poster here! Love this site.

Situation: I have a flat file of data with various elements in it and I need to extract specific portions. I am a beginner in Python and wrote my script using regular expressions and other functions. Here is a sample of the data from the txt file I receive:


ACCESSORID = FS01234 TYPE = USER SIZE = 1024 BYTES NAME = JOHN SMITH FACILITY = TSO
DEPT ACID = D12RGRD DEPARTMENT = TRAINING
DIV ACID = NR DIVISION = NRE
CREATED = 01/17/05 00:00 LAST MOD = 11/16/21 10:42
PROFILES = VPSNRE P11NR00A
LAST USED = 12/02/21 09:03 CPU(SYSB) FAC(SUPRSESS) COUNT(06051)
XA SSN = 123456789 OWNER(JB112)
XA TSOACCT = 123456789 OWNER(JB112 )
XA TSOAUTH = JCL OWNER(JB112 )
XA TSOAUTH = RECOVER OWNER(JB112 )
XA TSOPROC = NR005PROC OWNER(JB112 )
----------- SEGMENT TSO
TRBA = NON-DISPLAY FIELD
TSOCOMMAND =
TSODEFPRFG =
TSOLACCT = 111111111
TSOLPROC = NR9923PROC
TSOLSIZE = 0004096
TSOOPT = MAIL,NONOTICES,NOOIDCARD
TSOUDATA = 0000
TSOUNIT = SYSDD
TUPT = NON-DISPLAY FIELD
----------- SEGMENT USER EMAIL ADDR = john.smith@nre.ago.com

The portions I need to extract are bolded. I know I need to show what I have done so far, so without posting my entire script, here is what I am doing to extract the ACCESSORID = FS01234 and NAME = JOHN SMITH portions.

def RemoveSpace():
    # Collapse every run of whitespace (including newlines) into a single space.
    with open("PROJECTFILE.txt", "r") as f:
        data1 = f.read()
    word = data1.split()
    s = ' '.join(word)
    with open("RemoveSpace.txt", "w") as f1:
        f1.write(s)
    print("Data Written Successfully")

RemoveSpace()


import re

f = open(r"C:\Users\user\Desktop\HR\PROJECTFILE\RemoveSpace.txt", "r").read()

TSS = []

# One chunk per record; the first chunk is whatever precedes the first ACCESSORID.
contents = re.split(r"ACCESSORID =", f)
contents.pop(0)

for item in contents:
    TSS_DICT = {}

    emplid = re.search(r"FS.*", item)

    if emplid is not None:
        s_emplid = re.search(r"FS\w*", emplid.group())
    else:
        s_emplid = None

    if s_emplid is not None:
        s_emplid = s_emplid.group()
    else:
        s_emplid = None

    TSS_DICT["EMPLOYEE ID"] = s_emplid

    name = re.search(r"NAME =.*", item)

    if name is not None:
        # This is the search that returns far more than the name.
        emp_name = re.search(r"[^NAME = ][^,]*", name.group())
    else:
        emp_name = None

    if emp_name is not None:
        emp_name = emp_name.group()
    else:
        emp_name = None

    TSS_DICT["EMPLOYEE NAME"] = emp_name

Question: Sorry for the lengthy post. I am having some difficulty getting John Smith on its own. My code keeps bringing in everything after John Smith, down to the very last line with the email address. My end goal is a CSV file with each bolded item as its own column. More directly, how would experts approach this data cleanup to simplify the process? If needed I can post the full code, but I didn't want to muddle this up any more than necessary.

I really appreciate any time and consideration you can spare.

JB

For practising your regex, I recommend using a website like RegExr. There you can paste the text that you want to match and play around with different expressions until you get the result you intend.

Assuming that you want to use this code for multiple files from the same organisation, and that the data is formatted the same way in each, you can simplify your code a lot.

Let's say we wanted to extract NAME = JOHN SMITH from the text file. We could write the following Python code to do this:

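import re

# text_to_search must already hold the file contents as a single string
pattern = r"NAME = \w+ \w+"  # "NAME = " followed by two words
name = re.findall(pattern, text_to_search)[0][7:]  # drop the 7-character "NAME = " prefix
print(name)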

pattern is our regex search expression. text_to_search is the contents of your text file, read into your Python script as a string. re.findall() returns a list of all matches, and [0] takes the first one. String slicing ( [7:] ) then removes the leading NAME = part, which is 7 characters long.

The above code would output the following:

JOHN SMITH
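
As a possible alternative to slicing, a capture group can pull out just the name, which also copes with names that are not exactly two words. The sketch below is only an illustration: it assumes that FACILITY always follows NAME, as it does in the sample line from the question, and the variable names are made up for the example. It also shows why NAME =.* over-matches: `.` stops only at a newline, so on a line (or a whitespace-collapsed file) it runs to the end.

```python
import re

# Header line copied from the sample data in the question.
sample = ("ACCESSORID = FS01234 TYPE = USER SIZE = 1024 BYTES "
          "NAME = JOHN SMITH FACILITY = TSO")

# Greedy search: ".*" keeps going until the end of the line.
greedy = re.search(r"NAME =.*", sample).group()
print(greedy)  # NAME = JOHN SMITH FACILITY = TSO

# Lazy capture: stop as soon as the next label is reached.
# Assumption: the label after NAME is always FACILITY.
name_pattern = re.compile(r"NAME\s*=\s*(.+?)\s+FACILITY\s*=")

match = name_pattern.search(sample)
name = match.group(1) if match else None
print(name)  # JOHN SMITH
```

If the field after NAME can vary in your files, the stop condition would need to be something more general than the literal word FACILITY.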

You should be able to apply the same principles to the other bold sections of your text file.

As for writing your extracted data out to a CSV file, it is probably worth reading a good tutorial on the subject, for example Reading and Writing CSV Files in Python. There are a few different ways of storing your information before writing it, such as lists versus dictionaries, and you can write CSV files either with Python's built-in tools or manually.
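
Applying the same idea to several fields at once might look roughly like the sketch below. The bold markup from the question is not visible here, so the list of fields (ID, name, department, email) is an assumption, as are the names FIELD_PATTERNS and parse_records. It assumes each record starts with ACCESSORID and that each value sits on the same line as its label.

```python
import re

# Hypothetical field list: one pattern per output column, each with one capture group.
FIELD_PATTERNS = {
    "EMPLOYEE ID": re.compile(r"ACCESSORID\s*=\s*(\w+)"),
    "EMPLOYEE NAME": re.compile(r"NAME\s*=\s*(.+?)\s+FACILITY\s*="),
    "DEPARTMENT": re.compile(r"DEPARTMENT\s*=\s*(\w+)"),
    "EMAIL": re.compile(r"EMAIL ADDR\s*=\s*(\S+)"),
}

def parse_records(text):
    """Return one dict per ACCESSORID record found in text."""
    # Everything before the first ACCESSORID is dropped by the [1:] slice.
    chunks = ["ACCESSORID =" + part for part in text.split("ACCESSORID =")[1:]]
    records = []
    for chunk in chunks:
        record = {}
        for column, pattern in FIELD_PATTERNS.items():
            match = pattern.search(chunk)
            # A missing field becomes None instead of raising an error.
            record[column] = match.group(1) if match else None
        records.append(record)
    return records

# Shortened sample built from lines in the question.
sample = """ACCESSORID = FS01234 TYPE = USER SIZE = 1024 BYTES NAME = JOHN SMITH FACILITY = TSO
DEPT ACID = D12RGRD DEPARTMENT = TRAINING
----------- SEGMENT USER EMAIL ADDR = john.smith@nre.ago.com
"""

records = parse_records(sample)
print(records)
```

Keeping the patterns in a dictionary means adding another column is a one-line change, and the repeated if/else checks from the question collapse into a single loop.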
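
For the built-in route, a minimal sketch with csv.DictWriter could look like this. The rows and column names are hypothetical and simply mirror the dictionary keys used in the question. It writes to an in-memory buffer so the result can be inspected. For a real file you would pass a handle from open("output.csv", "w", newline="") instead, where output.csv is a made-up name.

```python
import csv
import io

# Hypothetical rows, shaped like the dictionaries built in the question.
rows = [
    {"EMPLOYEE ID": "FS01234", "EMPLOYEE NAME": "JOHN SMITH"},
]

def write_rows(rows, handle):
    """Write a header line followed by one CSV line per dictionary."""
    fieldnames = ["EMPLOYEE ID", "EMPLOYEE NAME"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with newline="" lets the csv module control line endings, which avoids blank lines between rows on Windows.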
