
Extracting information from textfile through regex and/or python

I'm working with a large amount of files (~4GB worth), which all contain anywhere between 1 and 100 entries in the following format (between two *** lines is one entry):

***
Type:status
Origin: @z_rose yes
Text:  yes
URL: 
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***
***
Type:status
Origin: @aaronesilvers text
Text:  text
URL: 
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621 
Hashtags: 
***
***
Type:status
Origin: @z_rose text
Text:  text and stuff
URL: 
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***

Now I want to somehow import these into Pandas for mass analysis, but obviously I'd have to convert this into a format Pandas can handle. So I want to write a script that converts the above into a .csv looking something like this (User is the file title):

User   Type    Origin              Text  URL    ID                Time                          RetCount  Favorite  MentionedEntities  Hashtags
4012987 status  @z_rose yes         yes   Null   95482459084427264  Mon Jul 25 08:16:06 CDT 2011  0           false  20776334            Null
4012987 status  @aaronsilvers text  text Null    95481610861953024   Mon Jul 25 08:12:44 CDT 2011  0           false   2226621            Null   

(Formatting isn't perfect but hopefully you get the idea)

I've had some code working on the basis of the information regularly coming in segments of 12, but sadly some of the files contain several blank lines in some fields. What I'm basically looking to do is:

fields[] =['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
starPair = 0;
User = filename;
read(file)
#Determine if the current entry has ended
if(stringRead=="***"){
    if(starPair == 0)
        starPair++;
    if(starPair == 1){
        row=row++;
        starPair = 0;
    }
}
#if string read matches column field
if(stringRead == fields[])
    while(strRead != fields[]) #until next field has been found
        #extract all characters into correct column field

However, the issue arises that some fields can contain the words in fields[]. I can check for a \n char first, which would greatly reduce the number of faulty entries, but wouldn't eliminate them.

Can anyone point me in the right direction?

Thanks in advance!

You may use a combination of regular expressions and a dict comprehension:

import regex as re, pandas as pd

rx_parts = re.compile(r'^{}$(?s:.*?)^{}$'.format(re.escape('***'), re.escape('***')), re.MULTILINE)
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

result = ({m.group('key'): m.group('value') 
            for m in rx_entry.finditer(part.group(0))}
            for part in rx_parts.finditer(your_string_here))

df = pd.DataFrame(result)
print(df)

Which yields

  Favorite Hashtags                 ID MentionedEntities               Origin  \
0    false           95482459084427264         20776334           @z_rose yes   
1    false           95481610861953024          2226621   @aaronesilvers text   
2    false           95480980026040320         20776334          @z_rose text   

  RetCount            Text                          Time    Type URL  
0        0             yes  Mon Jul 25 08:16:06 CDT 2011  status      
1        0            text  Mon Jul 25 08:12:44 CDT 2011  status      
2        0  text and stuff  Mon Jul 25 08:10:14 CDT 2011  status      


Explanation:

  1. Divide the string into different parts, surrounded by *** on both sides
  2. Look for key-value pairs in each line
  3. Put all pairs in a dict

We end up having a generator of dictionaries which we then feed into pandas.
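
The question also asks for a User column taken from the file title; a minimal follow-up (a sketch, assuming the df built above and a hypothetical file title of 4012987) would be:

# prepend the file title as the User column; '4012987' stands in for the real filename
df.insert(0, 'User', '4012987')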

Hints:

The code has not been tested with large amounts of data, especially not 4gb. Additionally, you'll need the newer regex module for the expression to work.
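
If memory becomes a problem at that scale, one option (a sketch, not a tested implementation; the data directory, the *.txt pattern and combined.csv are assumptions) is to process one file at a time and append each file's entries to a single CSV:

import regex as re
import pandas as pd
from pathlib import Path

rx_parts = re.compile(r'^{}$(?s:.*?)^{}$'.format(re.escape('***'), re.escape('***')), re.MULTILINE)
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

# fixed column order so the appended chunks line up; missing keys become empty cells
cols = ['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time',
        'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']

out = Path('combined.csv')
for i, path in enumerate(sorted(Path('data').glob('*.txt'))):
    text = path.read_text()
    rows = [{'User': path.stem,  # the file title doubles as the user id
             **{m.group('key'): m.group('value')
                for m in rx_entry.finditer(part.group(0))}}
            for part in rx_parts.finditer(text)]
    # append per file so only one file's entries are held in memory at a time
    pd.DataFrame(rows, columns=cols).to_csv(out, mode='a', header=(i == 0), index=False)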

Your code/pseudo-code doesn't look like python, but since you have the python tag, here is how I would do it. First, read the file into a string, then go through each field and build a regular expression to find the value after it, push the result into a 2d list, and then output that 2d list into a CSV. Also, your CSV looks more like a TSV (tab separated instead of comma separated).

import re
import csv

filename='4012987'
User=filename

# read your file into a string
with open(filename, 'r') as myfile:
    data=myfile.read()

fields =['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
csvTemplate = [['User','Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']]

# for each field use regex to get the entry
for n,field in enumerate(fields):
  matches = re.findall(field + r':\s?([^\n]*)\n+', data)  # raw string avoids escape warnings
  # this should run only the first time to fill your 2d list with the right amount of lists
  while len(csvTemplate)<=len(matches):
    csvTemplate.append([None]*(len(fields)+1)) # Null isn't a python reserved word
  for e,m in enumerate(matches):
    if m != '':
      csvTemplate[e+1][n+1]=m.strip()
# set the User column
for i in range(1,len(csvTemplate)):
  csvTemplate[i][0] = User
# output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(csvTemplate)
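
Since the end goal is analysis in Pandas, the generated file can then be loaded back directly (a short sketch; output.csv is the file written above, and unfilled fields come back as NaN):

import pandas as pd

# read the CSV produced above; empty cells show up as NaN
df = pd.read_csv("output.csv")
print(df.head())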
