Extracting information from a text file through regex and/or Python
I'm working with a large number of files (~4 GB worth), each containing anywhere between 1 and 100 entries in the following format (everything between two *** lines is one entry):
***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
***
Type:status
Origin: @aaronesilvers text
Text: text
URL:
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621
Hashtags:
***
***
Type:status
Origin: @z_rose text
Text: text and stuff
URL:
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
Now I want to somehow import these into Pandas for mass analysis, but obviously I'd have to convert this into a format Pandas can handle. So I want to write a script that converts the above into a .csv looking something like this (User is the file title):
User Type Origin Text URL ID Time RetCount Favorite MentionedEntities Hashtags
4012987 status @z_rose yes yes Null 95482459084427264 Mon Jul 25 08:16:06 CDT 2011 0 false 20776334 Null
4012987 status @aaronsilvers text text Null 95481610861953024 Mon Jul 25 08:12:44 CDT 2011 0 false 2226621 Null
(Formatting isn't perfect, but hopefully you get the idea.)
I've had some code working that relied on the information regularly coming in segments of 12 lines, but sadly some of the files contain several blank lines inside some fields. What I'm basically looking to do is:
fields = ['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
starPair = 0
User = filename
read(file)
# determine if the current entry has ended
if stringRead == "***":
    if starPair == 0:
        starPair += 1
    elif starPair == 1:
        row += 1
        starPair = 0
# if the string read matches a column field
if stringRead in fields:
    while stringRead not in fields:  # until the next field has been found
        # extract all characters into the correct column field
However, the issue arises that some fields can themselves contain the words in fields[]. I can check for a \n character first, which would greatly reduce the number of faulty entries, but wouldn't eliminate them.
Can anyone point me in the right direction?
Thanks in advance!
You may use a combination of a regular expression and a dict comprehension:
import regex as re, pandas as pd

rx_parts = re.compile(r'^{}$(?s:.*?)^{}$'.format(re.escape('***'), re.escape('***')), re.MULTILINE)
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

result = ({m.group('key'): m.group('value')
           for m in rx_entry.finditer(part.group(0))}
          for part in rx_parts.finditer(your_string_here))

df = pd.DataFrame(result)
print(df)
Which yields
Favorite Hashtags ID MentionedEntities Origin \
0 false 95482459084427264 20776334 @z_rose yes
1 false 95481610861953024 2226621 @aaronesilvers text
2 false 95480980026040320 20776334 @z_rose text
RetCount Text Time Type URL
0 0 yes Mon Jul 25 08:16:06 CDT 2011 status
1 0 text Mon Jul 25 08:12:44 CDT 2011 status
2 0 text and stuff Mon Jul 25 08:10:14 CDT 2011 status
The idea is to decompose the string into parts which are enclosed by *** on both sides. We end up having a generator of dictionaries which we then feed into pandas.
Hints:
The code has not been tested with large amounts of data, especially not 4 GB. Additionally, you'll need the newer regex module for the expression to work.
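If installing the third-party regex module is a problem, the same split can be done with the stdlib re module by applying the DOTALL flag to the whole pattern instead of the scoped (?s:...) group. A self-contained sketch using one entry from the question's sample data:

```python
import re

sample = """***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***"""

# DOTALL lets .*? span newlines; MULTILINE anchors ^/$ at line boundaries
rx_parts = re.compile(r'^\*\*\*$(.*?)^\*\*\*$', re.MULTILINE | re.DOTALL)
rx_entry = re.compile(r'^(\w+):[ ]*(.*)$', re.MULTILINE)

# one dict per entry; empty fields (URL, Hashtags) become empty strings here
records = [dict(rx_entry.findall(part)) for part in rx_parts.findall(sample)]
print(records[0]['ID'])  # 95482459084427264
# pd.DataFrame(records) then builds the same frame as above
```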
Your code/pseudo-code doesn't look like Python, but since the question has the python tag, here is how I would do it. First, read the file into a string, then go through each field and build a regular expression to find the value after it, push the results into a 2D list, and then output that 2D list as a CSV. Also, your CSV looks more like a TSV (tab-separated instead of comma-separated).
import re
import csv

filename = '4012987'
User = filename

# read your file into a string
with open(filename, 'r') as myfile:
    data = myfile.read()

fields = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
csvTemplate = [['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']]

# for each field, use a regex to collect every entry's value
for n, field in enumerate(fields):
    matches = re.findall(field + r':\s?([^\n]*)\n+', data)
    # this runs only on the first pass, to grow the 2D list to the right number of rows
    while len(csvTemplate) <= len(matches):
        csvTemplate.append([None] * (len(fields) + 1))  # Null isn't a Python reserved word
    for e, m in enumerate(matches):
        if m != '':
            csvTemplate[e + 1][n + 1] = m.strip()

# set the User column
for i in range(1, len(csvTemplate)):
    csvTemplate[i][0] = User

# output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
with open("output.csv", "w", newline='') as f:  # Python 3: text mode with newline=''
    writer = csv.writer(f)
    writer.writerows(csvTemplate)
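Since the desired output in the question looks tab-separated, the same csv.writer can emit a TSV directly by passing delimiter='\t'; a sketch writing to an in-memory buffer instead of a file:

```python
import csv
import io

rows = [['User', 'Type', 'ID'],
        ['4012987', 'status', '95482459084427264']]

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')  # tab-separated instead of comma-separated
writer.writerows(rows)
print(buf.getvalue())
```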