Extracting information from a text file via regex and/or Python

Question

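I'm working with a large number of files (about 4 GB in total), each containing anywhere between 1 and 100 entries in the following format (an entry sits between two *** lines):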

***
Type:status
Origin: @z_rose yes
Text:  yes
URL: 
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***
***
Type:status
Origin: @aaronesilvers text
Text:  text
URL: 
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621 
Hashtags: 
***
***
Type:status
Origin: @z_rose text
Text:  text and stuff
URL: 
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334 
Hashtags: 
***

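Now I want to somehow import these into Pandas for mass analysis, but obviously I first have to convert them into a format Pandas can handle. So I want to write a script that converts the above into a .csv that looks something like this (User is the file title):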

User   Type    Origin              Text  URL    ID                Time                          RetCount  Favorite  MentionedEntities  Hashtags
4012987 status  @z_rose yes         yes   Null   95482459084427264  Mon Jul 25 08:16:06 CDT 2011  0           false  20776334            Null
4012987 status  @aaronsilvers text  text Null    95481610861953024   Mon Jul 25 08:12:44 CDT 2011  0           false   2226621            Null

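(The formatting isn't perfect, but hopefully you get the idea.)

I already have some code working that relies on every entry always having 12 segments of information, but sadly some of the files contain several blank lines inside some fields. What I'm basically trying to do is: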

fields[] =['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
starPair = 0;
User = filename;
read(file)
#Determine if the current entry has ended
if(stringRead=="***"){
    if(starPair == 0)
        starPair++;
    if(starPair == 1){
        row=row++;
        starPair = 0;
    }
}
#if string read matches column field
if(stringRead == fields[])
    while(strRead != fields[]) #until next field has been found
        #extract all characters into correct column field

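However, the problem is that some fields can themselves contain the words listed in fields[]. I could check for a \n character first, which would greatly reduce the number of faulty entries, but it would not eliminate them.

Can anyone point me in the right direction?

Thanks in advance!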

Answer 1

You can use a combination of regular expressions and a dict comprehension:

import regex as re, pandas as pd

rx_parts = re.compile(r'^{}$(?s:.*?)^{}$'.format(re.escape('***'), re.escape('***')), re.MULTILINE)
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

result = ({m.group('key'): m.group('value') 
            for m in rx_entry.finditer(part.group(0))}
            for part in rx_parts.finditer(your_string_here))

df = pd.DataFrame(result)
print(df)

Which yields

  Favorite Hashtags                 ID MentionedEntities               Origin  \
0    false           95482459084427264         20776334           @z_rose yes   
1    false           95481610861953024          2226621   @aaronesilvers text   
2    false           95480980026040320         20776334          @z_rose text   

  RetCount            Text                          Time    Type URL  
0        0             yes  Mon Jul 25 08:16:06 CDT 2011  status      
1        0            text  Mon Jul 25 08:12:44 CDT 2011  status      
2        0  text and stuff  Mon Jul 25 08:10:14 CDT 2011  status

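Explanation:

Split the string into separate parts, each surrounded by *** on both sides
Look for key-value pairs on each line
Put all the pairs into a dictionary

We end up with a generator of dictionaries, which we then feed into pandas.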
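The question also asks for a User column taken from the file name, which the snippet above does not add. A minimal sketch of one way to do that, assuming the same two patterns and using only the standard-library re module (scoped inline flags such as (?s:...) are available there from Python 3.6). The helper name entries_with_user is hypothetical, and the value pattern uses .* instead of .+ so that empty fields become None:

```python
import re

# Same idea as the patterns in the answer, written for the standard-library re module.
RX_PARTS = re.compile(r'^\*\*\*$(?s:.*?)^\*\*\*$', re.MULTILINE)
RX_ENTRY = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.*)$', re.MULTILINE)


def entries_with_user(text, user):
    """Yield one dict per ***-delimited entry, with a 'User' key added.

    Hypothetical helper: 'user' is assumed to be the file name.
    """
    for part in RX_PARTS.finditer(text):
        row = {'User': user}
        for m in RX_ENTRY.finditer(part.group(0)):
            # Empty values (e.g. "URL: ") are stored as None.
            row[m.group('key')] = m.group('value').strip() or None
        yield row
```

The resulting generator can be passed to pd.DataFrame in the same way as result above.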

Hints:

The code has not been tested with large amounts of data, let alone 4 GB. Also, you need the newer regex module for the expression to work.
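For large inputs, reading line by line avoids holding a whole file in memory. The following is only a sketch of that idea, not part of the answer above. It assumes that the *** lines come strictly in opening/closing pairs, that a field starts only at the beginning of a line with one of the known names followed by a colon, and that any other non-blank line continues the previous field. The names FIELDS, FIELD_RX and iter_entries are made up for illustration. A continuation line that itself begins with, say, "URL:" would still be misread as a new field.

```python
import re

# Field names taken from the question; anything else is treated as continuation text.
FIELDS = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount',
          'Favorite', 'MentionedEntities', 'Hashtags']
# A field only counts when its name sits at the very start of a line.
FIELD_RX = re.compile(r'^({}):[ ]*(.*)$'.format('|'.join(FIELDS)))


def iter_entries(lines):
    """Yield one dict per ***-delimited entry from an iterable of lines."""
    entry, current = None, None
    for raw in lines:
        line = raw.rstrip('\n')
        if line.strip() == '***':
            if entry is None:
                # Opening delimiter: start a fresh entry.
                entry, current = {}, None
            else:
                # Closing delimiter: hand the finished entry to the caller.
                yield entry
                entry, current = None, None
            continue
        if entry is None:
            continue  # text outside any entry is ignored
        m = FIELD_RX.match(line)
        if m:
            current = m.group(1)
            entry[current] = m.group(2).strip()
        elif current is not None and line.strip():
            # Continuation of a multi-line value; blank lines are dropped.
            entry[current] = (entry[current] + ' ' + line.strip()).strip()
```

Because an open file object is itself an iterable of lines, it can be passed directly, for example iter_entries(open(path, encoding='utf-8')), where path is whatever file is being processed.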

Answer 2

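Your code/pseudocode doesn't look like Python, but since you have the python tag here, this is how I would do it. First, read the file into a string, then loop over each field and build a regex to find the value that follows it, push the results into a 2D list, and then write that 2D list out to CSV. Also, your CSV looks more like a TSV (tab-separated rather than comma-separated).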

import re
import csv

filename='4012987'
User=filename

# read your file into a string
with open(filename, 'r') as myfile:
    data=myfile.read()

fields =['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
csvTemplate = [['User','Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']]

# for each field use regex to get the entry
for n,field in enumerate(fields):
  matches = re.findall(field + r':\s?([^\n]*)\n+', data)  # raw string so \s reaches the regex engine unchanged
  # this should run only the first time to fill your 2d list with the right amount of lists
  while len(csvTemplate)<=len(matches):
    csvTemplate.append([None]*(len(fields)+1)) # Null isn't a python reserved word
  for e,m in enumerate(matches):
    if m != '':
      csvTemplate[e+1][n+1]=m.strip()
# set the User column
for i in range(1,len(csvTemplate)):
  csvTemplate[i][0] = User
# output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
# on Python 3 the csv module expects a text-mode file opened with newline=""
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(csvTemplate)
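If tab-separated output is what is actually wanted, the standard csv module can write it directly. A small sketch, assuming the rows are dictionaries keyed by column name and that missing columns should appear as the literal text Null, as in the question's sample table. The function name write_tsv is hypothetical:

```python
import csv
import io


def write_tsv(rows, fieldnames, out):
    """Write dict rows to the text stream 'out' as tab-separated values."""
    writer = csv.DictWriter(
        out,
        fieldnames=fieldnames,
        delimiter='\t',       # tabs instead of commas
        restval='Null',       # placeholder for columns missing from a row
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(rows)


# Usage with an in-memory buffer; a file opened with newline='' works the same way.
buf = io.StringIO()
write_tsv([{'User': '4012987', 'Type': 'status'}], ['User', 'Type', 'URL'], buf)
```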


Solution 1
2 2017-05-31 17:06:29

Solution 2
1 Accepted 2017-05-31 15:13:00
