
Reading a .txt file having lines starting with special characters

I am trying to read every line of a file in which each line starts with a different special character. I don't want to read the special characters themselves, just the rest of the line that follows them. The records are laid out in the file as shown in a screenshot in the original post (not reproduced here). I then need to store each document as one row of a dataframe.

This is what I tried:

Doc = namedtuple('Doc', 'venue year authors title id references abstract')
docs = []

with open('/content/test.txt') as f:
    for l in f.readlines():
        ln = l.rstrip('\n')
        if ln.startswith('#c'):
            venue = ln[2:]
            #print(venue)
        if ln.startswith('#t'):
            year = ln[2:]
            #print(year)
        if ln.startswith('#@'):
            authors = []
            author = ln[2:]
            authors.append(author)
        if ln.startswith('#*'):
            title = ln[2:]
        if ln.startswith('#index'):
            id = ln[2:]
        if ln.startswith('#%'):
            references = []
            reference = ln[2:]
            references.append(reference)
        if ln.startswith('#!'):
            abstract = ln[2:]
            print(abstract)
            docs.append(Doc(venue, year, authors, title, id, references, abstract))

df = pd.DataFrame.from_records(
    docs,
    columns=['Venue', 'Year', 'Authors', 'Title', 'id', 'ListCitations', 'Abstract'])
df

Can someone help me store the fields that have several values, such as the authors and the references, in a list? Thank you.

Finally fixed all the issues.

I fixed the code. I guess you were missing some imports, so I added them. Before running the code you might also want to install the pandas package (for example with pip3 install pandas).

I am currently using Python 3.8; if you are using Python 2, you might want to drop the 3 from the pip command.

from collections import namedtuple
import pandas as pd

Doc = namedtuple('Doc', 'venue year authors title id references abstract')
docs = []

with open("content/test.txt", "r") as f:
    while True:
        # Read a line.
        ln = f.readline()
        # When readline returns an empty string, the file is fully read.
        if ln == "":
            break
        ln = ln.rstrip('\n')
        # A blank line is empty once its newline has been stripped.
        if ln == "":
            continue
        if ln.startswith('#c'):
            venue = ln[2:]
        if ln.startswith('#@'):
            # Authors may continue on unprefixed lines up to the '#t' line.
            authors = [ln[2:]]
            ln = f.readline()
            while ln and not ln.startswith('#t'):
                # Continuation lines are stored as read, trailing '\n' included.
                authors.append(ln)
                ln = f.readline()
            # ln now holds the '#t' line (or '' at end of file).
            ln = ln.rstrip('\n')
        if ln.startswith('#t'):
            year = ln[2:]
        if ln.startswith('#*'):
            title = ln[2:]
        if ln.startswith('#index'):
            id = ln[6:]
        if ln.startswith('#%'):
            # Collect the consecutive references up to the '#!' abstract line.
            references = [ln[2:]]
            ln = f.readline()
            while ln and not ln.startswith('#!'):
                # As with authors, later entries keep their trailing '\n'.
                references.append(ln[2:])
                ln = f.readline()
        if ln.startswith('#!'):
            # The abstract closes a record, so the document is stored here.
            abstract = ln[2:]
            docs.append(Doc(venue, year, authors, title, id, references, abstract))

df = pd.DataFrame.from_records(
    docs,
    columns=['Venue', 'Year', 'Authors', 'Title', 'id', 'ListCitations', 'Abstract'])

# Show every row and column when printing.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
print(df)
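As an alternative to the nested read loops, here is a minimal single-pass sketch. It is not the code from the answer above: the names parse_docs, new_record and SCALAR_PREFIXES are made up for illustration, and it assumes that every record starts with a '#*' title line and that unprefixed lines only ever continue the author list. Continuation lines for the abstract are not handled.

```python
from collections import namedtuple

Doc = namedtuple('Doc', 'venue year authors title id references abstract')

# Hypothetical lookup table: prefix -> single-valued field name.
SCALAR_PREFIXES = {'#c': 'venue', '#t': 'year', '#index': 'id', '#!': 'abstract'}


def new_record():
    """Return an empty record with fresh lists for the multi-valued fields."""
    return {'venue': '', 'year': '', 'authors': [], 'title': '',
            'id': '', 'references': [], 'abstract': ''}


def parse_docs(lines):
    """Parse an iterable of text lines into a list of Doc tuples."""
    docs = []
    rec = None          # record being built, or None before the first '#*'
    in_authors = False  # True while unprefixed lines continue the author list
    for raw in lines:
        ln = raw.strip()
        if not ln:
            continue
        if ln.startswith('#*'):
            # Assumption: '#*' opens a record, so store the previous one.
            if rec is not None:
                docs.append(Doc(**rec))
            rec = new_record()
            rec['title'] = ln[2:]
            in_authors = False
        elif rec is None:
            continue  # ignore anything before the first title line
        elif ln.startswith('#@'):
            rec['authors'].append(ln[2:].strip())
            in_authors = True
        elif ln.startswith('#%'):
            rec['references'].append(ln[2:].strip())
            in_authors = False
        elif ln.startswith('#'):
            for prefix, field in SCALAR_PREFIXES.items():
                if ln.startswith(prefix):
                    rec[field] = ln[len(prefix):].strip()
                    break
            in_authors = False
        elif in_authors:
            rec['authors'].append(ln)
    if rec is not None:
        docs.append(Doc(**rec))  # store the last record
    return docs


# Shortened version of the second test record.
sample = """#*Paper 2
#@Tareq
Shareq
Sameena
#t2016
#cSimulation Computation
#index436406
#%94584
#%282291
#!We aim to conquere the world
"""
docs = parse_docs(sample.splitlines())
print(docs[0].authors)     # ['Tareq', 'Shareq', 'Sameena']
print(docs[0].references)  # ['94584', '282291']
```

Because the lists are created once per record and only appended to afterwards, no seeking back in the file is needed. The resulting list should be usable with pd.DataFrame.from_records in the same way as docs above.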

Test input:

#*Information geometry of U-Boost and Bregman divergence
#@Nobotu Murata,
Takenouchi
Takafumi Kanamori
#t2004
#cNeural Computation
#index436405
#%94584
#%282290
#%605546
#%620759
#%564878
#!We aim at an extension of AdaBoost to U-Boost
#*Paper 2
#@Tareq 
Shareq
Sameena
#t2016
#cSimulation Computation
#index436406
#%94584
#%282291
#%605543
#%620754
#%323232232232323
#!We aim to conquere the world

Output:

                    Venue  Year  \
0      Neural Computation  2004
1  Simulation Computation  2016

                                             Authors  \
0  [Nobotu Murata,, Takenouchi\n, Takafumi Kanamo...
1                      [Tareq , Shareq\n, Sameena\n]

                                               Title      id  \
0  Information geometry of U-Boost and Bregman di...  436405
1                                            Paper 2  436406

                                       ListCitations  \
0    [94584, 282290\n, 605546\n, 620759\n, 564878\n]
1  [94584, 282291\n, 605543\n, 620754\n, 32323223...

                                          Abstract
0  We aim at an extension of AdaBoost to U-Boost\n
1                   We aim to conquere the world\n
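The list values in the output still carry trailing newlines and, for the first author, a trailing comma. A small helper along the following lines could tidy them into one string per cell. This is only a sketch; join_items is a made-up name and not part of the answer's code.

```python
def join_items(items, sep=", "):
    """Strip whitespace and stray trailing commas, then join into one string."""
    cleaned = (item.strip().rstrip(',') for item in items)
    return sep.join(c for c in cleaned if c)


authors = ['Nobotu Murata,', 'Takenouchi\n', 'Takafumi Kanamori\n']
print(join_items(authors))  # Nobotu Murata, Takenouchi, Takafumi Kanamori
```

Applied to a dataframe column, something like df['Authors'].apply(join_items) should give one clean string per row, assuming every cell in that column holds a list of strings.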

The code below works fine for the authors, but the references list still stores only one element:

Doc = namedtuple('Doc', 'venue year authors nba title id references abstract')
docs = []
with open('/content/test.txt') as f:
    for l in f.readlines():
        ln = l.rstrip('\n')
        if ln.startswith('#c'):
            venue = ln[2:]
            #print(venue)
        if ln.startswith('#t'):
            year = ln[2:]
            #print(year)
        if ln.startswith('#@'):
            authors = []
            author = ln[2:]
            authors.append(author)
        elif not ln.startswith('#') and not ln.startswith(' '):
            author = ln[0:]
            authors.append(author)
            #print(authors)
            nba = len(authors)
        if ln.startswith('#*'):
            title = ln[2:]
            #print(title)
        if ln.startswith('#!'):
            abstract = ln[2:]
        elif ln.startswith(' '):
            abstract += ln[1:]
            #print(abstract)
        if ln.startswith('#%'):
            references = []
            reference = ln[2:]
            references.append(reference)
        elif ln.startswith('#%'):
            reference = ln[2:]
            references.append(reference)
            #print(references)
        if ln.startswith('#index'):
            id = ln[6:]
            docs.append(Doc(venue, year, authors, nba, title, id, references, abstract))

df = pd.DataFrame.from_records(
    docs,
    columns=['Venue', 'Year', 'Authors', 'nba', 'Title', 'id', 'ListCitations', 'Abstract'])
df
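The likely reason only one reference survives is that references = [] runs on every '#%' line, so each new reference replaces the list built so far. The second '#%' test can never be reached, because it sits in the branch taken only when the line does not start with '#%'. A minimal sketch of the difference follows. The function names are made up, and it assumes each record begins with a '#*' line, as in the test input.

```python
def last_reference_only(lines):
    """Mimics the loop above: the list is rebuilt on every '#%' line."""
    references = []
    for ln in lines:
        if ln.startswith('#%'):
            references = []              # discards the earlier references
            references.append(ln[2:])
    return references


def references_per_record(lines):
    """Start one list per record (on '#*'), then only append to it."""
    records = []
    for ln in lines:
        if ln.startswith('#*'):
            records.append([])           # new record: one fresh list
        elif ln.startswith('#%') and records:
            records[-1].append(ln[2:])
    return records


lines = ['#*Paper 1', '#%94584', '#%282290', '#*Paper 2', '#%605546']
print(last_reference_only(lines))    # ['605546']
print(references_per_record(lines))  # [['94584', '282290'], ['605546']]
```

In the full loop, the same idea would mean creating the references list once, on the line that opens a record, and leaving only the append under the '#%' test.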
