
Python text extraction

I'm working on text extraction with Python, and the output is not what I want.

I have a text file containing information like this:

FN Clarivate Analytics Web of Science
VR 1.0

PT J

AU Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

AF Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

TI Examination of relationships among trait-like individual differences,

   state-like individual differences, and learning performance

SO JOURNAL OF APPLIED PSYCHOLOGY

CT 13th Annual Conference of the

   Society-for-Industrial-and-Organizational-Psychology

CY APR 24-26, 1998

CL DALLAS, TEXAS

SP Soc Ind & Org Psychol

RI Gully, Stanley/D-1302-2012

OI Gully, Stanley/0000-0003-4037-3883

SN 0021-9010

PD DEC

PY 2000

VL 85

IS 6

BP 835

EP 847

DI 10.1037//0021-9010.85.6.835

UT WOS:000165745400001

PM 11125649

ER

When I run my code, shown here:

import random
import sys

filepath = "data\jap_2000-2001-plain.txt"

with open(filepath) as f:
    articles = f.read().strip().split("\n")

articles_list = []

author = ""
title = ""
year = ""
doi = ""

for article in articles:
    if "AU" in article:
        author = article.split("#")[-1]
    if "TI" in article:
        title = article.split("#")[-1]
    if "PY" in article:
        year = article.split("#")[-1]
    if "DI" in article:
        doi = article.split("#")[-1]
    if article == "ER#":
        articles_list.append("{}, {}, {}, https://doi.org/{}".format(author, title, year, doi))
print("Oh hello sir, how many articles do you like to get?")
amount = input()

random_articles = random.sample(articles_list, k = int(amount))


for i in random_articles:
    print(i)
    print("\n")

exit = input('Please enter exit to exit: \n')
if exit in ['exit','Exit']:
    print("Goodbye sir!")
    sys.exit()

The extraction does not include data that appears after a line break. If I run this code, the author output looks like "AU Chen, G" and does not include the other names; the same happens with the title and the other fields.

My output looks like:

Chen, G. Examination of relationships among trait, 2000, doi.dx.10.1037//0021-9010.85.6.835

The desired output should be:

Chen, G., Gully, SM., Whiteman, JA., Kilcullen, RN., 2000, Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance, doi.dx.10.1037//0021-9010.85.6.835

But the extraction only includes the first line of each field.

Any suggestions?

You need to track which section you are in as you parse the file. There are cleaner ways to write the state machine, but as a quick and simple example, you could do something like the code below.

Basically, add all the lines for each section to a list for that section, then combine the lists and do whatever you need at the end. Note that I didn't test this; it is just pseudocode to show you the general idea.

authors = []
title = []
section = None

# `articles` is the list of lines read from the file, as in the question
for raw in articles:
    line = raw.strip()
    if not line:
        continue  # skip blank lines

    # A line that does not start with whitespace begins a new section
    if not raw[0].isspace():
        tag, line = line[:2], line[3:]
        if tag == "AU":
            section = authors
        elif tag == "TI":
            section = title
        # Other sections..
        else:
            section = None  # a tag we are not collecting

    # Add line to the current section
    if line and section is not None:
        section.append(line)

authors_str = ', '.join(authors)
title_str = ' '.join(title)
print(authors_str, title_str)
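The snippet above gathers everything into one pair of lists, so it suits a file holding a single record. Here is a minimal sketch of the same idea extended to several records, assuming each record ends with an ER line as in the question's sample. The function name collect_records and the default choice of tags are illustrative, not part of the original answer.

```python
def collect_records(lines, wanted=("AU", "TI", "PY", "DI")):
    """Group tagged lines into one dict per record (hypothetical helper)."""
    records = []
    current = {}      # tag -> list of text lines for the record being read
    section = None    # list that continuation lines are appended to

    for raw in lines:
        text = raw.strip()
        if not text:
            continue                      # skip blank lines
        if raw[0].isspace():              # continuation of the current section
            if section is not None:
                section.append(text)
            continue
        tag, rest = text[:2], text[3:]
        if tag == "ER":                   # end of record: store it and reset
            if current:
                records.append(current)
            current, section = {}, None
        elif tag in wanted:
            section = current.setdefault(tag, [])
            if rest:
                section.append(rest)
        else:
            section = None                # a tag we are not collecting

    return records


sample = [
    "AU Chen, G",
    "   Gully, SM",
    "TI Examination of relationships among trait-like individual differences,",
    "   state-like individual differences, and learning performance",
    "SO JOURNAL OF APPLIED PSYCHOLOGY",
    "PY 2000",
    "ER",
]
for record in collect_records(sample):
    print(", ".join(record["AU"]), "|", " ".join(record["TI"]), "|", record["PY"][0])
```

Resetting `current` on every ER line is what keeps the authors of one article from leaking into the next.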

Initial Understanding

Based on your example, I believe that:

  • The text is provided in lines.
  • The example text appears to have too many newlines, possibly an artifact of it being migrated from DOS/Windows. If so, either CRLF processing is needed, or alternate lines should be ignored.
  • The lines are divided into sections.
  • Each section is delimited by a two-letter uppercase tag in columns 0-1 of the first line in the section, and continues until the start of a new section.
  • Each line has either a tag or 2 blank spaces, followed by a blank space, in columns 0-2.
  • The artificial section delimited by the tag ER marks the end-of-record.
  • The ER section contains no usable text.

It may also be the case that:

  • Records are begun by the FN tag.
  • Any text encountered outside of a FN / ER pair can be ignored.
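The column layout described above can be sketched as a small helper. It is an illustration under those assumptions, not a tested parser for the full format, and split_tagged_line is a hypothetical name.

```python
def split_tagged_line(line):
    """Split one line into (tag, text) using the fixed-column layout.

    Returns (None, None) for a blank line and ("  ", text) for a
    continuation line, i.e. one whose first two columns are spaces.
    """
    line = line.rstrip("\r\n")
    if not line.strip():
        return None, None
    # Columns 0-1 hold the tag, column 2 is a separator, the rest is text.
    return line[:2], line[3:]


print(split_tagged_line("AU Chen, G"))       # ('AU', 'Chen, G')
print(split_tagged_line("   Gully, SM"))     # ('  ', 'Gully, SM')
print(split_tagged_line("ER"))               # ('ER', '')
print(split_tagged_line(""))                 # (None, None)
```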

Suggested Design

If this is true, I recommend you write a text processor using that logic:

  • Read lines.
  • Handle CR/LF processing, or skip alternate lines, or decide that the real text does not have these extra line breaks.
  • Use a state machine with an unknown number of states, the initial state being ER.
  • Special rule: ignore text in the ER state until a FN line is encountered.
  • General rule: when a tag is seen, end the previous state and begin a new state named after the seen tag. Any accumulated text is added to the record.
  • If no tag is seen, accumulate text in the previous tag.
  • Special rule: when the ER state is entered, add the accumulated record to the list of accumulated records.

At the end of this process, you will have a list of records, each holding the text accumulated under its various tags. You may then process the tags in various ways.

Something like this:

from warnings import warn

Debug = True

def read_lines_from(file):
    """Read and split lines from file. This is a separate function, instead
       of just using file.readlines(), in case extra work is needed like
       dos-to-unix conversion inside a unix environment.
    """
    with open(file) as f:
        text = f.read()
        lines = text.split('\n')

    return lines

def parse_file(file):
    """Parse file in format given by 
        https://stackoverflow.com/questions/54520331
    """
    lines = read_lines_from(file)
    state = 'ER'
    records = []
    current = None

    for line_no, line in enumerate(lines):
        tag, rest = line[:2], line[3:]

        if Debug:
            print(F"State: {state}, Tag: {tag}, Rest: {rest}")

        # Skip empty lines
        if tag == '':
            if Debug:
                print(F"Skip empty line at {line_no}")
            continue

        if tag == '  ':
            # Append text, except in ER state.
            if state != 'ER':
                if Debug:
                    print(F"Append text to {state}: {rest}")
                current[state].append(rest)
            continue

        # Found a tag. Process it.

        if tag == 'ER':
            if Debug:
                print("Tag 'ER'. Completed record:")
                print(current)

            records.append(current)
            current = None
            state = tag
            continue

        if tag == 'FN':
            if state != 'ER':
                warn(F"Found 'FN' tag without previous 'ER' at line {line_no}")
                if len(current.keys()):
                    warn(F"Previous record (FN:{current['FN']}) discarded.")

            if Debug:
                print("Tag 'FN'. Create empty record.")

            current = {}

        # All tags except ER get this:
        if Debug:
            print(F"Tag '{tag}'. Create list with rest: {rest}")

        current[tag] = [rest]
        state = tag

    return records

if __name__ == '__main__':
    records = parse_file('input.txt')
    print('Records =', records)
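To get from the parsed records to the kind of line the question asks for, one option is a small formatting step followed by random sampling. The sketch below assumes each record maps a tag to a list of text lines, which is the shape parse_file() builds; format_record is a hypothetical helper, and the field order and the https://doi.org/ prefix follow the question rather than any standard citation style.

```python
import random


def format_record(record):
    """Build one citation string from a parsed record (hypothetical helper)."""
    authors = ", ".join(record.get("AU", []))
    title = " ".join(record.get("TI", []))    # rejoin a title split over lines
    year = " ".join(record.get("PY", []))
    doi = "".join(record.get("DI", []))
    return "{}, {}, {}, https://doi.org/{}".format(authors, year, title, doi)


# A hand-written record in the same shape, so the sketch runs on its own.
example = {
    "AU": ["Chen, G", "Gully, SM", "Whiteman, JA", "Kilcullen, RN"],
    "TI": ["Examination of relationships among trait-like individual differences,",
           "state-like individual differences, and learning performance"],
    "PY": ["2000"],
    "DI": ["10.1037//0021-9010.85.6.835"],
}
citations = [format_record(r) for r in [example]]

# Pick up to k random citations, never more than are available.
k = 1
for citation in random.sample(citations, k=min(k, len(citations))):
    print(citation)
```

Capping `k` at `len(citations)` avoids the ValueError that random.sample raises when asked for more items than the list contains.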
