Python text extraction

Question

I'm working on text extraction with Python. The output is not what I want.

I have a text file containing information like this:

FN Clarivate Analytics Web of Science
VR 1.0

PT J

AU Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

AF Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

TI Examination of relationships among trait-like individual differences,

   state-like individual differences, and learning performance

SO JOURNAL OF APPLIED PSYCHOLOGY

CT 13th Annual Conference of the

   Society-for-Industrial-and-Organizational-Psychology

CY APR 24-26, 1998

CL DALLAS, TEXAS

SP Soc Ind & Org Psychol

RI Gully, Stanley/D-1302-2012

OI Gully, Stanley/0000-0003-4037-3883

SN 0021-9010

PD DEC

PY 2000

VL 85

IS 6

BP 835

EP 847

DI 10.1037//0021-9010.85.6.835

UT WOS:000165745400001

PM 11125649

ER

and when I use my code like this:

import random
import sys

filepath = "data\jap_2000-2001-plain.txt"

with open(filepath) as f:
    articles = f.read().strip().split("\n")

articles_list = []

author = ""
title = ""
year = ""
doi = ""

for article in articles:
    if "AU" in article:
        author = article.split("#")[-1]
    if "TI" in article:
        title = article.split("#")[-1]
    if "PY" in article:
        year = article.split("#")[-1]
    if "DI" in article:
        doi = article.split("#")[-1]
    if article == "ER#":
        articles_list.append("{}, {}, {}, https://doi.org/{}".format(author, title, year, doi))
print("Oh hello sir, how many articles do you like to get?")
amount = input()

random_articles = random.sample(articles_list, k = int(amount))


for i in random_articles:
    print(i)
    print("\n")

exit = input('Please enter exit to exit: \n')
if exit in ['exit','Exit']:
    print("Goodbye sir!")
    sys.exit()

The extraction does not include data that appears after a line break. If I run this code, the output looks like "AU Chen, G" and does not include the other names; the same happens with the title and the other fields.

My output looks like:

Chen, G. Examination of relationships among trait, 2000, doi.dx.10.1037//0021-9010.85.6.835

The desired output should be:

Chen, G., Gully, SM., Whiteman, JA., Kilcullen, RN., 2000, Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance, doi.dx.10.1037//0021-9010.85.6.835

but the extraction only includes the first line of each field.

Any suggestions?

Answer 1

You need to track which section you are in as you parse the file. There are cleaner ways to write the state machine, but as a quick and simple example, you could do something like the code below.

Basically, add all the lines for each section to a list for that section, then combine the lists and do whatever you need at the end. Note that I didn't test this; it is just pseudo-code to show you the general idea.

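authors = []
title = []
section = None

# `articles` is the list of lines read from the file, as in the question
for line in articles:
    # Check for the start of a new section and select the right list to add to.
    # Tags sit in the first two columns; continuation lines start with spaces.
    if line.startswith("AU "):
        section = authors
    elif line.startswith("TI "):
        section = title
    # Other sections..
    elif line[:2].strip():
        # Any other tag ends the current section
        section = None

    # Drop the tag (or indent) columns, then add the text to the current section
    text = line[3:].strip()
    if text and section is not None:
        section.append(text)

authors_str = ', '.join(authors)
title_str = ' '.join(title)
print(authors_str, title_str)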
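Extending the same idea to whole records might look like the sketch below. It is only an illustration under some assumptions: the names `SAMPLE`, `WANTED`, `parse_records` and `format_citation` are made up here, the sample is a shortened copy of the record from the question without the extra blank lines, and the output format (a period after each author, a `https://doi.org/` prefix) is a guess at what the asker wants.

```python
# Shortened copy of the record from the question (assumed layout:
# tag in columns 0-1, text from column 3, continuation lines indented).
SAMPLE = """\
FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Chen, G
   Gully, SM
   Whiteman, JA
   Kilcullen, RN
TI Examination of relationships among trait-like individual differences,
   state-like individual differences, and learning performance
PY 2000
DI 10.1037//0021-9010.85.6.835
ER
"""

WANTED = ("AU", "TI", "PY", "DI")  # tags kept in this sketch


def parse_records(lines):
    """Collect the wanted sections of each record; 'ER' closes a record."""
    records = []
    fields = {t: [] for t in WANTED}
    section = None
    for line in lines:
        tag = line[:2]
        if tag == "ER":
            # End of record: store it and start a fresh one
            records.append(fields)
            fields = {t: [] for t in WANTED}
            section = None
        elif tag.strip():
            # A new tag starts here; unwanted tags switch collection off
            section = fields.get(tag)
        # First line of a section or a continuation line
        text = line[3:].strip()
        if text and section is not None:
            section.append(text)
    return records


def format_citation(record):
    """Join the collected lines into one citation string."""
    authors = ", ".join(name + "." for name in record["AU"])
    title = " ".join(record["TI"])
    year = " ".join(record["PY"])
    doi = "".join(record["DI"])
    return "{}, {}, {}, https://doi.org/{}".format(authors, year, title, doi)


for record in parse_records(SAMPLE.split("\n")):
    print(format_citation(record))
```

Resetting `fields` at each `ER` line is what keeps the authors of one article from leaking into the next.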

Answer 2

Initial Understanding

Based on your example, I believe that:

The text is provided in lines.
The example text appears to have too many newlines, possibly an artifact of it being migrated from DOS/Windows. If so, either CRLF processing is needed, or alternate lines should be ignored.
The lines are divided into sections.
Each section is delimited by a two-letter uppercase tag in columns 0-1 of the first line in the section, and continues until the start of a new section.
Each line has either a tag or 2 blank spaces, followed by a blank space, in columns 0-2.
The artificial section delimited by tag ER marks the end of the record.
The ER section contains no usable text.
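The column layout described above can be checked with a small helper. This is a minimal sketch; the function name `classify` and its return shape are invented for illustration and are not part of any library.

```python
def classify(line):
    """Split a line by fixed columns: tag in 0-1, separator in 2, text from 3.

    Returns a tuple (kind, tag, text), where kind is 'tag',
    'continuation' or 'blank'.
    """
    line = line.rstrip("\r\n")  # tolerate stray CR/LF characters
    tag, text = line[:2], line[3:]
    if tag.strip():
        return ("tag", tag, text)
    if text.strip():
        return ("continuation", None, text)
    return ("blank", None, "")


for example in ["AU Chen, G", "   Gully, SM", "", "ER"]:
    print(repr(example), "->", classify(example))
```

Slicing by position rather than testing `"AU" in line` avoids false matches on text that merely contains the letters of a tag.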

It may also be the case that:

Records begin with the FN tag.
Any text encountered outside of an FN/ER pair can be ignored.

Suggested Design

If this is true, I recommend you write a text processor using that logic:

Read lines.
Handle CR/LF processing; or skip alternate lines; or "don't worry, the real text doesn't have these line breaks"?
Use a state machine with an unknown number of states, the initial state being ER.
Special rule: ignore text in the ER state until an FN line is encountered.
General rule: when a tag is seen, end the previous state and begin a new state named after the seen tag. Any accumulated text is added to the record.
If no tag is seen, accumulate text in the previous tag.
Special rule: when the ER state is entered, add the accumulated record to the list of accumulated records.
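For the line-reading step, one possible sketch is shown below. The helper name `clean_lines` is hypothetical. It relies on `str.splitlines()`, which splits on `\n`, `\r\n` and `\r` (and on a few rarer separators), and it assumes that blank lines carry no information in this format.

```python
def clean_lines(text):
    """Split text on any newline convention and drop blank lines."""
    return [line for line in text.splitlines() if line.strip()]


# The same two lines with DOS line endings and an extra blank line
dos_text = "AU Chen, G\r\n\r\n   Gully, SM\r\n"
print(clean_lines(dos_text))
```

Dropping blank lines instead of skipping every second line keeps the reader working whether or not the real file has the doubled line breaks.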

At the end of this process, you will have a list of records, each holding the text accumulated for its tags. You may then process the tags in various ways.

Something like this:

from warnings import warn

Debug = True

def read_lines_from(file):
    """Read and split lines from file. This is a separate function, instead
       of just using file.readlines(), in case extra work is needed like
       dos-to-unix conversion inside a unix environment.
    """
    with open(file) as f:
        text = f.read()
        lines = text.split('\n')

    return lines

def parse_file(file):
    """Parse file in format given by 
        https://stackoverflow.com/questions/54520331
    """
    lines = read_lines_from(file)
    state = 'ER'
    records = []
    current = None

    for line_no, line in enumerate(lines):
        tag, rest = line[:2], line[3:]

        if Debug:
            print(F"State: {state}, Tag: {tag}, Rest: {rest}")

        # Skip empty lines
        if tag == '':
            if Debug:
                print(F"Skip empty line at {line_no}")
            continue

        if tag == '  ':
            # Append text, except in ER state.
            if state != 'ER':
                if Debug:
                    print(F"Append text to {state}: {rest}")
                current[state].append(rest)
            continue

        # Found a tag. Process it.

        if tag == 'ER':
            if Debug:
                print("Tag 'ER'. Completed record:")
                print(current)

            records.append(current)
            current = None
            state = tag
            continue

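        if tag == 'FN':
            if state != 'ER':
                warn(F"Found 'FN' tag without previous 'ER' at line {line_no}")
                if len(current.keys()):
                    warn(F"Previous record (FN:{current['FN']}) discarded.")

            if Debug:
                print("Tag 'FN'. Create empty record.")

            current = {}
        elif state == 'ER':
            # Special rule: ignore tags seen outside an FN/ER pair
            # (current is None here, so there is no record to add to).
            if Debug:
                print(F"Ignore tag '{tag}' outside FN/ER at line {line_no}")
            continue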

        # All tags except ER get this:
        if Debug:
            print(F"Tag '{tag}'. Create list with rest: {rest}")

        current[tag] = [rest]
        state = tag

    return records

if __name__ == '__main__':
    records = parse_file('input.txt')
    print('Records =', records)
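Once the records exist, each one maps a tag to the list of lines collected for it. A possible way to turn those lists into plain strings is sketched here; `flatten`, its `separators` parameter and the example record are made up for illustration and only mirror the shape that `parse_file` returns.

```python
def flatten(record, separators=None):
    """Join each tag's lines into one string.

    `separators` maps a tag to its joining string; tags not listed
    are joined with a single space.
    """
    separators = separators or {}
    return {
        tag: separators.get(tag, " ").join(lines)
        for tag, lines in record.items()
    }


# Hand-written record in the shape produced by the parser above
record = {
    "AU": ["Chen, G", "Gully, SM"],
    "TI": ["Examination of relationships among trait-like individual differences,",
           "state-like individual differences, and learning performance"],
    "PY": ["2000"],
}

flat = flatten(record, separators={"AU": "; "})
print(flat["AU"])
print(flat["TI"])
```

Author names contain commas themselves, so a different separator such as `"; "` for the AU tag keeps the individual names distinguishable.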

2 answers

Answer 1
1 2019-02-04 16:37:05

Answer 2
0 2019-02-04 17:50:56
