
Basic Python text extraction scenario

I am currently working with a text file that looks like this.

NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER

I would like to extract the number (just the integers) and save them all to a text file that would read:

6367283940
6367283940
6367283940

How would I go about doing this?

I am brand new.

There are a few ways you might approach this.

Regex

A simple regex pattern should work.

import re
text = """\
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
"""
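pattern = r'^NUMBER = (\d+)'

# re.MULTILINE makes ^ match at the start of every line, not only the start of the string
for number in re.findall(pattern, text, flags=re.MULTILINE):
    print(number)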

6367283940
6367283940
6367283940

For an explanation of the regex, see this regex101 link. In short, (\d+) captures the run of digits that follows the literal text NUMBER = .
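Since the goal is to save the numbers to a file rather than print them, the same pattern can be applied to a file's contents. This is only a minimal sketch: the function names extract_numbers and save_numbers are made up for illustration, and the file paths are whatever you pass in.

```python
import re

# Matches "NUMBER = <digits>" at the start of any line (because of re.MULTILINE).
NUMBER_PATTERN = re.compile(r"^NUMBER = (\d+)", flags=re.MULTILINE)


def extract_numbers(text):
    """Return every number that follows 'NUMBER = ' at the start of a line."""
    return NUMBER_PATTERN.findall(text)


def save_numbers(input_path, output_path):
    """Read input_path, extract the numbers, and write them one per line."""
    with open(input_path) as source:
        numbers = extract_numbers(source.read())
    with open(output_path, "w") as target:
        target.writelines(number + "\n" for number in numbers)
    return numbers
```

Calling save_numbers("input.txt", "numbers.txt") (hypothetical file names) would then write one number per line to numbers.txt.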

String splitting

A more rudimentary approach is to use plain string methods such as .split:

with open('mytext.txt') as f:
    for line in f:
        fields = line.split('|')
        number_field = fields[0]                # e.g. 'NUMBER = 6367283940 '
        _, number = number_field.split(' = ')
        print(number.strip())                   # strip the trailing space
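The same splitting idea generalizes to every field on the line, not only the first. A rough sketch, assuming each field has the form KEY = VALUE; the helper name parse_line is invented here for illustration.

```python
def parse_line(line):
    """Split 'KEY = VALUE | KEY = VALUE' into a dict of stripped strings."""
    record = {}
    for field in line.split("|"):
        key, separator, value = field.partition("=")
        if separator:  # ignore fragments that contain no '='
            record[key.strip()] = value.strip()
    return record
```

With this helper, parse_line(line)["NUMBER"] gives the number as a string, and a blank line simply yields an empty dict.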

csv/pandas

Because your file is pipe-delimited, you could also use the csv module, or pandas as Nuno Carvalho suggests in his answer.
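For the csv route, a minimal sketch might look like the following. It assumes the number is always in the first pipe-delimited field; the function name numbers_from_pipe_file is hypothetical.

```python
import csv


def numbers_from_pipe_file(path):
    """Yield the number from the first pipe-delimited field of each row."""
    with open(path, newline="") as source:
        for row in csv.reader(source, delimiter="|"):
            if not row:  # csv.reader returns [] for blank lines
                continue
            # row[0] looks like 'NUMBER = 6367283940 '
            _, _, value = row[0].partition("=")
            yield value.strip()
```

Because it is a generator, you can loop over it directly or collect the results with list(numbers_from_pipe_file("mytext.txt")).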

I suggest using pandas.

1 - Install the module.

pip install pandas

2 - Save that text in a file named "text.csv".

3 - Run this script.

import pandas as pd

data = pd.read_csv("text.csv", header=None, sep="|")

print(data[0])

# Remove the 'NUMBER = ' prefix and the whitespace around the number
numbers = data[0].str.replace("NUMBER = ", "", regex=False).str.strip()

# Write the numbers, one per line, without a header or index column
numbers.to_csv("your-numbers.csv", header=False, index=False)

Result:

your-numbers.csv

6367283940
6367283940
6367283940

First, read the text file with the readlines method, which returns its lines as a list. Then loop over the lines, split each one on single spaces, and take the third element (index 2), which is the number on every line of this input. Append it to the string numbers, followed by a newline (\n). Finally, write that string to a text file.

with open("data.txt") as file:
    data = file.readlines()

numbers = ""
for line in data:
    numbers += line.split(" ")[2]   # the third space-separated token is the number
    numbers += "\n"

with open("numbers.txt", mode="w") as file:
    file.write(numbers)

This script should work if you name your text file input.txt; you can change the name in the code. I added comments to make each step clear for someone with less experience. I hope this helps.

INPUT_FILE = "./input.txt"
OUTPUT_FILE = "./output.txt"


def main():
    result_numbers = []
    with open(INPUT_FILE) as file:                      # open the text file in read-only mode
        lines = file.readlines()                        # fetch all lines
        for line in lines:                              # iterate through the lines
            first_field = line.split("|")[0].strip()    # we only need the first field, without the extra spaces
            number = first_field.split("=")[1].strip()  # take the part after the = and drop the space before it
            result_numbers.append(number)               # add the number to the result list
    with open(OUTPUT_FILE, "w") as file:                # open a new text file in write mode to save the results
        file.write("\n".join(result_numbers))           # join the results with line breaks and write them to the file


if __name__ == '__main__':
    main()

If you have any questions, feel free to ask.

# input.txt is the input file and output.txt is the output file.
import re

with open('input.txt') as file:
    lines = [line.rstrip() for line in file]

with open('output.txt', 'a') as file_out:   # 'a' appends; use 'w' to overwrite instead
    for x in lines:
        # capture the digits that follow 'NUMBER = '
        match = re.search(r'NUMBER = (\d+)', x)
        if match:                           # skip lines without a number
            file_out.write(match.group(1) + '\n')
