I am a Python beginner looking for help with an extraction problem.
I have a bunch of text files and need to extract every occurrence of a specific pattern ("C" followed by exactly 9 numerical digits) and write the matches to a file, together with the name of the text file they came from. Each occurrence of the pattern I want to catch starts at the beginning of a new line and ends with a "\n".
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
What the output should look like (in the output file):
My_desired_output_file = "filename,C123456789,C987654321"
My code so far:
import os

min_file_size = 5

def list_textfiles(directory, min_file_size):
    # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
    return textfiles

for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    with open(filename, 'r', encoding="utf-8") as infile:
        text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)
Your regex is '^C[0-9]{9}$':
^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
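One caveat worth noting: when the pattern is applied to the whole text of a file rather than to one line at a time, `^` and `$` only match at line boundaries if the `re.MULTILINE` flag is set. A minimal sketch using the sample text from the question (the variable names `pattern` and `codes` are illustrative):

```python
import re

sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
pattern = re.compile(r'^C[0-9]{9}$', re.MULTILINE)

codes = pattern.findall(sample_text)
print(codes)  # ['C123456789', 'C987654321']
```

Without the flag, the same pattern finds nothing in this text, because the string as a whole neither starts with "C" nor ends right after nine digits.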
import re
# Raw string; the trailing $ rejects lines with more than 9 digits
regex = re.compile(r'^C\d{9}$')
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if regex.match(line):
            matches.append(line)
You can then write this list to a file as needed.
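Writing the result in the format from the question (file name first, then the matches, comma-separated) could look like the sketch below. The helper names `format_matches` and `append_matches` are assumptions for illustration, not part of the original answer.

```python
import os

def format_matches(filename, matches):
    """Build one output line: the file name followed by the matches, comma-separated."""
    return ",".join([os.path.basename(filename)] + list(matches))

def append_matches(out_path, filename, matches):
    """Append one line to the output file, so several input files can share it."""
    with open(out_path, "a", encoding="utf-8") as outfile:
        outfile.write(format_matches(filename, matches) + "\n")

# Example values standing in for the file name and the collected matches list.
print(format_matches("file.txt", ["C123456789", "C987654321"]))
# file.txt,C123456789,C987654321
```

Calling `append_matches` once per input file builds up the output file one line at a time.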
How about:
import re
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
k = re.findall(r'(C\d{9})', sample_text)
print(k)
This returns all occurrences of that pattern. You can also read each text file line by line and store the matches found in it. Something like:
Updated:
import glob
import os
import re
search = {}
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        # Collect every match in the file as one flat list
        data = [m for line in f for m in re.findall(r'(C\d{9})', line)]
        search.update({f.name: data})
print(search)
This builds a dictionary with the file names as keys and, for each file, a list of the matches found in it.
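To get from such a dictionary to the output format asked for in the question, one option is to write one comma-separated line per file. A minimal sketch, assuming the dictionary maps each file name to a flat list of matches; the helper name `write_search_results` and the example values are hypothetical:

```python
import csv

def write_search_results(search, out_path):
    """Write one row per file: the file name, then every match found in it."""
    with open(out_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile, lineterminator="\n")
        # Sort by file name so the output order is reproducible.
        for name, matches in sorted(search.items()):
            writer.writerow([name] + list(matches))

# Example dictionary in the shape described above (values are illustrative).
search = {"a.txt": ["C123456789", "C987654321"], "b.txt": []}
```

Calling `write_search_results(search, "output.csv")` would then produce the line `a.txt,C123456789,C987654321` for the first file and a line containing only `b.txt` for a file without matches.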