
Using Pandas and Regex to search through and extract values of a txt file

I have 2 data tables that I am attempting to extract values from. Here is my current script.

import re 
import os
import pandas as pd

os.chdir('C:/Users/Sams PC/Desktop')

test1=pd.read_csv('test1.txt', sep=r'\s+', header=None)
test1.columns=['Column_1','Column_2','Column_3']
test2=pd.read_csv('test2.txt', sep=r'\s+', header=None)
test2.columns=['Column_1','Column_2','Column_3','Column_4']

if 'S31N' in test1:
    data2=nhsqc[['Column_1','Column_2']].copy()
    if 'S31N-CA-HN' in test2:
        data2=nhsqc[['Column_3']].copy()
    else:
        print('Not Found')      
else:
    print('Not Found')


print(test1)
print (test2)

With this output:

Not Found
0  S31N-HN   114.424     7.390
1  Y32N-HN   121.981     7.468
           Column_1  Column_2  Column_3  Column_4
0  S31N-A30CA-S31HN   114.424    54.808     7.393
1  S31N-A30CA-S31HN   126.854    53.005     9.277
2        S31N-CA-HN   114.424    61.717     7.391
3        S31N-HA-HN   126.864    59.633     9.287
4  Y32N-S31CA-Y32HN   121.981    61.674     7.467
5        Y32N-CA-HN   121.981    60.789     7.469
6  Q33N-Y32CA-Q33HN   120.770    60.775     8.582

I am able to organize the tables using pandas. Next, I want to extract values from the columns associated with, say, 'S31N'. However, as you can see, my if statement does not find S31N, even though it exists in my data table. If I change that value to a column header (if 'Column_1' in test1:), then it works. I don't understand why it only searches the column headers and not the actual table contents.

Furthermore, when the if statement does work (using the column header), the second if statement overwrites the data2 table created by the first. How can I add the result to data2 as an extra column rather than overwriting it?

I removed the second half of the question since that issue was resolved. However, the main issue still stands: my script is still unable to find my values. Updated script:

x=re.findall('[A-Z][0-9][0-9][A-Z]',str(test1))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]',str(test2))
print (x,y)

for i in range (0,2):
    if x[i] in test1:
        data2=nhsqc[['Column_1','Column_2']].copy()
        if y[i] in test2:
            data2=nhsqc[['Column_3']].copy()
            print (data2)
        else:   
            print('Not Found')      
    else:
        print('Not Found')


print(x[i])

Output:

['S31N', 'Y32N'] ['S31N-CA', 'Y32N-CA']
Not Found
Not Found
Y32N

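This might get you closer. The problem is most likely the type of test1 and test2: they are DataFrames, and the in operator on a DataFrame checks the column labels, not the cell values. Converting them to strings wherever you test membership, i.e. str(test1) or str(test2), is one way to make it work.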
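To illustrate the point about types, here is a minimal sketch that compares the different membership checks. The small DataFrame is a stand-in built from the values printed in the question, not the original test1.txt file, and the variable names are only for illustration.

```python
import pandas as pd

# Stand-in for test1, using the values shown in the question's output.
test1 = pd.DataFrame({
    'Column_1': ['S31N-HN', 'Y32N-HN'],
    'Column_2': [114.424, 121.981],
    'Column_3': [7.390, 7.468],
})

label_hit = 'Column_1' in test1   # `in` on a DataFrame looks at column labels
value_hit = 'S31N' in test1       # so a cell value is not found this way
text_hit = 'S31N' in str(test1)   # substring search in the printed table
# Substring search in the actual column values (no regex needed here).
cell_hit = test1['Column_1'].str.contains('S31N', regex=False).any()

print(label_hit, value_hit, text_hit, cell_hit)
```

The last check searches the column itself, so it does not depend on how the table happens to be printed.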

Test

x=re.findall('[A-Z][0-9][0-9][A-Z]',str(test1))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]',str(test2))
print (x,y)

for i in range (0,2):
    if x[i] in str(test1):
        data2=nhsqc[['Column_1','Column_2']].copy()
        if y[i] in str(test2):
            data2=nhsqc[['Column_3']].copy()
            print (data2)
        else:   
            print('Not Found')      
    else:
        print('Not Found')


print(x[i])
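One caveat with searching str(test1): the string is the printed form of the table, and pandas shortens that form for long tables. The sketch below uses a made-up 200-row frame (the labels are hypothetical, not from the question) and pins the display options to their usual defaults to show the effect.

```python
import pandas as pd

# Hypothetical 200-row table; the labels are invented for this demonstration.
big = pd.DataFrame({'Column_1': [f'R{i:03d}N-HN' for i in range(200)]})

# With these display settings only the first and last few rows are printed.
with pd.option_context('display.max_rows', 60, 'display.min_rows', 10):
    in_repr = 'R100N' in str(big)

# to_string() renders every row, and searching the column avoids text entirely.
in_full_text = 'R100N' in big.to_string()
in_column = big['Column_1'].str.contains('R100N', regex=False).any()

print(in_repr, in_full_text, in_column)
```

For the two small tables in the question this makes no difference, but for longer peak lists searching the column values is the safer choice.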

Simulated Test

import re
test1 = '''
0  S31N-HN   114.424     7.390
1  Y32N-HN   121.981     7.468
'''

test2 = '''
           Column_1  Column_2  Column_3  Column_4
0  S31N-A30CA-S31HN   114.424    54.808     7.393
1  S31N-A30CA-S31HN   126.854    53.005     9.277
2        S31N-CA-HN   114.424    61.717     7.391
3        S31N-HA-HN   126.864    59.633     9.287
4  Y32N-S31CA-Y32HN   121.981    61.674     7.467
5        Y32N-CA-HN   121.981    60.789     7.469
6  Q33N-Y32CA-Q33HN   120.770    60.775     8.582
'''

x = re.findall('[A-Z][0-9][0-9][A-Z]', str(test1))
y = re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]', str(test2))
print(x, y)

for i in range(0, 2):
    if x[i] in str(test1):
        print(x[i])
        # nhsqc is not defined in this simulation, so the copy is skipped:
        # data2 = nhsqc[['Column_1', 'Column_2']].copy()
        if y[i] in str(test2):
            # data2 = nhsqc[['Column_3']].copy()
            print(y[i])
        else:
            print('Not Found')
    else:
        print('Not Found')

Output

['S31N', 'Y32N'] ['S31N-CA', 'Y32N-CA']
S31N
S31N-CA
Y32N
Y32N-CA
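As an alternative to string conversion, the lookup can stay inside pandas. The following is only a sketch under some assumptions: the two frames are rebuilt from the values printed in the question, the goal is taken to be one row per residue with Column_3 of the matching 'CA-HN' row added as an extra column, and the helper column name residue is made up for this example.

```python
import pandas as pd

# Frames rebuilt from the output shown in the question.
test1 = pd.DataFrame(
    [['S31N-HN', 114.424, 7.390], ['Y32N-HN', 121.981, 7.468]],
    columns=['Column_1', 'Column_2', 'Column_3'])
test2 = pd.DataFrame(
    [['S31N-A30CA-S31HN', 114.424, 54.808, 7.393],
     ['S31N-A30CA-S31HN', 126.854, 53.005, 9.277],
     ['S31N-CA-HN', 114.424, 61.717, 7.391],
     ['S31N-HA-HN', 126.864, 59.633, 9.287],
     ['Y32N-S31CA-Y32HN', 121.981, 61.674, 7.467],
     ['Y32N-CA-HN', 121.981, 60.789, 7.469],
     ['Q33N-Y32CA-Q33HN', 120.770, 60.775, 8.582]],
    columns=['Column_1', 'Column_2', 'Column_3', 'Column_4'])

# Pull the leading residue key (e.g. 'S31N') out of each label.
# 'residue' is a hypothetical helper column, not part of the original data.
pattern = r'^([A-Z]\d{2}[A-Z])'
test1['residue'] = test1['Column_1'].str.extract(pattern, expand=False)

# Keep only the rows of test2 whose label is exactly '<residue>-CA-HN'.
ca = test2[test2['Column_1'].str.match(r'[A-Z]\d{2}[A-Z]-CA-HN$')].copy()
ca['residue'] = ca['Column_1'].str.extract(pattern, expand=False)

# Add test2's Column_3 as an extra column instead of overwriting data2.
data2 = test1[['residue', 'Column_1', 'Column_2']].merge(
    ca[['residue', 'Column_3']], on='residue', how='left')
print(data2)
```

With how='left', a residue that has no matching 'CA-HN' row would keep its test1 values and get NaN in Column_3, rather than being dropped.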
