
Calling a user-defined function in a for loop in Python 2.7

Is there a problem with my Python script?

from BeautifulSoup import BeautifulSoup
import requests
import re
from collections import defaultdict
import itertools
import pandas as pd

def wego(weburl,annot):
    print 'Go Term: ', weburl.split('=')[-1]
    html=requests.get(weburl).text
    soup=BeautifulSoup(html)
    desc=r"desc=\".*\""
    print "GO leave 2 term:",(re.findall(desc,str(soup))[0].split('"')[1])
    pattern=r"Unigene.*A"
    idDF = pd.DataFrame(columns=['GeneID']) # creates a new DataFrame
    idDF['GeneID'] = pd.Series(re.findall(pattern,str(soup))).unique()
    print "Total Go term is :",idDF.shape[0]
    old=pd.read_csv(annot,usecols=[0,7,8])
    getset=pd.merge(left=idDF,right=old,left_on=idDF.columns[0],\
    right_on=old.columns[0])
    updown=getset.groupby(getset.columns[1]).count()
    print updown
    print "Max P-value: ","{:.3e}".format(getset['P-value'].max())

with open("gourl.txt") as ur:
    d=[]
    for url in ur:
        we=wego(url,annot="file.csv")
        d.append(we)

My gourl.txt file contains one URL per line:

http://stackoverflow.com/questions=1
http://stackoverflow.com/questions=2

My question is: why does the script succeed when gourl.txt contains only one URL, but fail when it contains more than one?

The error is as follows:

IndexError: list index out of range
IndexErrorTraceback (most recent call last)
<ipython-input-79-a852fe95d69c> in <module>()
  2     d=[]
  3     for url in ur:
----> 4         we=wego(url,annot="file.csv")
  5         d.append(we)
<ipython-input-4-9fdf25e75434> in wego(weburl, annot)
  5     soup=BeautifulSoup(html)
  6     desc=r"desc=\".*\""
----> 7     print "GO leave 2 term:",(re.findall(desc,str(soup))[0].split('"')[1])
  8     pattern=r"Unigene.*A"
  9     idDF = pd.DataFrame(columns=['GeneID']) #creates a new dataframe 
 IndexError: list index out of range

If you look at the stack trace you posted, you can see the answer. The last line says that you are trying to access a list element that does not exist ("out of range"). The failing statement is:

print "GO leave 2 term:",(re.findall(desc,str(soup))[0].split('"')[1])

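This line indexes a list twice: once with [0] to take the first regex match, and once with [1] to take the second token produced by split('"').

So the page at the second URL probably does not contain the pattern you expect. In that case re.findall returns an empty list, and indexing it with [0] raises IndexError.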

You can use something like this:

matches = re.findall(desc, str(soup))
tokens = []
if matches:
    tokens = matches[0].split('"')
if len(tokens) > 1:
    print "GO leave 2 term:", tokens[1]
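
The same guard can also be wrapped in a small helper, so the caller decides what to do when the pattern is missing. This is only a sketch: the name extract_desc and the sample strings are hypothetical, and it swaps the findall/split combination for a regex capturing group, assuming the description appears in the page as desc="...".

```python
import re

# Hypothetical helper, not part of the original script.
# Captures the text between the quotes of desc="...".
DESC_PATTERN = re.compile(r'desc="([^"]*)"')

def extract_desc(page_text):
    """Return the desc value, or None when the page has no such attribute."""
    match = DESC_PATTERN.search(page_text)
    if match is None:
        return None
    return match.group(1)

# Made-up sample inputs for illustration only.
found = extract_desc('<term desc="metabolic process" id="x">')
missing = extract_desc('<html>no description here</html>')
```

Returning None instead of raising lets the loop over the URLs skip or report a page that lacks the pattern rather than stopping at the first such page.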

I'm glad the question is solved. The problem is the trailing \n on each line read from my gourl.txt file, as shown here:

>>> with open("wegourl.txt") as ur:
...     d=[]
...     for url in ur:
...         print url
...         

http://stackoverflow.com/questions=1

http://stackoverflow.com/questions=2

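Each line read from the file still ends with a newline character (print adds its own newline, which is why blank lines appear in the output above). That trailing \n becomes part of the URL passed to wego, so the URL is not valid and the script is interrupted. The fix is to strip the \n when reading the file: url=url.strip('\n')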
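
A minimal sketch of that fix, assuming the file holds one URL per line and may contain blank lines. The function name read_urls and the temporary file are hypothetical and used only for the demonstration. It calls strip() with no argument, which also removes '\r' and stray spaces, a slightly broader cleanup than strip('\n').

```python
import os
import tempfile

def read_urls(path):
    """Return the non-empty lines of a file with surrounding whitespace removed."""
    urls = []
    with open(path) as handle:
        for line in handle:
            url = line.strip()  # drops the trailing '\n' (and '\r', spaces)
            if url:             # skip blank lines
                urls.append(url)
    return urls

# Demonstration with a throwaway file that mimics gourl.txt.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("http://stackoverflow.com/questions=1\n"
          "http://stackoverflow.com/questions=2\n"
          "\n")
tmp.close()
urls = read_urls(tmp.name)
os.remove(tmp.name)
```

The loop in the question could then iterate over read_urls("gourl.txt") and pass each cleaned URL to wego.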

