I am scraping pages from 7 of 8 domains. I get the output I want, but for some reason the same output is generated 7 times instead of just once. Here is the simplified code:
import pandas as pd

def firstpage(pp):
    city = [0, 1, 2, 3, 4, 5, 6, 7]
    p1 = []
    pp = pd.DataFrame()
    for i in city:
        response = i
        if response > 0:
            p = ['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15',
                 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23']
            for a in p:
                page = str(a)
                page = 'https://www.uno.com/' + str(i) + '/' + page
                p1.append(page)
        else:
            print("error")
    pp = pd.DataFrame(p1)
    pp.columns = ['Links']
    pp.to_csv('Test.csv', sep=',')
    return pp

AllFirstPages = pd.DataFrame()
%timeit firstpage(AllFirstPages)
I also tried placing the pp block right after p1.append(page).
The same thing happens: the output is correct, but the loop runs multiple times, which makes it inefficient.
The correct output is
What am I doing wrong? Why does the loop run 6 more times and produce the same output?
I am thinking of creating the pandas DataFrame outside the loop, but how do I do that inside the function?
Thanks!
I think the confusion comes from writing a function that declares a parameter it never uses ("pp" is overwritten before it is read, so it is not really an input) and then trying to pass a DataFrame in from outside the function. Apart from some strange design choices, your code works fine like this:
def firstpage():
    city = [0, 1, 2, 3, 4, 5, 6, 7]
    p1 = []
    for i in city:
        response = i
        if response > 0:
            p = ['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15',
                 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23']
            for a in p:
                page = str(a)
                page = 'https://www.uno.com/' + str(i) + '/' + page
                p1.append(page)
        else:
            print("error")
    return p1

print(firstpage())
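One possible explanation for the repeated output, which the thread itself does not confirm, is the %timeit call in the question: a timing harness runs the function many times by design, so every print and every CSV write inside it is repeated. The sketch below uses the standard-library timeit module as a stand-in for the IPython magic, and build_links and counted are hypothetical names introduced only for illustration. Unlike the original, it skips city 0 silently instead of printing "error".

```python
import timeit

def build_links(cities, pages, base="https://www.uno.com/"):
    """Return one URL per (city, page) pair, skipping cities that are not > 0."""
    return [f"{base}{city}/{page}" for city in cities if city > 0 for page in pages]

cities = range(8)                      # 0..7, as in the question
pages = [f"a{n}" for n in range(24)]   # 'a0' .. 'a23'

# Calling the function directly runs the loops exactly once.
links = build_links(cities, pages)
print(len(links))                      # 7 cities * 24 pages

# Record every invocation made by the timing harness.
call_log = []

def counted():
    call_log.append(1)
    return build_links(cities, pages)

# Assumption: 7 repeats of 1 call each, chosen here to mirror the "7 times" in the question.
timeit.repeat(counted, repeat=7, number=1)
print(len(call_log))
```

If that is indeed the cause, calling the function once without the timing wrapper should produce the output a single time. The DataFrame can then be built once from the returned list, for example with pd.DataFrame(links, columns=['Links']), outside both the loops and the function.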