
Facing an issue with "wget" in Python

I am new to Python. I am facing an issue with "wget" as well as with "urllib.urlretrieve(str(myurl), tail)".

When I run the script it downloads the files, but the filenames end with "?".

My complete code:

import os
import wget
import urllib
import subprocess
with open('/var/log/na/na.access.log') as infile, open('/tmp/reddy_log.txt', 'w') as outfile:
    results = set()
    for line in infile:
        if ' 200 ' in line:
            tokens = line.split()
            results.add(tokens[6]) # 7th token
    for result in sorted(results):
        print >>outfile, result
with open('/tmp/reddy_log.txt') as infile:
    results = set()
    for line in infile:
        head, tail = os.path.split(line)
        print tail
        myurl = "http://data.xyz.com" + str(line)
        print myurl
        wget.download(str(myurl))
        # urllib.urlretrieve(str(myurl), tail)

Output:

# python last.py
0011400026_recap.xml

http://data.na.com/feeds/mobile/android/v2.0/video/games/high/0011400026_recap.xml

latest_1.xml

http://data.na.com/feeds/mobile/iphone/article/league/news/latest_1.xml

currenttime.js

Listing the files:

# ls
0011400026_recap.xml?                   currenttime.js?  latest_1.xml?      today.xml?

A possible explanation of the behaviour you experience is that you do not sanitize your input line:

with open('/tmp/reddy_log.txt') as infile:
    ...
    for line in infile:
        ...
        myurl = "http://data.xyz.com" + str(line)
        wget.download(str(myurl))

When you iterate over a file object (for line in infile:), each string you get is terminated by a newline character ('\n'). If you do not remove the newline before using line, it is still there in whatever you build from line, including the URL and the name of the downloaded file.
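A minimal sketch of the fix, assuming the paths are read one per line as in the question: strip each line before building the URL and the local filename. The helper name build_targets and the sample path are made up for illustration; the base URL is the one from the question, and the actual download call is left as a comment so the sketch runs without network access.

```python
import os

BASE_URL = "http://data.xyz.com"  # base URL as written in the question


def build_targets(lines, base_url=BASE_URL):
    """Return (url, filename) pairs with the trailing newline removed.

    `build_targets` is a hypothetical helper, not part of wget or urllib.
    """
    targets = []
    for line in lines:
        path = line.strip()  # removes '\n' and any surrounding whitespace
        if not path:         # skip blank lines
            continue
        filename = os.path.basename(path)
        targets.append((base_url + path, filename))
    return targets


# Illustrative input: what iterating over the file would yield.
sample = ["/feeds/mobile/iphone/article/league/news/latest_1.xml\n", "\n"]

for url, filename in build_targets(sample):
    print("%s -> %s" % (url, filename))
    # The download itself would then be, as in the question:
    # wget.download(url)
```

With the newline gone from the URL, the saved filename no longer carries it either.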

As an illustration of this concept, have a look at the transcript of a test I've done

08:28 $ cat > a_file
a
b
c
08:29 $ cat > test.py
data = open('a_file')
for line in data:
    new_file = open(line, 'w')
    new_file.close() 
08:31 $ ls
a_file  test.py
08:31 $ python test.py
08:31 $ ls
a?  a_file  b?  c?  test.py
08:31 $ ls -b
a\n  a_file  b\n  c\n  test.py
08:31 $

As you can see, I read lines from a file and create some files using line as the filename, and the filenames as listed by ls have a ? at the end. We can do better, though, as explained in the manual page of ls:

  -b, --escape
         print C-style escapes for nongraphic characters

As the output of ls -b shows, the filenames are not terminated by a question mark (that is just the placeholder ls uses by default for nongraphic characters) but by a newline character.
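The same check can be made from inside Python, where repr() shows control characters as escapes much like ls -b does. This is only a small sketch; the helper name visible and the sample filename are chosen for illustration.

```python
def visible(name):
    """Return the name with control characters shown as escapes."""
    return repr(name)


raw = "latest_1.xml\n"    # what `for line in infile` yields
clean = raw.rstrip("\n")  # trailing newline removed

print(visible(raw))    # 'latest_1.xml\n'
print(visible(clean))  # 'latest_1.xml'
```

Printing repr(line) while debugging makes a stray newline visible immediately, before it ends up in a URL or a filename.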

While I'm at it, I have to say that you should avoid using a temporary file to store the intermediate results of your computation.

A nice feature of Python is generator expressions; if you want, you can write your code as follows:

import wget

# you matched on a '200' on the whole line, I assume that what
# you really want is to match a specific column, the 'error_column'
# that I symbolically load from an external resource
from my_constants import error_column, payload_column

# here it is a sequence of generator expressions, each one relying
# on the previous one

# 1. the lines in the file, stripped from the white space
#    on the right (the newline is considered white space)
#    === not strictly necessary, just convenient because
#    === below we want to test for non-empty lines
lines = (line.rstrip() for line in open('whatever.csv'))

# 2. the lines are converted to a list of 'tokens' 
all_tokens = (line.split() for line in lines if line)

# 3. for each 'tokens' in the 'all_tokens' generator expression, we
#    check for the code '200' and possibly generate a new target
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

# eventually, use the 'targets' generator to proceed with the downloads
for target in targets: wget.download(target)

Don't be fooled by the amount of comments; without comments my code is just:

import wget
from my_constants import error_column, payload_column

lines = (line.rstrip() for line in open('whatever.csv'))
all_tokens = (line.split() for line in lines if line)
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

for target in targets: wget.download(target)
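For readers who want to try the pipeline without the symbolic my_constants module, here is a self-contained sketch that runs on in-memory lines. The column positions are assumptions: index 6 is the token the question extracts, and index 8 is where the status code usually sits in a common-format access log, but the real log may differ. The names successful_paths, STATUS_COLUMN, PATH_COLUMN and the sample log lines are made up for illustration.

```python
# Assumed column positions; adjust them to the real log format.
PATH_COLUMN = 6    # the token the question extracts (tokens[6])
STATUS_COLUMN = 8  # assumption: status code position in a common-format log


def successful_paths(lines, status_column=STATUS_COLUMN, path_column=PATH_COLUMN):
    """Lazily yield the request path of every line whose status is '200'."""
    stripped = (line.rstrip() for line in lines)
    all_tokens = (line.split() for line in stripped if line)
    needed = max(status_column, path_column)
    return (tokens[path_column] for tokens in all_tokens
            if len(tokens) > needed and tokens[status_column] == '200')


# Made-up sample lines standing in for the access log.
sample_log = [
    '1.2.3.4 - - [01/Jan/2015:00:00:00 +0000] "GET /feeds/latest_1.xml HTTP/1.1" 200 512\n',
    '1.2.3.4 - - [01/Jan/2015:00:00:01 +0000] "GET /feeds/missing.xml HTTP/1.1" 404 0\n',
    '\n',
]

for path in successful_paths(sample_log):
    print(path)
    # A download would then use the base URL plus the path, for example:
    # wget.download("http://data.xyz.com" + path)
```

Testing the status column explicitly avoids matching a stray ' 200 ' elsewhere in the line, and the length check skips lines that are too short to contain both columns.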
