
Downloading Data From .txt file containing URLs with Python again

I am trying to download the raw data for each of the 10 URLs listed in a .txt file, writing the raw data from each line (URL) to its own file. I then want to repeat the process to produce processed data (the same pages with the HTML stripped out), all using Python.

import commands
import os
import json

# RAW DATA
input = open('uri.txt', 'r')
t_1 = open('command', 'w')
counter_1 = 0

for line in input:
    counter_1 += 1
if counter_1 < 11:
    filename = str(counter_1)
    print str(line)
filename= str(count)
command ='curl ' + '"' + str(line).rstrip('\n') + '"'+ '> ./rawData/' + filename

output_1 = commands.getoutput(command)
input.close()

# PROCESSED DATA
counter_2 = 0
input = open('uri.txt','r')
t_2 = open('command','w')
for line in input:
    counter_2 += 1
    if counter_2 <11:
      filename = str(counter_2) + '-processed'
      command = 'lynx -dump -force_html ' + '"'+ str(line).rstrip('\n') + '"'+'> ./processedData/' + filename
    print command
output_2 = commands.getoutput(command)
input.close()

I am attempting to do all of this with one script. Can anyone help me refine my code so it runs? It should go through the whole process once for each line in the .txt file; in other words, I should end up with one raw and one processed output file for every URL line in my .txt file.

Break your code up into functions. Currently the code is hard to read and debug. Make a function called get_raw() and a function called get_processed(). Then, for your main loop, you can do:

for line in file:
    get_raw(line)
    get_processed(line)

Or something similar. You should also avoid 'magic numbers' like counter < 11. Why 11? Is it the number of lines in the file? If so, compute the line count instead, for example with len(open('uri.txt').readlines()).
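As a rough sketch of that structure (assuming Python 3, with subprocess in place of the deprecated commands module, and curl and lynx available on your PATH), the two functions and the main loop might look like this:

import os
import subprocess

# Make sure the output directories exist before writing into them
os.makedirs('rawData', exist_ok=True)
os.makedirs('processedData', exist_ok=True)

def get_raw(url, filename):
    # Save the raw page fetched by curl into ./rawData/<filename>
    with open(os.path.join('rawData', filename), 'w') as out:
        subprocess.call(['curl', '-s', url], stdout=out)

def get_processed(url, filename):
    # Save the text rendering produced by lynx into ./processedData/<filename>-processed
    with open(os.path.join('processedData', filename + '-processed'), 'w') as out:
        subprocess.call(['lynx', '-dump', '-force_html', url], stdout=out)

with open('uri.txt') as urls:
    for number, line in enumerate(urls, start=1):
        url = line.strip()
        if url:                      # skip blank lines
            get_raw(url, str(number))
            get_processed(url, str(number))

This way there is no counter to cap at 11: the loop simply runs once per URL, and each URL produces one raw file and one processed file, which is what you described.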
