
Parsing and running commands from a batch file into an interactive shell

I'm trying to run scrapy shell from a batch file. What works so far is launching the batch file and getting into the interactive shell. From there, I'd like to be able to pass commands into the scrapy console (the command lines after shelp()).

My code:

call C:/Users/<user_name>/Anaconda3/Scripts/activate.bat 
scrapy shell <url>

< printing log stats >

2020-03-09 13:38:33 [asyncio] DEBUG: Using selector: SelectSelector
In [1]:  

# that's where it stops; the command below is what I want to run next

from scrapy.http import FormRequest

How do I make the batch file pass and run that last command in the shell?

I realize this question may seem specific to scrapy users, but the answer I found should apply to similar cases with other interactive shells, which is why I'm posting it here.

The solution is piping. I created one batch file that is to be run and one text file with the commands to be passed into the interactive scrapy shell.

Here's the code of the .bat file:

  call C:/Users/<user_name>/Anaconda3/Scripts/activate.bat

  type commands_urls.txt | scrapy shell <fetched_url>

Simple, right?

In the text file I then save all the commands: importing packages, fetching the URL, and saving data from the response to a CSV.
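For illustration, a minimal commands_urls.txt (hypothetical contents, just to show the idea; the file name and output file are placeholders) holds the same kind of lines you would otherwise type into the shell one by one:

import pandas as pd

# "response" already holds the page fetched by "scrapy shell <fetched_url>"
urls = response.css("a::attr(href)").getall()

# save whatever was collected to a CSV
pd.DataFrame({'url': urls}).to_csv("links.csv", index=False)

Every line is read from stdin and executed by the shell in order, exactly as if it had been typed interactively.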

HOWEVER, this produces another problem: looping over a list of ajax requests with scrapy works fine, but the results are not saved as they are supposed to be (pasting the same commands into a separate shell, as in the initial scenario, works normally).

I'll give an example:

import json
import pandas as pd
from bs4 import BeautifulSoup
import time
from scrapy.http import FormRequest

ajax_URL = <the_ajax_URL>

def preparePayload(number):
    """
    Parameters that will be passed into the request.
    """
    payload = {
        "search_params[paged]": number,
        "action": "search_results",
    }
    return payload

urls = []
# an exemplary range, the actual number of requests is different
for i in range(1, 5):

    print("\nCurrent page number: " + str(i))
    time.sleep(5)
    req = FormRequest(ajax_URL, formdata=preparePayload(str(i)), dont_filter=True)
    # scrapy shell helper: performs the request and updates the global "response"
    fetch(req)

    jsonresponse = json.loads(response.body_as_unicode())

    result = jsonresponse['data']
    soup = BeautifulSoup(result, "html.parser")
    items = soup.select("div[class='item-content']")
    if len(items) == 0:
        print("\n\nEnd of results!\n\n")
        break

    for item in items:
        urls.append(item.select_one("a")['href'])
        print(item.select_one("a")['href'])

urls_df = pd.DataFrame()
urls_df['url'] = urls
urls_df.to_csv("test1.csv", mode='a', header=False, index=False)

The above works 100% fine when pasting the commands into the scrapy shell, and fetching data with scrapy for a single request should work fine too. However, when piping it in from the batch file, it goes over the requests and finds the right data, but only the last iteration gets saved to the CSV outside of the loop.

I tried passing the data both as a Series and as a plain list; both were unsuccessful.
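To be concrete, here is a sketch of the two variants of the final assignment I mean (both end up with only the last iteration's data when piped):

# variant 1: assign the collected urls as a plain list
urls_df = pd.DataFrame()
urls_df['url'] = urls
urls_df.to_csv("test1.csv", mode='a', header=False, index=False)

# variant 2: wrap the urls in a pandas Series first
urls_df = pd.DataFrame()
urls_df['url'] = pd.Series(urls)
urls_df.to_csv("test1.csv", mode='a', header=False, index=False)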

I couldn't find any similar issue with batch files on SO. What am I missing here?

Any suggestions would be appreciated.
