
Can't fetch some numbers from a website using requests

I'm trying to fetch some numbers from a webpage using requests. The numbers on that page are rendered as images. The script I've written so far can display them using the PIL library, but it can't print them as text.

website address

Numbers visible in there just above the submit button are like:

[image: enter image description here]

I've tried so far:

import io
import requests
from PIL import Image
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'http://horoscope.horoscopezen.com/'
url = 'http://horoscope.horoscopezen.com/archive2.asp?day=2&month=1&year=2022&sign=1#.Xy07M4oza1v'

def get_numbers(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    image_links = [urljoin(base,td['src']) for td in soup.select("td > img[src^='secimage.asp?']")]
    for image_link in image_links:
        r = requests.get(image_link)
        img = Image.open(io.BytesIO(r.content))
        img.show()
        break

if __name__ == '__main__':
    get_numbers(url)

How can I fetch the numbers from that site?

You don't need to use OCR here. The image is composed of separate images, one per digit, and you can recover the entire number by parsing the image link. The link has the form http://horoscope.horoscopezen.com/secimage.asp?I=1&N=595A5C585A5C, where the I= parameter seems to be the index of the digit and the N= parameter seems to be the entire number, encoded. The translation seems to be as follows:

56 -> 9
57 -> 8
58 -> 7
59 -> 6
5A -> 5
5B -> 4
5C -> 3
5D -> 2
5E -> 1
5F -> 0

Note that these numbers are hex-encoded (all characters are in 0-9, A-F). Since 0x56 corresponds to 9 and 0x5F to 0 (and 0x56 + 9 == 0x5F), we can get the digit with the formula 9 - hex_num + 0x56. For example, 56 is converted to 9 - 0x56 + 0x56 = 9, and 5E is converted to 9 - 0x5E + 0x56 = 9 - 8 = 1.
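As a quick sanity check, the formula can be run against every pair in the table above. This is only a sketch: the `table` dict and the helper name `decode_pair` are made up for illustration, and the mapping itself is inferred from observing the site, not documented anywhere.

```python
# Observed mapping from the table above: two-character hex pair -> digit
table = {
    "56": 9, "57": 8, "58": 7, "59": 6, "5A": 5,
    "5B": 4, "5C": 3, "5D": 2, "5E": 1, "5F": 0,
}

def decode_pair(pair):
    """Hypothetical helper: apply 9 - hex_num + 0x56 to one hex pair."""
    return 9 - int(pair, 16) + 0x56

# Every observed pair should decode to the digit listed in the table
for pair, expected in table.items():
    assert decode_pair(pair) == expected, (pair, expected)
```

If any pair on the live site falls outside 56 to 5F, the assumption behind the formula no longer holds and the mapping would need to be re-checked.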

So you could change your code to print the entire number using something like:

def url_to_number(url):
    all_digits = []
    # We want the encoded number, find '&N=' and get the characters after it
    N = url[url.find('&N=') + 3:]
    # loop the characters in pairs
    for i in range(0, len(N), 2):
        digit = 9 - int(N[i:i+2], 16) + 0x56
        all_digits.append(digit)
    return all_digits

The line digit = 9 - int(N[i:i+2], 16) + 0x56 does the conversion described earlier. int(N[i:i+2], 16) converts each two-character slice from a string to an int, treating it as base 16 (hexadecimal).
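If the N= parameter is not guaranteed to be the last thing in the link, slicing from '&N=' to the end of the string could pick up extra characters. A slightly more defensive sketch, assuming the same encoding, reads the parameter with the standard library's query-string parser instead. The name `url_to_number_string` is hypothetical, and it returns the digits joined into a string instead of a list.

```python
from urllib.parse import urlparse, parse_qs

def url_to_number_string(url):
    """Decode the N= parameter of a secimage.asp link into a digit string.

    Assumes the encoding inferred in the answer: each two-character
    hex pair maps to a digit via 9 - value + 0x56.
    """
    query = parse_qs(urlparse(url).query)
    encoded = query["N"][0]  # raises KeyError if the link has no N= parameter
    digits = []
    # Walk the encoded string two characters at a time
    for i in range(0, len(encoded), 2):
        digits.append(str(9 - int(encoded[i:i + 2], 16) + 0x56))
    return "".join(digits)

if __name__ == "__main__":
    link = "http://horoscope.horoscopezen.com/secimage.asp?I=1&N=595A5C585A5C"
    print(url_to_number_string(link))  # prints 653753
```

In the question's script, this could be called on the first entry of image_links in place of downloading and showing the image, since every link appears to carry the full number in N=.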


 