
How to scrape websites that use Django

I want to create a bot to scrape the website at this address:

https://1xxpers100.mobi/en/line/

The problem is that when I tried to get data from this website, I concluded that it is using Django, because the page contains phrases like {{if group_name}} and others.

There is a loop written in this style that creates table rows, and the information I want is in those rows.

When I download the HTML with Python, I can't find any content there, only "{{code}}" placeholders. But when I use the Chrome developer tools (Inspect) or the console, I can see the content inside the table that I want.

How can I get the HTML that holds the content of that table, as the Chrome tools show it, so I can extract the information I want from this website?

This is how I download the HTML with Python:

import urllib.request

# Download the raw HTML exactly as the server sends it
with urllib.request.urlopen("https://1xxpers100.mobi/en/line/") as fp:
    mybytes = fp.read()

# Decode the response bytes into text
mystr = mybytes.decode("utf8")

This should work for what you want:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://1xxpers100.mobi/en/line/')
soup = BeautifulSoup(r.content, 'lxml')

print(soup.encode("utf-8"))

Here 'lxml' is the parser I use because it worked for the site I tested it on. If you have trouble with it, just try another parser.

Another problem is that the page contains a character that isn't recognized by default, so read the contents of soup using UTF-8.
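To illustrate the encoding point, here is a minimal sketch using only the standard library. The sample bytes and the helper name `decode_page` are made up for illustration and are not part of the answer above; the idea is simply to decode the downloaded bytes as UTF-8 explicitly rather than relying on a default.

```python
# Hypothetical sample bytes standing in for a downloaded page.
raw = "<td>M\u00fcller 2:1</td>".encode("utf-8")


def decode_page(data: bytes, encoding: str = "utf-8") -> str:
    """Decode downloaded bytes, substituting U+FFFD for invalid sequences."""
    return data.decode(encoding, errors="replace")


text = decode_page(raw)

# ascii() escapes non-ASCII characters, so printing works on any console.
print(ascii(text))
```

Using `errors="replace"` is one possible choice: it keeps the rest of the page readable if a single byte sequence is invalid, at the cost of silently losing that character.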

Extra Info

This has nothing to do with Django. HTML has a tree-like structure, in which each element is the parent of the elements nested immediately inside it. You just weren't reading deep enough into the tree.
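The tree structure described above can be made visible with a small sketch. It uses the standard library's `html.parser` rather than BeautifulSoup, and the sample markup and the class name `DepthCollector` are assumptions chosen for illustration, not taken from the site in question.

```python
from html.parser import HTMLParser


class DepthCollector(HTMLParser):
    """Record (depth, tag) for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Hypothetical markup containing only paired tags; void elements such as
# <br> have no closing tag and would need extra handling in this sketch.
sample = "<table><tr><td>A</td><td>B</td></tr></table>"

collector = DepthCollector()
collector.feed(sample)

# Print each tag indented by its nesting depth.
for depth, tag in collector.seen:
    print("  " * depth + tag)
```

Each `td` sits two levels below `table`, so code that looks only at the top-level children of the table would not reach the cell contents.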

