How to scrape websites that use Django
I wanted to create a bot to scrape a website at this address:
https://1xxpers100.mobi/en/line/
The problem is that when I tried to get data from this website, I concluded that it is using Django, because the page contains phrases like {{if group_name}} and others.
A loop written in this template syntax creates the table rows, and the information I want is in those rows.
When I download the HTML with Python, I can't find any content in there except placeholders like "{{code}}". But when I use Chrome developer tools (Inspect) or the console, I can see the content of the table that I want.
How can I get the HTML that holds the content of that table, as Chrome's tools show it, so that I can extract the information I want from this website?
This is how I fetch the page with Python:
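import urllib.request

# Download the page; the with-block closes the connection automatically
with urllib.request.urlopen("https://1xxpers100.mobi/en/line/") as fp:
    mybytes = fp.read()
# Decode the raw bytes into text
mystr = mybytes.decode("utf8")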
This should work for what you want:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://1xxpers100.mobi/en/line/')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.encode("utf-8"))
Here 'lxml' is the parser I use, because it worked for the site I tested it on. If you have trouble with it, just try another parser.
Another problem is that the page contains a character that isn't recognized by default, so read the contents of soup using utf-8.
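To illustrate the encoding point, here is a minimal sketch that uses only the standard library. The helper name decode_page and the sample bytes are made up for this example. It assumes the page is UTF-8 and that replacing an undecodable byte with the Unicode replacement character is acceptable, which may not suit every site.

```python
def decode_page(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode downloaded bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


# Valid UTF-8 decodes unchanged.
good = decode_page("Zürich 2:1".encode("utf-8"))

# b"\xff" can never appear in valid UTF-8, so it is replaced rather than
# raising UnicodeDecodeError. ascii() makes the replacement character visible.
print(ascii(decode_page(b"caf\xff")))
```

With strict decoding (the default), the second call would raise UnicodeDecodeError, so errors="replace" is a trade-off between robustness and losing the original byte.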
Extra Info
This has nothing to do with Django. HTML has what is described as a "tree"-like structure, where each set of tags is the parent of all the child tags immediately inside it. You just weren't reading deep enough into the tree.
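As a rough illustration of walking that tree down to the table rows, the sketch below uses the standard library's html.parser instead of BeautifulSoup, so it runs without extra packages. The class TableRowParser, the helpers extract_rows and has_placeholders, and the sample markup are all hypothetical and not taken from the site in the question. The placeholder check is an added assumption: if cells in the fetched HTML still hold {{...}} text, one possible explanation is that the page fills them in after loading, which this sketch does not handle.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _end_cell(self):
        # Store the open cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _end_row(self):
        # Close any open cell first, then store the row.
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()  # tolerate a missing </td> or </th>
            if self._row is not None:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()


def extract_rows(html):
    """Return all table rows found anywhere in the document tree."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def has_placeholders(rows):
    """True if any cell still contains an unfilled {{...}} placeholder."""
    return any("{{" in cell for row in rows for cell in row)


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<div><table>
  <tr><th>Team</th><th>Odds</th></tr>
  <tr><td>Alpha</td><td>1.85</td></tr>
  <tr><td>{{group_name}}</td><td>{{code}}</td></tr>
</table></div>
"""

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
print(has_placeholders(rows))
```

The same traversal could be written with BeautifulSoup's find_all on the soup object from the answer; the standard-library version is used here only to keep the example dependency-free.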