
Web scraping multiple sites in Python

I signed up to this website just to ask this question, as I have been searching for hours over multiple days and haven't found anything. I am trying to, within 10 seconds, scrape the 2-3 characters from 5 websites, combine them, and paste them into a box. I have a rough idea of what I need, but no idea how to go about it. I believe I want to assign the scraped contents of each website to a variable, and then print the combination of these variables for me to copy and paste. I'm not an expert in Python by any means, so if possible, a copy/pasteable script would be great.

The websites are:

https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D

Keeping this up now only because I cannot take it down. Thank you to those who have helped; I hope this helps someone else. Sorry for being dumb.

Thing is, I've done the code and tried it. It works, but that isn't the answer to the question. Getting the characters from the links and putting them together doesn't work. I've tried many things and I am still working it out myself. My advice: work it out yourself. It's a lot more rewarding and will probably help with future parts of the competition. Also, if you ever think about removing all of the 'a's from the code, that doesn't work either. I tried.

To answer your Stack Overflow question, here is the code (you need to install the 'requests' Python module first):

import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page1_content = requests.get(page1)
page1text = page1_content.text

page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page2_content = requests.get(page2)
page2text = page2_content.text

page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page3_content = requests.get(page3)
page3text = page3_content.text

page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page4_content = requests.get(page4)
page4text = page4_content.text

page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
page5_content = requests.get(page5)
page5text = page5_content.text

print(page1text + page2text + page3text + page4text + page5text)

But this method doesn't answer challenge 14.
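The five near-identical request blocks above can also be written as a loop. This is only a sketch of the same idea: `part_urls` and `fetch_code` are names made up here, the 5-second timeout is an assumption, and the `get` parameter exists only so the helper can be tried with a stand-in instead of the real site.

```python
BASE = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt{}"
VERIFY = "4VjvSgWQQ8yhhiYD9cePtg%3D%3D"  # token copied from the code above; yours will differ


def part_urls(verify=VERIFY, count=5):
    """Build the clock-pt1 ... clock-pt5 URLs for one verify token."""
    return [f"{BASE.format(n)}?verify={verify}" for n in range(1, count + 1)]


def fetch_code(urls, get=None):
    """Fetch every URL in order and join the response bodies."""
    if get is None:
        # Imported lazily so the helper can be exercised offline with a stand-in.
        import requests
        session = requests.Session()  # reuses one connection for all five requests

        def get(url):
            return session.get(url, timeout=5).text

    return "".join(get(url) for url in urls)


# Usage (makes real network requests):
# print(fetch_code(part_urls()))
```

Like the original, this only prints the combined string; it does not submit it anywhere.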

I have done something very similar, with just as poor results in the end. I did, however, leave this running for a while and noticed that the clock follows a pattern. At one point the clock read "aaaaaaaaaaaaaaa", then "aBaa1aafaa2aa3a" and "aDaafaaHaajaala". I'm going to wait for a full list and try submitting the next clock sequence in the final URL. I'll get back to you if this works; it's just something to think about.

Also, for help installing modules, I suggest: https://programminghistorian.org/lessons/installing-python-modules-pip and https://docs.python.org/3/installing/index.html

import requests

abc = ""  # last combined string that was printed
while True:
    # Fetch all five clock parts on every pass.
    page1 = requests.get('your first link')
    page2 = requests.get('your second link')
    page3 = requests.get('your third link')
    page4 = requests.get('your fourth link')
    page5 = requests.get('your fifth link')
    text = page1.text + page2.text + page3.text + page4.text + page5.text

    # abc1 = the verify link with "clock pts" replaced by text, so the end looks like: string=<"+text+">"
    abc1 = text
    # Print only when the combined string has changed.
    if abc1 != abc:
        print(abc1)
        abc = abc1
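The loop above requests the pages as fast as it can and never stops. A gentler variant, sketched here under assumptions, pauses between polls and records when each change was seen, which helps when collecting the full list of codes. The name `watch` and its parameters are invented for this sketch; `fetch` is any zero-argument function that returns the combined string, for example a wrapper around the five `requests.get` calls.

```python
import time


def watch(fetch, polls, delay=1.0, sleep=time.sleep, now=time.time):
    """Call fetch() `polls` times and return (timestamp, code) for each change."""
    seen = []
    last = None
    for _ in range(polls):
        code = fetch()
        if code != last:
            # Record the moment the new code was first observed.
            seen.append((now(), code))
            last = code
        sleep(delay)  # avoid hammering the server between polls
    return seen
```

The one-second default delay is a guess; it only needs to be short compared with how often the clock changes.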

Edit: The clock runs in 15-minute cycles with 90 codes altogether. I'm not sure how this helps yet; I'm just posting ideas. I had to make some changes to get the codes to output cleanly, and here is my improved version (it is very messy, sorry):

import requests

abc = ""  # last combined string that was printed
# Initial snapshot of the five pages, used as the baseline for comparison.
page1 = requests.get('your first link')
page2 = requests.get('your second link')
page3 = requests.get('your third link')
page4 = requests.get('your fourth link')
page5 = requests.get('your fifth link')
while True:
    page12 = requests.get('your first link')
    page22 = requests.get('your second link')
    page32 = requests.get('your third link')
    page42 = requests.get('your fourth link')
    page52 = requests.get('your fifth link')
    # Continue only once all five pages differ from the baseline,
    # so a half-updated clock is never printed.
    if (page1.text != page12.text and page2.text != page22.text
            and page3.text != page32.text and page4.text != page42.text
            and page5.text != page52.text):
        text = page12.text + page22.text + page32.text + page42.text + page52.text
        abc1 = text
        # abc1 = * your url for verification with * string=<"+text+">"
        if abc1 != abc:
            print(abc1)
            abc = abc1
            # The new pages become the baseline for the next comparison.
            page1 = page12
            page2 = page22
            page3 = page32
            page4 = page42
            page5 = page52
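The figures in the edit above (a 15-minute cycle with 90 codes) imply how long each code lasts, assuming the codes are evenly spaced, which the thread does not confirm. The short calculation below works that out, along with a rough per-request budget if six requests (five parts plus the final submission) have to fit inside one window; `time_budget` is a name made up for this sketch.

```python
def time_budget(cycle_minutes=15, codes_per_cycle=90, requests_needed=6):
    """Return (seconds each code lasts, seconds available per request)."""
    # Assumes every code is shown for the same length of time.
    window = cycle_minutes * 60 / codes_per_cycle
    return window, window / requests_needed


window, per_request = time_budget()
print(window)  # 10.0
```

The 10-second result matches the 10-second limit mentioned in the question.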

Final edit: I had spent so long going down that path that I made the task harder than it was and did way too much work. When submitting the final URL, include your solution as a replacement for the whole <clock pts> section, NOT inside the <>, so yours should look like https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=*this is an identifier*&string=*The string you get*

I know the answer to the question, but instead of giving the code to complete it, I'll tell you one of the ways you might find it, as I completed that question myself.

When you asked this question, you completely forgot to mention that there was a sixth link: https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string=%3Cclock%20pts%3E

Notice that the end of that hyperlink says 'clock pts', whereas all the other links have something like clock-pt1 or clock-pt4. What if 'clock pts' refers to all of the different links at once? That is, you create a string out of all the previous links you've been given, then replace the 'clock pts' in the string section of the hyperlink WITH the string you made from the separate links, which would then give you the code to complete the level.
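To make the substitution concrete, here is a small sketch that needs no network access. It decodes the placeholder at the end of the sixth link and swaps in a combined string. `fill_template` is a name invented for this example, and the sample code is one of the clock readings quoted earlier in the thread, not a known-good answer.

```python
from urllib.parse import quote, unquote

TEMPLATE = ("https://assess.joincyberdiscovery.com/challenge-files/get-flag"
            "?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string=%3Cclock%20pts%3E")
PLACEHOLDER = "%3Cclock%20pts%3E"  # percent-encoded form of "<clock pts>"


def fill_template(template, code):
    """Replace the whole placeholder, angle brackets included, with the code."""
    # quote() percent-encodes any character that is not safe in a query string.
    return template.replace(PLACEHOLDER, quote(code, safe=""))


print(unquote(PLACEHOLDER))
print(fill_template(TEMPLATE, "aBaa1aafaa2aa3a"))
```

Replacing the encoded placeholder as a whole means the `<` and `>` are removed along with the words 'clock pts'.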

Below is the code I used to get the answer. It requires the requests module, in case you want to use it. (Also, I'm not 100% certain it will work every time: the challenge is based on a timer, so the program may not fetch all the strings before the clock changes. Make sure to run the program just after the timer has reset.)

    import requests
    page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
    page1_content = requests.get(page1)
    page1text = page1_content.text

    page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
    page2_content = requests.get(page2)
    page2text = page2_content.text

    page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
    page3_content = requests.get(page3)
    page3text = page3_content.text

    page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
    page4_content = requests.get(page4)
    page4text = page4_content.text

    page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=4VjvSgWQQ8yhhiYD9cePtg%3D%3D"
    page5_content = requests.get(page5)
    page5text = page5_content.text

    code=(page1text + page2text + page3text + page4text + page5text)

    page6= "https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=j7fPvtmWLDY5qeYFuJtmKw%3D%3D&string="+code
    page6_content = requests.get(page6)
    print(page6_content.text)
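One way to reduce the timing risk mentioned above is to read the five parts twice and accept them only when both reads agree, since a read that straddles a clock change will usually not match the next one. This is a sketch under assumptions, not part of the original answer: `stable_code` is an invented name, `fetch_parts` is any zero-argument function returning the five strings as a list, and the double read costs extra requests inside the same time window.

```python
def stable_code(fetch_parts, attempts=5):
    """Return the combined code once two back-to-back reads agree."""
    for _ in range(attempts):
        first = fetch_parts()
        second = fetch_parts()
        if first == second:
            # Both reads saw the same five parts, so the clock very
            # likely did not change in the middle of a read.
            return "".join(first)
    raise RuntimeError("clock kept changing between reads")
```

The returned string can then be appended to the get-flag URL in the same way as `code` in the program above.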

I completed the challenge. I used an Excel spreadsheet with functions to get all the little pieces of code from every clock cycle and put them together to make one code every 10 seconds. Sorry if that doesn't make sense; I'm not sure how to explain it. Then I pasted this into the end of the "validation link" to replace the <clock pts> at the end of the URL. I had to do this very fast, before the clock reset. Very stressful, haha. Eventually I did it in time and it gave me the code. I hope this helps. But you'll have to figure out by yourself how to get all the codes together in under 10 seconds, otherwise this is basically cheating, right?

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original address.

 