Compiling a list of links from pastebin with Python

Question

I'm currently trying to extract links from a pastebin page using Python. So far I have:

from bs4 import BeautifulSoup
import re
import requests
from random import randint
import time
from lxml import etree
from time import sleep
import random

a = requests.get('http://pastebin.com/JGM3p9c9')
scrape = BeautifulSoup(a.text, 'lxml')
linkz = scrape.find_all("textarea", {"id":"paste_code"})

rawlinks = str(linkz)
partition1 = rawlinks.partition('\\r')[0]
links = partition1.partition('">')[-1]

I can't seem to get Python to collect all of the links in http:// format, only the first one. The regular expressions I found online didn't work.

End goal: I'm trying to put the links into a list so that I can send a request to every link in that list.

Answer 1

First, you don't need to extract the whole tag and convert it to str. A better approach is:

#                      `next` to extract content within tag v
#    instead use `find` v                                   v
>>> my_links = scrape.find("textarea", {"id":"paste_code"}).next

my_links will hold the value:

u'http://www.walmart.com\r\nhttp://www.target.com\r\nhttp://www.lowes.com\r\nhttp://www.sears.com'

To convert this string into the desired list of links, split it on \r\n:

>>> my_links.split('\r\n')
[u'http://www.walmart.com', u'http://www.target.com', u'http://www.lowes.com', u'http://www.sears.com']
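As a variation on the split above, here is a minimal sketch that uses str.splitlines() instead of splitting on a fixed '\r\n'. The sample string is a hypothetical stand-in for the textarea content, with a bare '\n' and a trailing line break added to show why this can be more forgiving. It is one possible approach, not the only one.

```python
# Hypothetical sample: the textarea content shown above, plus a bare "\n"
# and a trailing line break to show that mixed line endings are handled.
my_links = (
    "http://www.walmart.com\r\n"
    "http://www.target.com\n"
    "http://www.lowes.com\r\n"
    "http://www.sears.com\r\n"
)

# splitlines() breaks on "\r\n", "\n" and "\r" alike and does not produce a
# trailing empty string; the filter drops blank or whitespace-only lines.
links = [line.strip() for line in my_links.splitlines() if line.strip()]

print(links)
```

With split('\r\n'), the same input would leave 'http://www.target.com\nhttp://www.lowes.com' glued together and add an empty string at the end.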

Answer 2

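You need to walk through a few layers of the HTML, but I took a look at the pastebin page and I think this code will find what you want (sorry for switching a couple of modules, these are just the ones I use):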

from bs4 import BeautifulSoup
import urllib.request

a = urllib.request.urlopen('http://pastebin.com/JGM3p9c9')
scrape = BeautifulSoup(a, 'html.parser')

x1 = scrape.find_all('div', id = 'selectable')
for x2 in x1:
    x3 = x2.find_all('li')
    for x4 in x3:
        x5 = x4.find_all('div')
        for x6 in x5:
            print(x6.string)

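Next time you need to scrape something specific, I suggest looking at the site's HTML by right-clicking and choosing "Inspect Element". You can also do: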

print(scrape.prettify())

to get a better idea of how the HTML is nested.

Answer 3

Forget about using BS to parse the HTML. In this case you can fetch the contents of the PasteBin directly and turn the whole thing into a one-liner.

import requests
links = [link.strip() for link in requests.get('http://pastebin.com/raw/JGM3p9c9').text.split('\n')]

You can also split on \r\n.
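Since the question mentions regular expressions and the goal of sending a request to every link, here is a hedged sketch of both steps. The names raw_text, URL_PATTERN, extract_links and request_all are hypothetical, and the sample text stands in for what the raw pastebin URL would return. The pattern simply treats any run of non-whitespace characters after http:// or https:// as a link, which is an assumption that suits a paste with one URL per line.

```python
import re

# Hypothetical sample standing in for the text of the raw paste.
raw_text = (
    "http://www.walmart.com\r\n"
    "http://www.target.com\r\n"
    "\r\n"
    "http://www.lowes.com\r\n"
    "http://www.sears.com\r\n"
)

# http:// or https:// followed by any run of non-whitespace characters.
# "\r" and "\n" count as whitespace, so each match stops at the line end.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_links(text):
    """Return every URL-looking token in text, in order of appearance."""
    return URL_PATTERN.findall(text)


def request_all(links, get):
    """Call get(url) for each link and collect the results keyed by URL.

    get is any callable that takes a URL; a failure for one link is stored
    as the exception object so the remaining links are still tried.
    """
    results = {}
    for url in links:
        try:
            results[url] = get(url)
        except Exception as exc:
            results[url] = exc
    return results


links = extract_links(raw_text)
print(links)
```

Passing requests.get as the get argument, as in request_all(links, requests.get), would issue the real requests. Keeping it as a parameter lets the loop be tried out without touching the network.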

3 answers

Answer 1
1 Accepted 2017-01-13 22:43:58

Answer 2
1 2017-01-13 22:55:35

Answer 3
1 2017-01-13 23:03:55
