Python: print the csv column value once before its output results instead of repeating it on every line

Question

I have a Python script that imports a list of URLs from a CSV named list.csv, scrapes them, and outputs any anchor text and href links found on each URL from the CSV:

(For reference, the URLs in the CSV are all in column A.)

from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import pandas
import csv

contents = []
with open('list.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url) # Add each url to list contents
    

for url in contents: 
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")

    for link in soup.find_all('a'):
        if len(link.text)>0:
            print(url, link.text, '-', link.get('href'))

The output looks something like this, where https://www.example.com/csv-url-one/ and https://www.example.com/csv-url-two/ are the URLs in column A of the CSV:

['https://www.example.com/csv-url-one/'] Creative - https://www.example.com/creative/
['https://www.example.com/csv-url-one/'] Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/'] PPC - https://www.example.com/ppc/
['https://www.example.com/csv-url-two/'] SEO - https://www.example.com/seo/

The issue is that I want the output to look more like this, i.e. not print the URL from the CSV before every result, and have a line break after the results for each line of the CSV:

['https://www.example.com/csv-url-one/'] 
Creative - https://www.example.com/creative/
Web Design - https://www.example.com/web-design/

['https://www.example.com/csv-url-two/'] 
PPC - https://www.example.com/ppc/
SEO - https://www.example.com/seo/

Is this possible?

Thanks

Answer 1

Does the following solve your problem?

for url in contents: 
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    print('\n','********',', '.join(url),'********','\n')
    for link in soup.find_all('a'):
        if len(link.text)>0:
            print(link.text, '-', link.get('href'))
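
A side note on why the brackets show up in the question's output and why ', '.join(url) is used above: csv.reader yields each row as a list, so url is a one-element list rather than a string. A minimal sketch, assuming the file has one URL per row in the first column; the helper name first_column and the in-memory sample standing in for list.csv are made up for illustration:

```python
import csv
import io


def first_column(csv_file):
    """Return the first cell of every non-empty row as a plain string."""
    return [row[0] for row in csv.reader(csv_file) if row]


# In-memory stand-in for list.csv (assumption: one URL per row, column A)
sample = io.StringIO(
    "https://www.example.com/csv-url-one/\n"
    "https://www.example.com/csv-url-two/\n"
)

urls = first_column(sample)
for url in urls:
    print(url)  # a plain string, so no ['...'] brackets in the output
```

With a real file, open('list.csv', newline='') can be passed in place of the StringIO object.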

Answer 2

It is possible.

Simply add \n at the end of print. \n is the special newline character.

for url in contents: 
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")

    for link in soup.find_all('a'):
        if len(link.text)>0:
            print(url, ('\n'), link.text, '-', link.get('href'), ('\n'),)
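
For anyone unsure how print treats '\n', here is a small sketch comparing '\n' passed as an ordinary argument with '\n' passed through the end parameter. The helper captured is hypothetical and only exists to make the exact output visible:

```python
import io
from contextlib import redirect_stdout


def captured(*args, **kwargs):
    """Run print() with the given arguments and return what it wrote."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        print(*args, **kwargs)
    return buffer.getvalue()


# '\n' as a positional argument is joined with the default sep=' ',
# so a space ends up on each side of the newline.
as_argument = captured("url", "\n", "text")
print(repr(as_argument))  # 'url \n text\n'

# end= replaces the single trailing newline, here giving one blank line.
as_end = captured("url", end="\n\n")
print(repr(as_end))  # 'url\n\n'
```

Note that a print call placed inside the inner loop still runs once per link, whatever newline characters it contains.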

Answer 3

To add a separation between URLs, add a \n before printing each URL.

If you want to print a URL only if it has valid links, i.e. links where len(link.text)>0, use the for loop to save the valid links to a list, and only print the URL and its links if this list is not empty.

Try this:

for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")

    # Keep the whole <a> tag so both .text and .get('href') are available later
    valid_links = []
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            valid_links.append(link)

    # Print the url only when the page has at least one valid link
    if valid_links:
        print('\n', url)
        for item in valid_links:
            print(item.text, '-', item.get('href'))
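
The same idea can be tried without any network access. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's html.parser so it runs with no third-party packages, feeds it inline HTML instead of fetched pages, and the names AnchorCollector, valid_links, report and the csv-url-three URL are all made up for the example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element with non-empty text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._inside = False
        self._href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._inside:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            text = "".join(self._text_parts)
            if len(text) > 0:  # same filter as len(link.text) > 0
                self.links.append((text, self._href))
            self._inside = False


def valid_links(html):
    """Return the (text, href) pairs found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def report(pages):
    """Build one block per URL: the URL once, then its links.

    URLs without any valid link are skipped; blocks are separated
    by a blank line.
    """
    blocks = []
    for url, html in pages:
        links = valid_links(html)
        if links:
            lines = [url] + [f"{text} - {href}" for text, href in links]
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Inline HTML standing in for the scraped pages (assumption for the demo)
pages = [
    ("https://www.example.com/csv-url-one/",
     '<a href="https://www.example.com/creative/">Creative</a>'
     '<a href="https://www.example.com/empty/"></a>'),
    ("https://www.example.com/csv-url-three/", "<p>No links here</p>"),
    ("https://www.example.com/csv-url-two/",
     '<a href="https://www.example.com/ppc/">PPC</a>'),
]

print(report(pages))
```

Building the whole report as a string before printing keeps the grouping logic separate from the fetching, which makes it easier to check the output format on its own.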

3 answers

Answer 1
1 Accepted 2021-03-11 14:34:34

Answer 2
0 2021-03-11 14:29:20

Answer 3
0 2021-03-11 14:52:39
