
How to scrape and extract same specific information from multiple URLs in a list

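I want to scrape the genre and length (runtime) of each movie in a list of 250 movies. The list named 'links' contains the URLs of these 250 movie pages. I wrote code that extracts the genre and length from a single URL in that list.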

links=['https://www.imdb.com/title/tt0093603/','https://www.imdb.com/title/tt8176054/','https://www.imdb.com/title/tt0367495/','https://www.imdb.com/title/tt0048473/','https://www.imdb.com/title/tt0079221/','https://www.imdb.com/title/tt7391996/','https://www.imdb.com/title/tt0052572/','https://www.imdb.com/title/tt0237376/','https://www.imdb.com/title/tt0214915/','https://www.imdb.com/title/tt5311546/','https://www.imdb.com/title/tt7019842/','https://www.imdb.com/title/tt0105575/','https://www.imdb.com/title/tt0400234/','https://www.imdb.com/title/tt8413338/','https://www.imdb.com/title/tt12361178/','https://www.imdb.com/title/tt4991384/','https://www.imdb.com/title/tt1187043/','https://www.imdb.com/title/tt8948790/','https://www.imdb.com/title/tt0986264/','https://www.imdb.com/title/tt10189514/','https://www.imdb.com/title/tt0101649/','https://www.imdb.com/title/tt5074352/','https://www.imdb.com/title/tt9477520/','https://www.imdb.com/title/tt7060344/','https://www.imdb.com/title/tt9900782/','https://www.imdb.com/title/tt0291855/','https://www.imdb.com/title/tt0048956/','https://www.imdb.com/title/tt0085743/','https://www.imdb.com/title/tt0050870/','https://www.imdb.com/title/tt7738784/','https://www.imdb.com/title/tt5959980/','https://www.imdb.com/title/tt0059246/','https://www.imdb.com/title/tt4987556/','https://www.imdb.com/title/tt0312859/','https://www.imdb.com/title/tt0072783/','https://www.imdb.com/title/tt0119385/','https://www.imdb.com/title/tt0292246/','https://www.imdb.com/title/tt10214826/','https://www.imdb.com/title/tt7019942/','https://www.imdb.com/title/tt3417422/','https://www.imdb.com/title/tt7465992/','https://www.imdb.com/title/tt5867800/','https://www.imdb.com/title/tt6148156/','https://www.imdb.com/title/tt8239946/',
'https://www.imdb.com/title/tt0466460/','https://www.imdb.com/title/tt0459516/','https://www.imdb.com/title/tt4679210/','https://www.imdb.com/title/tt0376127/','https://www.imdb.com/title/tt0066763/','https://www.imdb.com/title/tt3973410/','https://www.imdb.com/title/tt3668162/','https://www.imdb.com/title/tt0220656/','https://www.imdb.com/title/tt6380520/','https://www.imdb.com/title/tt0195231/','https://www.imdb.com/title/tt8108198/','https://www.imdb.com/title/tt4429128/','https://www.imdb.com/title/tt2877108/','https://www.imdb.com/title/tt2181831/','https://www.imdb.com/title/tt3569782/','https://www.imdb.com/title/tt0376076/','https://www.imdb.com/title/tt1954470/','https://www.imdb.com/title/tt1620933/','https://www.imdb.com/title/tt5312232/','https://www.imdb.com/title/tt2356180/','https://www.imdb.com/title/tt0242519/','https://www.imdb.com/title/tt4934950/','https://www.imdb.com/title/tt0367110/','https://www.imdb.com/title/tt0073707/','https://www.imdb.com/title/tt2218988/','https://www.imdb.com/title/tt0871510/','https://www.imdb.com/title/tt0375611/','https://www.imdb.com/title/tt0104561/','https://www.imdb.com/title/tt0054098/','https://www.imdb.com/title/tt1562872/','https://www.imdb.com/title/tt4430212/','https://www.imdb.com/title/tt4851630/','https://www.imdb.com/title/tt5005684/','https://www.imdb.com/title/tt10324144/','https://www.imdb.com/title/tt1639426/','https://www.imdb.com/title/tt0057935/','https://www.imdb.com/title/tt7060460/','https://www.imdb.com/title/tt1280558/','https://www.imdb.com/title/tt3322420/','https://www.imdb.com/title/tt4635372/','https://www.imdb.com/title/tt0242256/','https://www.imdb.com/title/tt0200087/','https://www.imdb.com/title/tt0374887/','https://www.imdb.com/title/tt0139876/','https://www.imdb.com/title/tt0292490/','https://www.imdb.com/title/tt0105271/','https://www.imdb.com/title/tt9052870/','https://www.imdb.com/title/tt2283748/','https://www.imdb.com/title/tt0405508/','https://www.imdb.com/title/tt0364647/'
,'https://www.imdb.com/title/tt0169102/','https://www.imdb.com/title/tt1821480/','https://www.imdb.com/title/tt0109117/','https://www.imdb.com/title/tt8291224/','https://www.imdb.com/title/tt2338151/','https://www.imdb.com/title/tt2358592/','https://www.imdb.com/title/tt0453729/','https://www.imdb.com/title/tt0319736/','https://www.imdb.com/title/tt0843326/','https://www.imdb.com/title/tt2082197/','https://www.imdb.com/title/tt5571734/','https://www.imdb.com/title/tt0112553/','https://www.imdb.com/title/tt0379370/','https://www.imdb.com/title/tt8144834/','https://www.imdb.com/title/tt0488414/','https://www.imdb.com/title/tt0116630/','https://www.imdb.com/title/tt13299890/','https://www.imdb.com/title/tt0456144/','https://www.imdb.com/title/tt7822438/','https://www.imdb.com/title/tt5824826/','https://www.imdb.com/title/tt4849438/','https://www.imdb.com/title/tt0072860/','https://www.imdb.com/title/tt1695800/','https://www.imdb.com/title/tt2564144/','https://www.imdb.com/title/tt1261047/','https://www.imdb.com/title/tt0063404/','https://www.imdb.com/title/tt0471571/','https://www.imdb.com/title/tt7392212/','https://www.imdb.com/title/tt3390572/','https://www.imdb.com/title/tt0112870/','https://www.imdb.com/title/tt6315524/','https://www.imdb.com/title/tt5906392/','https://www.imdb.com/title/tt0213969/','https://www.imdb.com/title/tt2882328/','https://www.imdb.com/title/tt0050188/','https://www.imdb.com/title/tt1821317/','https://www.imdb.com/title/tt2377938/','https://www.imdb.com/title/tt7838252/','https://www.imdb.com/title/tt10919240/','https://www.imdb.com/title/tt1180583/','https://www.imdb.com/title/tt1773764/','https://www.imdb.com/title/tt3394420/','https://www.imdb.com/title/tt7725596/','https://www.imdb.com/title/tt2395469/','https://www.imdb.com/title/tt1327035/','https://www.imdb.com/title/tt3863552/','https://www.imdb.com/title/tt1649431/','https://www.imdb.com/title/tt0051792/','https://www.imdb.com/title/tt0220832/','https://www.imdb.com/title/tt1857670/',
'https://www.imdb.com/title/tt3614516/','https://www.imdb.com/title/tt7180544/','https://www.imdb.com/title/tt0296574/','https://www.imdb.com/title/tt7294534/','https://www.imdb.com/title/tt3449292/','https://www.imdb.com/title/tt11581174/','https://www.imdb.com/title/tt2585562/','https://www.imdb.com/title/tt1188996/','https://www.imdb.com/title/tt5082014/','https://www.imdb.com/title/tt3124456/',
 'https://www.imdb.com/title/tt8110330/',
 'https://www.imdb.com/title/tt0347304/',
 'https://www.imdb.com/title/tt1093370/',
 'https://www.imdb.com/title/tt2924472/',
 'https://www.imdb.com/title/tt1609168/',
 'https://www.imdb.com/title/tt6167894/',
 'https://www.imdb.com/title/tt0118751/',
 'https://www.imdb.com/title/tt7485048/',
 'https://www.imdb.com/title/tt2325915/',
 'https://www.imdb.com/title/tt0375878/',
 'https://www.imdb.com/title/tt1417299/',
 'https://www.imdb.com/title/tt7218518/',
 'https://www.imdb.com/title/tt0323013/',
 'https://www.imdb.com/title/tt8108200/',
 'https://www.imdb.com/title/tt2631186/',
 'https://www.imdb.com/title/tt0455829/',
 'https://www.imdb.com/title/tt0824316/',
 'https://www.imdb.com/title/tt0222012/',
 'https://www.imdb.com/title/tt11322920/',
 'https://www.imdb.com/title/tt3848892/',
 'https://www.imdb.com/title/tt10717738/',
 'https://www.imdb.com/title/tt4387040/',
 'https://www.imdb.com/title/tt5764096/',
 'https://www.imdb.com/title/tt0366840/',
 'https://www.imdb.com/title/tt2181931/',
 'https://www.imdb.com/title/tt1517561/',
 'https://www.imdb.com/title/tt0373856/',
 'https://www.imdb.com/title/tt2926068/',
 'https://www.imdb.com/title/tt2350496/',
 'https://www.imdb.com/title/tt1077248/',
 'https://www.imdb.com/title/tt0402014/',
 'https://www.imdb.com/title/tt13206926/',
 'https://www.imdb.com/title/tt8130968/',
 'https://www.imdb.com/title/tt0816258/',
 'https://www.imdb.com/title/tt6108090/',
 'https://www.imdb.com/title/tt4169250/',
 'https://www.imdb.com/title/tt0291376/',
 'https://www.imdb.com/title/tt2317337/',
 'https://www.imdb.com/title/tt0093578/',
 'https://www.imdb.com/title/tt7098658/',
 'https://www.imdb.com/title/tt4434004/',
 'https://www.imdb.com/title/tt1907761/',
 'https://www.imdb.com/title/tt7758160/',
 'https://www.imdb.com/title/tt0077451/',
 'https://www.imdb.com/title/tt4432480/',
 'https://www.imdb.com/title/tt1230165/',
 'https://www.imdb.com/title/tt0420332/',
 'https://www.imdb.com/title/tt3822396/',
 'https://www.imdb.com/title/tt1851988/',
 'https://www.imdb.com/title/tt5121000/',
 'https://www.imdb.com/title/tt1288638/',
 'https://www.imdb.com/title/tt0499375/',
 'https://www.imdb.com/title/tt0431619/',
 'https://www.imdb.com/title/tt2187153/',
 'https://www.imdb.com/title/tt0196069/',
 'https://www.imdb.com/title/tt2213054/',
 'https://www.imdb.com/title/tt3801314/',
 'https://www.imdb.com/title/tt1292703/',
 'https://www.imdb.com/title/tt4981966/',
 'https://www.imdb.com/title/tt1266583/',
 'https://www.imdb.com/title/tt1839596/',
 'https://www.imdb.com/title/tt0422320/',
 'https://www.imdb.com/title/tt7998242/',
 'https://www.imdb.com/title/tt2258337/',
 'https://www.imdb.com/title/tt0110222/',
 'https://www.imdb.com/title/tt0109555/',
 'https://www.imdb.com/title/tt6484982/',
 'https://www.imdb.com/title/tt4900716/',
 'https://www.imdb.com/title/tt3320542/',
 'https://www.imdb.com/title/tt7142506/',
 'https://www.imdb.com/title/tt1241195/',
 'https://www.imdb.com/title/tt8108268/',
 'https://www.imdb.com/title/tt0150433/',
 'https://www.imdb.com/title/tt2855648/',
 'https://www.imdb.com/title/tt0098999/',
 'https://www.imdb.com/title/tt0432047/',
 'https://www.imdb.com/title/tt3447364/',
 'https://www.imdb.com/title/tt1014672/',
 'https://www.imdb.com/title/tt1926313/',
 'https://www.imdb.com/title/tt5286444/',
 'https://www.imdb.com/title/tt2980794/',
 'https://www.imdb.com/title/tt8042292/',
 'https://www.imdb.com/title/tt1447500/',
 'https://www.imdb.com/title/tt0106333/',
 'https://www.imdb.com/title/tt2140465/',
 'https://www.imdb.com/title/tt0920464/',
 'https://www.imdb.com/title/tt5310090/',
 'https://www.imdb.com/title/tt7212754/',
 'https://www.imdb.com/title/tt1324059/',
 'https://www.imdb.com/title/tt3767372/',
 'https://www.imdb.com/title/tt2375559/',
 'https://www.imdb.com/title/tt6027478/',
 'https://www.imdb.com/title/tt8590896/',
 'https://www.imdb.com/title/tt0172684/',
 'https://www.imdb.com/title/tt6206564/',
 'https://www.imdb.com/title/tt0449994/']

Now I have to do this for all 250 URLs in the list. When I loop over this process, I only get the information for the last URL.

Here is the code I wrote for one URL:

import requests
from bs4 import BeautifulSoup

def get_movie_info(a_tag, div_tag):
  # returns all the required info about a movie
  span_tags1 = a_tag.find_all('span')
  genre = span_tags1[0].text.strip()
  li_tags = div_tag.find_all('li')
  length_of_film = li_tags[1].text.strip()
  return genre, length_of_film

movie_page_url = links[0]       # 1st url in the list
response = requests.get(movie_page_url)
movie_doc = BeautifulSoup(response.content, 'html.parser')  # parse the downloaded HTML

# get a tags
a_tags = movie_doc.find_all('a', attrs={'class':"GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"})

# get div tags
div_tags = movie_doc.find_all('div', attrs={'class':"TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"})

movie_dict = {
  'genre1' : [],
  'length_of_movie' : []}

a_tag = a_tags[0]
div_tag = div_tags[0]

movie_info = get_movie_info(a_tag, div_tag)
movie_dict['genre1'].append(movie_info[0])
movie_dict['length_of_movie'].append(movie_info[1])

The output is

movie_dict = {'genre1': ['Crime'], 'length_of_movie': ['2h 25min']}

The output should be a DataFrame with the columns 'genre1' and 'length_of_movie' and 250 rows holding the corresponding movie genres and lengths.
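The looping code that kept only the last URL is not shown, so the cause can only be guessed at. One common way to end up with a single row is to create the dictionary inside the loop, so each pass throws away the previous results. The sketch below illustrates that difference under this assumption. It makes no network requests: fake_scrape is a hypothetical stand-in for the request and parsing step, and last_only and all_rows are illustrative names.

```python
def fake_scrape(url):
    """Hypothetical stand-in for the real request and parsing step."""
    movie_id = url.rstrip("/").split("/")[-1]
    return f"genre-of-{movie_id}", f"length-of-{movie_id}"


urls = [
    "https://www.imdb.com/title/tt0093603/",
    "https://www.imdb.com/title/tt8176054/",
    "https://www.imdb.com/title/tt0367495/",
]

# Pitfall: the dict is re-created on every pass, so earlier rows are lost.
for url in urls:
    last_only = {"genre1": [], "length_of_movie": []}
    genre, length = fake_scrape(url)
    last_only["genre1"].append(genre)
    last_only["length_of_movie"].append(length)

# Fix: create the dict once, before the loop, and only append inside it.
all_rows = {"genre1": [], "length_of_movie": []}
for url in urls:
    genre, length = fake_scrape(url)
    all_rows["genre1"].append(genre)
    all_rows["length_of_movie"].append(length)

print(len(last_only["genre1"]))  # 1
print(len(all_rows["genre1"]))   # 3
```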

Iterate over your list of movie URLs and append each result to the dictionary's value lists. As the last step, create the DataFrame:

import pandas as pd
import requests
from bs4 import BeautifulSoup

links = [
    "https://www.imdb.com/title/tt0093603/",
    "https://www.imdb.com/title/tt8176054/",
    "https://www.imdb.com/title/tt0367495/",
    # ... rest of your URLs
]


def get_movie_info(a_tag, div_tag):
    span_tags1 = a_tag.find_all("span")
    genre = span_tags1[0].text.strip()
    li_tag = div_tag.find(lambda tag: tag.name == "li" and "min" in tag.text)
    length_of_film = li_tag.text.strip()
    return genre, length_of_film


movie_dict = {"genre1": [], "length_of_movie": []}
for movie_page_url in links:
    response = requests.get(movie_page_url)
    movie_doc = BeautifulSoup(response.content, "html.parser")

    # get a tags
    a_tags = movie_doc.find_all(
        "a",
        attrs={
            "class": "GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"
        },
    )

    # get div tags
    div_tags = movie_doc.find_all(
        "div",
        attrs={
            "class": "TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"
        },
    )

    a_tag = a_tags[0]
    div_tag = div_tags[0]

    movie_info = get_movie_info(a_tag, div_tag)
    movie_dict["genre1"].append(movie_info[0])
    movie_dict["length_of_movie"].append(movie_info[1])

df = pd.DataFrame(movie_dict)
print(df)

Prints:

      genre1 length_of_movie
0      Crime        2h 25min
1      Drama        2h 34min
2  Adventure        2h 40min
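The length_of_movie column holds strings such as "2h 25min". If a numeric column is more useful for sorting or averaging, a small helper can convert those strings to minutes. This is only a sketch: runtime_to_minutes is a hypothetical name, and it assumes every value looks like "2h 25min", "2h" or "45min", as in the output above.

```python
import re


def runtime_to_minutes(text):
    """Convert a runtime string such as '2h 25min' to total minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    # Reject strings with neither an hours part nor a minutes part.
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised runtime: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


print([runtime_to_minutes(s) for s in ["2h 25min", "2h 34min", "2h 40min"]])
# [145, 154, 160]
```

With the DataFrame from the answer, the helper could be applied with df["length_of_movie"].map(runtime_to_minutes) and stored in a new column of your choosing.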

