How do I scrape specific data from a CSV file in Python?

Web scraping has been getting the best of me for the past few days. I know the very basics of it, but I have no clue how to scrape data properly. Here's the issue: I'm supposed to use a CSV link (provided in the code below) to find out which of the Scandinavian countries (Denmark, Sweden and Norway) won the most gold medals in Curling, Skating, Skiing and Ice Hockey, counting from 2001 onwards. I wrote some very simple code, shown below (only for Curling for now; I'll add the other sports later), but it doesn't return anything. I have no idea what I'm missing, so I would truly appreciate any input on this.

import requests
from bs4 import BeautifulSoup as bs
import operator
from collections import Counter
import re

url = "https://sites.google.com/site/dr2fundamentospython/arquivos/Winter_Olympics_Medals.csv"

csv = requests.get(url).text

lines = csv.splitlines()

for l in range(1, len(lines)):
  columns = lines[l].split(',')
  #print(columns)
  medalsSweden = 0
  medalsNorway = 0
  medalsDenmark = 0
  if columns[0] > '2001' and columns[4] == 'Curling' and columns[5] == 'NOR' and columns[7] == 'Gold':
    medalsNorway += medalsNorway + 1
    print(medalsNorway)
  else:
    if columns[0] > '2001' and columns[4] == 'Curling' and columns[5] == 'SWE' and columns[7] == 'Gold':
      medalsSweden += medalsSweden + 1
      print(medalsSweden)
    else:
      if columns[0] > '2001' and columns[4] == 'Curling' and columns[5] == 'DEN' and columns[7] == 'Gold':
        medalsDenmark += medalsDenmark + 1
        print(medalsDenmark)

The columns you're checking don't match up. Looking at the header row in lines:

>>> lines[0]
'Year,City,Sport,Discipline,NOC,Event,Event gender,Medal'

The sport is columns[2] and the country is columns[4]. (You're checking 4 and 5 instead.) Also, be careful not to confuse the sport with the discipline, which is a more specific category within the sport. For example, "Skiing" is the sport and "Alpine Skiing" is a discipline within it. So each if statement should be changed to:

... and columns[2] == 'Curling' and columns[4] == 'NOR' ...
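
Applied to the whole loop, the fix might look like the minimal sketch below. The sample rows are invented for illustration (they only copy the column layout of the real file). The sketch also moves the counters out of the loop into a single dict, since the original code resets them to 0 on every row. That restructuring is an assumption about what the code was meant to do.

```python
# Sample rows in the same column layout as the real file (values are invented).
lines = [
    "Year,City,Sport,Discipline,NOC,Event,Event gender,Medal",
    "2002,Salt Lake City,Curling,Curling,NOR,curling,M,Gold",
    "2006,Turin,Curling,Curling,SWE,curling,W,Gold",
    "2006,Turin,Curling,Curling,CAN,curling,M,Gold",
    "1998,Nagano,Curling,Curling,SWE,curling,W,Bronze",
]

# Initialise the counters once, before the loop, so they are not reset per row.
medals = {"NOR": 0, "SWE": 0, "DEN": 0}

for line in lines[1:]:  # skip the header row
    columns = line.split(",")
    year = int(columns[0])   # Year
    sport = columns[2]       # Sport
    noc = columns[4]         # NOC (country code)
    medal = columns[7]       # Medal
    if year > 2001 and sport == "Curling" and medal == "Gold" and noc in medals:
        medals[noc] += 1

print(medals)
```

With the real data, lines would come from requests.get(url).text.splitlines() as in the question.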

Other improvements:

  1. When dealing with numbers like 'Year', convert them to ints rather than doing a string comparison:

     if int(columns[0]) > 2001 and...

    In this case the string comparison happens to work, since the years are all 4 digits and you're comparing them with '2001', which is also a 4-digit string.

  2. Use the csv module to parse CSV files.

  3. Restructure your if statements so that they're not repetitive. Since you're only interested in 2001 onwards and Gold medals, that check should be a top-level if. Other checks, if any, should be nested under it.
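
The three suggestions above can be combined roughly as follows. This is only a sketch: the sample data is invented, the names COUNTRIES, SPORTS and gold_counts are made up for the example, and the sport names are taken from the question, so they may need adjusting to match the values in the actual file.

```python
import csv
import io
from collections import Counter

# Invented sample data with the same header as the real file.
sample = """Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
2002,Salt Lake City,Skiing,Alpine Skiing,NOR,downhill,M,Gold
2002,Salt Lake City,Curling,Curling,NOR,curling,M,Gold
2006,Turin,Ice Hockey,Ice Hockey,SWE,ice hockey,M,Gold
2006,Turin,Skating,Speed skating,NOR,1500m,M,Silver
1998,Nagano,Skiing,Alpine Skiing,NOR,slalom,M,Gold
2006,Turin,Skiing,Alpine Skiing,AUT,slalom,M,Gold
"""

# Hypothetical names; sport spellings are assumed from the question text.
COUNTRIES = {"DEN", "SWE", "NOR"}
SPORTS = {"Curling", "Skating", "Skiing", "Ice Hockey"}

gold_counts = Counter()
reader = csv.DictReader(io.StringIO(sample))  # rows become dicts keyed by header
for row in reader:
    # Top-level check: only gold medals after 2001 are of interest.
    if int(row["Year"]) > 2001 and row["Medal"] == "Gold":
        # Nested check: restrict to the countries and sports asked about.
        if row["NOC"] in COUNTRIES and row["Sport"] in SPORTS:
            gold_counts[row["NOC"]] += 1

print(gold_counts.most_common())
```

Accessing columns by name through csv.DictReader avoids the index mix-up from the question. For the real file, io.StringIO(sample) could be replaced with io.StringIO(requests.get(url).text).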

You can handle this easily with the pandas.read_csv() function:

import pandas as pd


df = pd.read_csv(
    "https://sites.google.com/site/dr2fundamentospython/arquivos/Winter_Olympics_Medals.csv")

filt = (df['Year'] > 2001) & (df['Discipline'] == 'Curling') & (
    df['NOC'] == 'NOR') & (df['Medal'] == 'Gold')

print(df[filt])
      Year            City    Sport  ...    Event Event gender Medal
1972  2002  Salt Lake City  Curling  ...  curling            M  Gold        

[1 rows x 8 columns]
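
To cover all three countries and all four sports at once, the same filter could be widened with isin() and the result counted with groupby(). The sketch below uses invented sample data instead of the real URL, and it assumes the sport names from the question match the values in the 'Sport' column.

```python
import io

import pandas as pd

# Invented sample data with the same header as the real file.
sample = """Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
2002,Salt Lake City,Skiing,Alpine Skiing,NOR,downhill,M,Gold
2002,Salt Lake City,Curling,Curling,NOR,curling,M,Gold
2006,Turin,Ice Hockey,Ice Hockey,SWE,ice hockey,M,Gold
2006,Turin,Skating,Speed skating,NOR,1500m,M,Silver
1998,Nagano,Skiing,Alpine Skiing,NOR,slalom,M,Gold
"""

df = pd.read_csv(io.StringIO(sample))

countries = ["DEN", "SWE", "NOR"]
sports = ["Curling", "Skating", "Skiing", "Ice Hockey"]  # assumed spellings

filt = (
    (df["Year"] > 2001)
    & (df["Medal"] == "Gold")
    & df["NOC"].isin(countries)
    & df["Sport"].isin(sports)
)

# Number of matching gold-medal rows per (sport, country) pair.
gold_counts = df[filt].groupby(["Sport", "NOC"]).size()
print(gold_counts)
```

Pairs with no matching rows simply do not appear in the result.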
