简体   繁体   中英

How to get specific information from a csv file using pandas?

I want to extract specific information from this csv file .

I need make a list of days with lowest visibility and give an overview of other time parameters for those days in tabular form.

I have tried to use

data = pandas.read_csv('Weather_2012.csv')
data.nsmallest(5, 'Visibility(km)')

but it returns several values for the same day. I don't know if I'm doing it correctly, since I need the list of days.

I also need the total number of foggy days. I have filtered all rows containing fog:

data.loc[data['Weather'].str.contains('Fog')]['Weather']

but I don't know how to count the number of days, I can only count the number of rows.

You're looking for DataFrame.resample . Based on a specific column, it will group the rows of the dataframe by a specific time interval.

First you need to do this, if you haven't already:

data['Date/Time'] = pd.to_datetime(data['Date/Time'])

Get the lowest 5 days of visibility:

>>> df.resample(rule='D', on='Date/Time')['Visibility (km)'].mean().nsmallest(5)
Date/Time
2012-03-01    2.791667
2012-03-14    5.350000
2012-12-27    6.104167
2012-01-17    6.433333
2012-02-01    6.795833
Name: Visibility (km), dtype: float64

Basically what that does is this:

  1. Groups all the rows by day
  2. Converts each group to the average value of all the Visibility (km) items for that day
  3. Returns the 5 smallest

Count the number of foggy days

>>> df.resample(rule='D', on='Date/Time').apply(lambda x: x['Weather'].str.contains('Fog').any()).sum()
78

Basically what that does is this:

  1. Groups all the rows by day
  2. For each day, adds a True if any row inside that day contains 'Fog' in the Weather column, False otherwise
  3. Counts how many True 's there were, and thus the number of foggy days.

This will get you an array of all unique foggy days. you can use the shape method to get its dimension

 df[df["Weather"].apply(lambda x : "Fog" in x)]["Date/Time"].unique()

I need make a list of days with lowest visibility and give an overview of other time parameters for those days in tabular form.

Since your Date/Time column represents a particular hour , you'll need to do some grouping to get the minimum visibility for a particular day . The following will find the 5 least-visible days.

# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date

# Group on the new "Date" column and get the minimum values of
# each column for each group.
>>> min_by_day = data.groupby("Date").min()

# Now we can use nsmallest, since 1 row == 1 day in min_by_day.
# Since `nsmallest` returns a pandas.Series with "Date" as the index,
# we have to use `.index` to pull the date objects from the result.
>>> least_visible_days = min_by_day.nsmallest(5, "Visibility (km)").index

Then you can limit your original dataset to the least-visible days with

data[data["Date"].isin(least_visible_days)]

I also need the total number of foggy days.

We can use the extracted date in this case too:

# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date

# Filter on hours which have foggy weather
>>> foggy = data[data["Weather"].str.contains("Fog")]

# Count number of unique days
>>> len(foggy["Date"].unique())

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM