Pandas Note

September 25, 2022 1 minute read

set_index()

It replaces the index using the exiting column. Access to the row is available through loc and is accessible through the order.

train_data
train_data.set_index('item_id', drop='false') # if there is drop: then another set_index function drops the previous index columns
train_data.loc[0]

# date              2013-02-01 00:00:00
# date_block_num                      0
# shop_id                            59
# item_id                         22154
# item_price                      999.0
# item_cnt_day                      1.0
# day_of_week                    Friday
# Name: 0, dtype: object

Screen Shot 2022-09-28 at 10.04.45 PM

Reset_index()

It discards the existing index, such as being given after groupby(), and resets the index of the Data Frame. It ranges from 0 to length of data as index.

grouped  = pd.DataFrame(train_data.groupby(['shop_id', 'date_block_num'])['item_cnt_day'].sum().reset_index() )
grouped2 = pd.DataFrame(train_data.groupby(['shop_id', 'date_block_num'])['item_cnt_day'].sum())

value_counts()

This is the function for Pandas series. It returns the series with the specified values.

# 1st Example
dataFrame['item_category_name'].value_counts().plot(kind='bar',color='orange', figsize=(20,8))


# 2nd Example: It finds and counts the unique values.
index = pd.Index([3, 1, 2, 3, 4, np.nan])
index.value_counts()
# 3.0    2
# 1.0    1
# 2.0    1
# 4.0    1

Filtering DataFrame

Logical Operators

train_data[train_data["shop_id"]==0].head()

Multiple Logical Operators

train_data[ (train_data["shop_id"] == 2) & (train_data["item_cnt_day"]!=1) ].head(60)

ISIN

names_filter = ['John','Catlin','Mike']
df[df.name.isin(names_filter)]

Reference:

https://towardsdatascience.com/8-ways-to-filter-pandas-dataframes-d34ba585c1b8

pd_todatetime()

When css files are imported, it is very hard to read data time object as date format and usually it is just read as string type. Thus, pd_todatetime() helps convey string to Python Date Time format.

Series.dt.strftime()

It is used to convert specified date format. The format is specified by date_format.

sales_per_month = sales_train_df.groupby(sales_train_df['date'].dt.strftime('%B'))['item_cnt_day'].sum()
# group sold item cou

group_by()

It groups the rows with specified values.

sales_per_month = train_data.groupby(train_data['date'].dt.strftime('%B'))['item_cnt_day'].sum()

Adding new column in existing Dataframe

days = ['Monday', 'Monday', 'Tuesday', 'Thursday', 'Friday', 'Wednesday', 'Tuesday']
train_data['day_of_week'] = days # train_data did't have day_of_week column, but now it has days_of_week

Share on

Twitter Facebook LinkedIn

Ji Kim