Pandas Usage: A Thoroughly Detailed Tutorial

KevinLuo
6 min read · Nov 6, 2022


Dear readers and friends: keeping this blog updated takes real effort, so if you find it useful, please follow, clap, and bookmark it. Thank you very much.

1. Generate a data table

1. First import pandas. It is generally used together with numpy, so import both up front:

import numpy as np
import pandas as pd

2. Import CSV or xlsx files:

df = pd.read_csv('name.csv', header=0)  # read_csv already returns a DataFrame; header=0 uses the first row as column names
df = pd.read_excel('name.xlsx')
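If the file uses a non-default delimiter or encoding, read_csv accepts them as parameters. A minimal sketch (the file name and parameter values here are only illustrative):

df = pd.read_csv('name.csv', sep=',', encoding='utf-8')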

Or build the DataFrame manually from a delimited text file:

import codecs
import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'reply pv')
items = []

with codecs.open('reply.pv.07', 'r', 'utf-8') as f:
    for line in f:
        line_split = line.strip().split('\t')
        items.append(Item(line_split[0].strip(), line_split[1].strip()))

df = pd.DataFrame.from_records(items, columns=['reply', 'pv'])

3. Create a data table with pandas:

df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Taipei', '101', ' guangzhou ', 'Shenzhen', 'shanghai', 'America '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])

2. Data table information view

1. Dimension view:

df.shape

2. Basic information of the data table (dimensions, column names, data format, occupied space, etc.):

df.info()

3. The format of each column of data:

df.dtypes

4. The format of a single column:

df['city'].dtype

5. Check for null values:

df.isnull()

6. Check for null values in a single column:

df['city'].isnull()

7. View the unique values of a column:

df['city'].unique()

8. View the values of the data table:

df.values

9. View column names:

df.columns

10. View the first 5 rows of data and the last 5 rows of data:

df.head(5)  # the default is 5 rows
df.tail(5)

3. Data table cleaning

1. Fill null values with the number 0:

df.fillna(value=0)

2. Fill NA values with the mean of the price column:

df['price'].fillna(df['price'].mean())
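Note that fillna returns a new object rather than modifying df in place (the same applies to rename, drop_duplicates, and replace below), so assign the result back to keep it:

df['price'] = df['price'].fillna(df['price'].mean())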

3. Strip whitespace from the city field:

df['city'] = df['city'].str.strip()

4. Case conversion:

df['city'] = df['city'].str.lower()

5. Change the data format:

df['price'].fillna(0).astype('int')  # astype cannot convert NaN, so fill nulls first
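If the column may also contain values that cannot be parsed as numbers, pd.to_numeric with errors='coerce' is a gentler alternative that turns bad values into NaN instead of raising:

df['price'] = pd.to_numeric(df['price'], errors='coerce')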

6. Change the column name:

df.rename(columns={'category': 'category-size'})

7. Drop duplicate values (keeping the first occurrence by default):

df['city'].drop_duplicates()

8. Drop duplicates keeping the last occurrence instead:

df['city'].drop_duplicates(keep='last')

9. Data substitution:

df['city'].replace('sh', 'shanghai')

4. Data preprocessing

df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

1. Data table merging

1.1 merge

df_inner = pd.merge(df, df1, how='inner')  # joins on the shared 'id' column
df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')

1.2 append

result = pd.concat([df1, df2])  # DataFrame.append is deprecated (removed in pandas 2.0); pd.concat is the replacement

1.3 join

result = left.join(right, on='key')

1.4 concat

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

  • objs: a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, its sorted keys will be used as the keys argument unless keys is passed explicitly, in which case the values will be selected (see below). Any None objects are silently dropped, unless they are all None, in which case a ValueError is raised.
  • axis: {0, 1, …}, default 0. The axis to concatenate along.
  • join: {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es): 'outer' takes the union, 'inner' the intersection.
  • ignore_index: boolean, default False. If True, do not use the index values along the concatenation axis; the resulting axis will be labeled 0, …, n-1. This is useful when concatenating objects whose concatenation axis has no meaningful index information. Note that index values on the other axes are still respected in the join.
  • join_axes: a list of Index objects. Specific indexes to use for the other n-1 axes instead of performing inner/outer set logic. (Removed in newer pandas versions; use reindex instead.)
  • keys: sequence, default None. Build a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, keys should contain tuples.
  • levels: a list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
  • names: list, default None. Names for the levels in the resulting hierarchical index.
  • verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
  • copy: boolean, default True. If False, do not copy data unnecessarily.

Sample code:

frames = [df1, df2, df3]
result = pd.concat(frames)
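To see keys and ignore_index in action, here is a small sketch using the same frames:

result = pd.concat(frames, keys=['x', 'y', 'z'])  # label each source frame with an outer index level
result = pd.concat(frames, ignore_index=True)  # discard the original indexes and renumber from 0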

2. Set the index column

df_inner.set_index('id')

3. Sort by the value of a specific column:

df_inner.sort_values(by=['age'])

4. Sort by index column:

df_inner.sort_index()

5. If the value of the price column is > 3000, the group column shows high, otherwise low:

df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')

6. Group and label data matching multiple compound conditions

df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1

7. Split the values of the category field on '-', and create a table whose index matches df_inner and whose columns are named category and size

split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])

8. Merge the split table back into the original df_inner table by index

df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
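As an aside, the same split can be written with pandas' string accessor, which avoids the Python-level loop; a sketch assuming the same '-' separator:

split = df_inner['category'].str.split('-', expand=True)
split.columns = ['category', 'size']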

5. Data extraction

Three functions are mainly used: loc, iloc, and ix. loc extracts by label, iloc extracts by position, and ix could mix labels and positions (ix is deprecated and was removed in pandas 1.0, so prefer loc and iloc).
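A quick sketch of the difference: with the default integer index the two look similar, but loc matches index labels while iloc always counts positions:

df_inner.loc[3]   # the row whose index label is 3
df_inner.iloc[3]  # the fourth row by position, whatever its label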

1. Extract the value of a single row by index

df_inner.loc[3]

2. Extract a range of rows by position

df_inner.iloc[0:5]

3. Reset the index

df_inner.reset_index()

4. Set the date column as the index
df_inner = df_inner.set_index('date')

5. Extract all rows up to 2013-01-04 (using the date index set in the previous step)
df_inner[:'2013-01-04']

6. Use iloc to extract a block of data by position range
df_inner.iloc[:3,:2]

7. Use iloc to extract scattered rows and columns by position
df_inner.iloc[[0,2,5],[4,5]]  # extract rows 0, 2, 5 and columns 4, 5

8. Use ix to extract data by index label and position at the same time (ix was removed in pandas 1.0)
df_inner.ix[:'2013-01-03', :4]
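Since ix is gone from current pandas, an equivalent sketch with loc, selecting the first four columns by name through the columns index:

df_inner.loc[:'2013-01-03', df_inner.columns[:4]]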

9. Determine whether the value of the city column is Taipei
df_inner['city'].isin(['Taipei'])

10. Determine whether the city column contains Taipei and Taichung, and then extract the data that meets the criteria
df_inner.loc[df_inner['city'].isin(['Taipei', 'Taichung'])]

11. Extract the first three characters and generate a data table
pd.DataFrame(df_inner['category'].str[:3])

6. Data filtering
Filter, count, and sum the data using greater-than, less-than, and equality conditions combined with and, or, and not.

1. Use "and" to filter
df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'Taipei'), ['id','city','age','category','gender']]

2. Use "or" to filter (DataFrame.sort was removed from pandas; sort_values is the current method)
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'Taipei'), ['id','city','age','category','gender']].sort_values(['age'])

3. Use "not" to filter
df_inner.loc[(df_inner['city'] != 'Taipei'), ['id','city','age','category','gender']].sort_values(['id'])

4. Count the filtered data by the city column
df_inner.loc[(df_inner['city'] != 'Taipei'), ['id','city','age','category','gender']].sort_values(['id']).city.count()

5. Use the query function to filter
df_inner.query('city == ["Taipei", "Taichung"]')

6. Sum the price column of the filtered results
df_inner.query('city == ["Taipei", "Taichung"]').price.sum()

7. Data summary
The main functions are groupby and pivot_table.

1. Count and summarize all columns
df_inner.groupby('city').count()

2. Count the id field by city
df_inner.groupby('city')['id'].count()

3. Summarize the two fields
df_inner.groupby(['city', 'size'])['id'].count()

4. Summarize by the city field and calculate the count, total, and mean of price
df_inner.groupby('city')['price'].agg(['count', 'sum', 'mean'])
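The section intro also mentions pivot_table; a minimal sketch of the same summary written as a pivot:

pd.pivot_table(df_inner, index='city', values='price', aggfunc=['count', 'sum', 'mean'])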

8. Data statistics
Data sampling, and calculation of the standard deviation, covariance, and correlation coefficients.

1. Simple data sampling
df_inner.sample(n=3)

2. Manually set the sampling weight
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)
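weights can also name a column of df_inner, in which case rows are weighted by that column's values (a sketch):

df_inner.sample(n=2, weights='price')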


3. Sampling without replacement
df_inner.sample(n=6, replace=False)

4. Sampling with replacement
df_inner.sample(n=6, replace=True)

5. Descriptive statistics of data tables
df_inner.describe().round(2).T

6. Calculate the standard deviation of the column
df_inner['price'].std()

7. Calculate the covariance between two fields
df_inner['price'].cov(df_inner['m-point'])

8. Covariance between all fields in the data table
df_inner.cov()

9. Correlation analysis of two fields
df_inner['price'].corr(df_inner['m-point'])

10. Correlation analysis of data tables
df_inner.corr()

9. Data output
The analyzed data can be exported in xlsx or csv format.

1. Write to Excel
df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')

2. Write to CSV
df_inner.to_csv('excel_to_python.csv')
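Both writers include the index as a column by default; pass index=False to omit it:

df_inner.to_csv('excel_to_python.csv', index=False)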
