Pandas
framework
python
data science
Pandas is a Python library for data wrangling and data cleaning.
- Basic concept
- Each observation is in its own row (and preferably has an index-value). Each feature/variable is in its own column and has a unique name.
- DataFrames
-
my_df = pd.DataFrame(data=numpy_array, index=[0, 1, 2], columns=["col1", "col2", "col3"])
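A minimal sketch of the constructor call, assuming a hypothetical 3x3 array for numpy_array (the note does not define it). Each row of the array becomes one observation and each column gets a unique name:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 array standing in for numpy_array
numpy_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

my_df = pd.DataFrame(
    data=numpy_array,
    index=[0, 1, 2],                   # one label per row
    columns=["col1", "col2", "col3"],  # one unique name per column
)

print(my_df.shape)          # (number of rows, number of columns)
print(list(my_df.columns))  # the column names
```

The length of index must match the number of rows and the length of columns the number of columns, otherwise pandas raises a ValueError.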
Reshaping data
- From multiple columns to two columns
-
The first result column (named variable by default) holds the former column names; the second (named value) holds the values.
pd.melt(my_df)
- Spread rows into columns
-
The values of col1 become the new column names; the values of col2 fill the corresponding columns.
my_df.pivot(columns='col1', values='col2')
- Append one DataFrame to another
-
# below / vertically
pd.concat([my_df1, my_df2])
# to the right / horizontally
pd.concat([my_df1, my_df2], axis=1)
- Reset the index to row numbers
-
my_df.reset_index()
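The reshaping calls in this section can be tried together on a small table. This is only a sketch: the frame named wide and its columns a and b are made up for illustration, and drop=True on reset_index is an optional extra that the note does not use.

```python
import pandas as pd

# Hypothetical wide table: one row per observation
wide = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# melt: former column names go to "variable", cell values go to "value"
long = pd.melt(wide)

# pivot: the values of "variable" become column names again.
# No index is given, so the existing row labels are kept and cells
# without a matching row/column pair are NaN.
spread = long.pivot(columns="variable", values="value")

# concat: stack vertically (default) or side by side (axis=1)
stacked = pd.concat([wide, wide])
side_by_side = pd.concat([wide, wide], axis=1)

# reset_index: replace the repeated labels 0, 1, 0, 1 with 0..3.
# drop=True discards the old index instead of keeping it as a column.
renumbered = stacked.reset_index(drop=True)

print(long)
print(spread)
print(renumbered)
```

Without drop=True, reset_index() keeps the old labels in a new column called index, which is what the plain call in the note does.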
Slice DataFrames
- First and last rows
-
my_df.head(n=10)
my_df.tail(n=10)
- Sample several rows
-
my_df.sample(n=10)
- Select specific rows
-
my_df.iloc[10:20]
- Select specific columns
-
my_df["col1"]
my_df[["col1", "col2"]]
- Select rows based on condition
-
my_df[my_df['col1'] > 42]
- Select rows based on a list of values in a column
-
my_df[my_df['col1'].isin(['value1', 'value2'])]
- Remove duplicate rows
-
my_df.drop_duplicates()
- Select rows excluding certain values in a column
-
my_df[~my_df['col1'].isin(['value1', 'value2'])]
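A short sketch that runs the slicing and filtering calls from this section on one frame. The 30-row data set is invented for illustration, and random_state is an added assumption so that the sample is repeatable:

```python
import pandas as pd

# Hypothetical data: 30 rows, col1 cycles through three labels
my_df = pd.DataFrame({
    "col1": ["value1", "value2", "value3"] * 10,
    "col2": range(30),
})

first = my_df.head(n=10)    # rows at positions 0..9
last = my_df.tail(n=10)     # rows at positions 20..29
random_rows = my_df.sample(n=10, random_state=0)  # seeded random sample
middle = my_df.iloc[10:20]  # positions 10..19, the end is exclusive

one_col = my_df["col1"]             # single name -> Series
two_cols = my_df[["col1", "col2"]]  # list of names -> DataFrame

big = my_df[my_df["col2"] > 20]  # boolean mask keeps rows where it is True
wanted = my_df[my_df["col1"].isin(["value1", "value2"])]
others = my_df[~my_df["col1"].isin(["value1", "value2"])]  # ~ inverts the mask

# One row per distinct value of col1
unique_labels = my_df[["col1"]].drop_duplicates()

print(len(big), len(wanted), len(others), len(unique_labels))
```

Note that iloc selects by position, while a boolean mask selects by content; both return new objects and leave my_df unchanged.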
Data Manipulation
- Replace values in a column
-
my_df['col1'] = my_df['col1'].replace(['old_value1', 'old_value2'], ['new_value1', 'new_value2'])
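As an illustration of how the two lists pair up, here is a small sketch with made-up data. The dict form at the end is an alternative spelling that the note does not mention:

```python
import pandas as pd

# Hypothetical column with two values to replace and one to leave alone
my_df = pd.DataFrame({"col1": ["old_value1", "old_value2", "other"]})

# Lists are matched by position:
# old_value1 -> new_value1, old_value2 -> new_value2
my_df["col1"] = my_df["col1"].replace(
    ["old_value1", "old_value2"], ["new_value1", "new_value2"]
)

# Same replacement written as a mapping, shown on a separate Series
same = pd.Series(["old_value1", "old_value2", "other"]).replace(
    {"old_value1": "new_value1", "old_value2": "new_value2"}
)

print(my_df["col1"].tolist())
print(same.tolist())
```

Values that appear in neither list (here "other") pass through unchanged.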
Data Aggregation
- Group by a column and calculate mean of another column
-
my_df.groupby('col1')['col2'].mean()
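A minimal sketch of the group-and-average pattern, using invented data with two groups:

```python
import pandas as pd

# Hypothetical data: col1 is the grouping key, col2 the numeric value
my_df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": [1.0, 3.0, 10.0, 20.0],
})

# One mean per distinct value of col1; the result is a Series
# whose index holds the group keys
means = my_df.groupby("col1")["col2"].mean()

print(means)
```

The result can be turned back into a regular two-column DataFrame with means.reset_index().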
Data Visualization
- Plot a histogram of a column
-
my_df['col1'].plot.hist()
- Plot a scatter plot of two columns
-
my_df.plot.scatter(x='col1', y='col2')
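Both plotting calls rely on matplotlib being installed, since pandas uses it as its default plotting backend. The sketch below draws the two plots side by side; the data, the Agg backend, the bins=3 setting and the explicit axes are all assumptions made for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric data
my_df = pd.DataFrame({
    "col1": [1, 2, 2, 3, 3, 3],
    "col2": [2, 4, 5, 6, 7, 9],
})

# Two axes in one figure, passed explicitly to each plot call
fig, (hist_ax, scatter_ax) = plt.subplots(1, 2, figsize=(8, 3))

my_df["col1"].plot.hist(bins=3, ax=hist_ax)               # distribution of col1
my_df.plot.scatter(x="col1", y="col2", ax=scatter_ax)     # col2 against col1

fig.tight_layout()
```

In an interactive session, drop the matplotlib.use line and call plt.show() to display the figure, or call fig.savefig with a file name to write it to disk.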