import pandas as pd
\
quote from Wed McKinzey book.
Easy Questions
create a dataframe with uniform and normal…
- axis=0 is columns
- `data.isnull().sum() is null generates a boolean dataframe of same shape sum is an aggregate method which sums across the default axis of columns (axis=1) rows are axis=0 results in a pandas series (1 dimensional dataframe)
What is a DataFrame object?
Seeing the actual data is always important. What are some methods to get a quick sight of your raw data?
.head()
.info()
DataFrame Creation: How would you create a DataFrame from a CSV file in Pandas?
The following is the first 5 rows of a dataset alex-boba.csv. Alex visits various boba tea shops, sampling drinks with diverse characteristics such as price, volume, sugar level, and whether they are organic, and subsequently rates each drink on a scale of 1 to 10.
| Price ($) | Volume (ml) | Sugar Level (%) | Organic | Rating (out of 10) |
|---|---|---|---|---|
| 4.74 | 399 | 1.9 | No | 10 |
| 5.58 | 477 | 61.8 | Yes | 4 |
| 5.01 | 543 | 61.2 | No | 1 |
| 4.72 | 585 | 61.7 | Yes | 6 |
| 4.12 | 447 | 94.4 | Yes | 1 |
↳ How would you load in the dataset in pandas, and then see a view of the dataset like in the table above in order to validate it?
boba = pd.read_csv('alex-boba.csv')
boba.head() # 5 by default↳ Alex is a pretty careful guy, but just in case, you want to validate that their are no missing entries in his boba dataset. How can you do this?
boba.info()This displays a summary of the boba dataset. It lists each column, along with the non-null count, and the datatype. The output might look something like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Price ($) 20 non-null float64
1 Volume (ml) 20 non-null int64
2 Sugar Level (%) 20 non-null float64
3 Organic 20 non-null object
4 Rating (out of 10) 20 non-null int64
dtypes: float64(2), int64(2), object(1)
memory usage: 928.0+ bytes↳ Read the output above. Are there any nulls in the dataset? What is the datatype of values in the “Organic” field?
- There are no nulls, since every column has a
non-nullcount of20. - The ‘Organic’ field has an ‘object’ datatype, which typically corresponds to strings or mixed types.
- The other variables are
float64andint64. Read here if you don’t know what that means.
↳ What does boba.isnull().sum() do? What would the expected output for this dataset look like, given that there are no null values?
The expression boba.isnull().sum() performs two operations on the boba DataFrame:
-
boba.isnull(): This creates a new DataFrame of the same shape asboba, where each element isTrueif the corresponding element inbobais a missing value (NaNorNone), andFalseotherwise. -
.sum(): This method then sums theseTrue(treated as1) andFalse(treated as0) values for each column. Essentially, it counts the number of missing values (NaNorNone) in each column.
Given that there are no null values in the dataset, the expected output would be a Series where each entry corresponds to a column in the boba DataFrame, and the value for each column would be 0, indicating that there are no missing values in any column. The output would look something like this:
Price ($) 0
Volume (ml) 0
Sugar Level (%) 0
Organic 0
Rating (out of 10) 0
dtype: int64↳ In pandas, what is the default axis for methods such as .sum(). What even is an axis in pandas?
- In Pandas, the default axis for the
.sum()method isaxis=0. - This means that the method will sum the data along the vertical axis, or in other words, it will sum data down rows for each column.
- When applied to a DataFrame, it returns the sum of each column.
Okay, now that the boba dataset is all loaded in, explain how you could replace the values in the boba rating column . It should be bad if the boba rating is less than 6, and good otherwise.
bins = (0, 6.5, 10)
bin_names = ['bad', 'good']
boba['rating'] = pd.cut(boba['rating'], bins=bins, labels=bin_names)↳ How can you confirm that the ratings column has been transformed into only ‘bad’ and ‘good’ values? How can you verify that good is ranked higher than bad?
boba['rating'].unique() This will output:
['bad', 'good']
Categories (2, object): ['bad' < 'good']Showing that the column only contains bad and good values. Pandas is also super helpful, and shows the internal order of the categories (although this is optional). Remember, categorical data can be ordinal, meaning that the categories have an ‘ordering’.
This ordering can be important in certain operations, like sorting or when the categorical data is used in a machine learning model, as it dictates how the categories are interpreted in relation to each other.
Clearly, in this case, it is important that 'bad' < 'good'.
If you need to select a single column from a DataFrame, what syntax would you use?
↳ How would you filter rows in a DataFrame where a specific column’s value is greater than a certain threshold?
↳ What command would you use to check for missing values in a DataFrame?
↳ How can you fill missing values in a DataFrame with the mean of their respective columns?
Fill missing data
↳ What is the command to group data in a DataFrame and calculate the sum of a specific column for each group?
Data aggregation
↳ Can you demonstrate how to perform an inner join between two DataFrames in Pandas?
↳ How would you sort a DataFrame by the values in a particular column?
↳ How can you set a specific column as the index of a DataFrame?
↳ After setting a new index, how can you reset it back to the default integer index?
↳ How do you select rows based on index labels using the .loc accessor?
↳ How do you select rows based on integer index using the .iloc accessor?
↳ What is the syntax for selecting DataFrame rows where a certain condition is met?
↳ How do you find the unique values in a DataFrame column?
↳ What command would you use to count the occurrences of each unique value in a DataFrame column?
↳ Given a Pandas Series, how would you replace all values greater than a specified threshold with a fixed value, and then calculate the mean of this modified Series?
↳ How can you quickly generate descriptive statistics for each column in a DataFrame?
↳ If you need to add a new column to a DataFrame based on values from existing columns, what approach would you take?
↳ Finally, how do you export a DataFrame to a CSV file?
What is .json
JavaScript Object Notation
↳ How is json related to JavaScript?
Not at all
What are some ways to get some basic information about a dataframe?
How can you see each feature in a dataset along with the number of non-null values, and their datatypes
Hard Questions
How can you return a DataFrame containing all duplicated items?
df.duplicated().drop_duplicates()
Walk through the 3 steps in receiving json data from an API, and saving to a file in human readable .json
I have a list of tickets. Every ticket can have any number of ‘tags’ (e.g is_important). List three different ways to represent your tickets in a pandas DataFrame.
- String of comma-separated values
If I knew the number of tags per ticket wasn’t very large, I would choose to store it as space or comma-separated string.
1. Storing Tags as Comma-Separated Strings:
ticket_id tags
1 "tag1, tag2, tag3"
2 "tag2, tag3"
3 "tag1, tag3"
- Boolean Flags for Each Tag
If I knew I was going to perform analysis on individual tags, such as aggregating or using it as input into a machine learning model, I would create a separate column, or boolean flag, for each tag.
Then I would indicate the existence of a tag using True or False. This would result in a very wide DataFrame.
2. Boolean Flags for Each Tag:
ticket_id tag1 tag2 tag3
1 True True True
2 False True True
3 True False True
- Multi-hot Encoded Columns
This is essentially the same idea as boolean flags, but instead of True/False, use 1 and 0. This would be useful for analysis and machine learning applications further down the pipeline.
3. Multi-hot Encoded Columns:
ticket_id tag1 tag2 tag3
1 1 1 1
2 0 1 1
3 1 0 1
- Separate rows for each tag
If the ticket_id didn’t have to be uniquely identifying, then I could simply create a tag column and then use a new row for each new tag assigned to a ticket. In this case, each ticket will appear multiple times in the DataFrame, once for each of its tags.
4. Separate rows for each tag:
ticket_id tag
1 tag1
1 tag2
1 tag3
2 tag2
2 tag3
3 tag1
3 tag3
Explain the difference between using loc and iloc to index a DataFrame.
Indexing
Simply mean searching, splicing, accessing certain rows/columns.
Success
.ilocIndex based using
0, 1, 2, 3, ...like how traditional lists are indexed in Python
Success
.locLabel based using the worded name of the row/column. E.g the column
'passive_income'
Given a DataFrame df, how would you select the first 5 rows and all columns?
df.iloc[0:5, : ]##### ↳ How would you select the first 3 columns and all rows?
df.iloc[ : ,0:3]↳ How would you select the first 3 rows and first 5 columns?
df.iloc[0:5,0:3]↳ How would you select rows named 'a' through 'e', and all columns?
df.loc['a':'e', :]↳ How would you select all columns 'col_1' to 'col_3' and all rows?
df.iloc[ : ,'col_1':'col_3']↳ How would you select all rows named 'a' through 'e' and all columns 'col_1' to 'col_3'?
df.iloc['a':'e','col_1':'col_3']