6 Parsing data
Our learning objectives in this section:
- Access information using indexing by position and logical expressions
6.0.1 Indexing
Finding information, or sub-setting, is called indexing. Every item in a data frame has a position and some have names.
We select positions using [] notation after a pandas data frame.
Recall that to get a specific column, we can call it by name:
# A DataFrame, as defined previously
foo_df
# Using []
foo_df['tissue']
#> 0 Liver
#> 1 Brain
#> 2 Testes
#> 3 Muscle
#> 4 Intestine
#> 5 Heart
#> Name: tissue, dtype: object
# using . notation
foo_df.tissue
#> 0 Liver
#> 1 Brain
#> 2 Testes
#> 3 Muscle
#> 4 Intestine
#> 5 Heart
#> Name: tissue, dtype: object
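As a minimal sketch of the two access styles, the block below rebuilds a small stand-in for foo_df. The healthy and tissue columns and the first two quantity values follow the output shown in this section; the remaining quantity values are invented for illustration, since the original definition is not repeated here.

```python
import pandas as pd

# Hypothetical stand-in for foo_df; quantities after the first two rows are invented
foo_df = pd.DataFrame({
    "healthy": [True, False, False, True, True, False],
    "tissue": ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
    "quantity": [13, 88, 1233, 55, 233, 18],
})

by_bracket = foo_df["tissue"]  # works for any column name
by_dot = foo_df.tissue         # only works when the name is a valid Python identifier

# Both styles return the same Series
same = by_bracket.equals(by_dot)
print(same)  # True
```

The [] form is the safer habit: dot notation fails for column names that contain spaces or that clash with an existing DataFrame attribute or method.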
We can also select items by position. For example, we can index rows by integer position with .iloc[]:
foo_df
# First row, as a Series
foo_df.iloc[0]
# First row, as a DataFrame
foo_df.iloc[[0]]
# a list of integers, the first two rows
foo_df.iloc[[0, 1]]
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
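The difference between a single integer and a one-element list is easy to miss, so here is a small sketch that checks the returned types. The three-row frame is a hypothetical stand-in for foo_df (the third quantity value is invented).

```python
import pandas as pd

# Hypothetical three-row stand-in for foo_df (third quantity is invented)
df = pd.DataFrame({
    "healthy": [True, False, False],
    "tissue": ["Liver", "Brain", "Testes"],
    "quantity": [13, 88, 1233],
})

row_as_series = df.iloc[0]    # single integer: one row, returned as a Series
row_as_frame = df.iloc[[0]]   # list with one integer: a one-row DataFrame

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(row_as_frame.shape)            # (1, 3)
```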
To be more explicit, we can use [ <rows> , <columns> ] notation. In this case we must also use : notation to specify ranges, even if we want all rows or columns.
# To get all columns, use : after the comma
foo_df.iloc[0, :]
# a list of integers, the first two rows
foo_df.iloc[[0, 1], :]
#> healthy tissue quantity
#> 0 True Liver 13
#> 1 False Brain 88
# The first two columns, all rows
foo_df.iloc[:, :2]
#> healthy tissue
#> 0 True Liver
#> 1 False Brain
#> 2 False Testes
#> 3 True Muscle
#> 4 True Intestine
#> 5 False Heart
# A single column, all rows
foo_df.iloc[:, 1:2]
#> tissue
#> 0 Liver
#> 1 Brain
#> 2 Testes
#> 3 Muscle
#> 4 Intestine
#> 5 Heart
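The single-column example above uses the slice 1:2 rather than the integer 1. The sketch below, again on a hypothetical three-row stand-in for foo_df, shows why that matters: a slice keeps the result a DataFrame, while an integer returns a Series, and two integers return a single value.

```python
import pandas as pd

# Hypothetical three-row stand-in for foo_df (third quantity is invented)
df = pd.DataFrame({
    "healthy": [True, False, False],
    "tissue": ["Liver", "Brain", "Testes"],
    "quantity": [13, 88, 1233],
})

col_as_frame = df.iloc[:, 1:2]   # slice: DataFrame with one column
col_as_series = df.iloc[:, 1]    # integer: Series
cell = int(df.iloc[1, 2])        # row 1, column 2: a single value

print(col_as_frame.shape)    # (3, 1)
print(col_as_series.name)    # tissue
print(cell)                  # 88
```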
These simple examples show us some important concepts in indexing:
- Indexing in Python begins at 0!
- If no , is present then we retrieve rows.
- We specify a range using the : operator, as per start:end. If a , is used, the : must be included.
- The end position is exclusive, i.e. not included in the series!
- If no start or end position is given, then we take from the beginning or to the end, respectively.
- Using a negative number, e.g. -1, begins counting in the reverse direction.
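The same rules hold for ordinary Python sequences, so they can be tried out on a plain list before applying them to a data frame. A minimal sketch, using a made-up list of letters:

```python
letters = ["a", "b", "c", "d", "e", "f"]

first = letters[0]        # indexing starts at 0
middle = letters[1:3]     # end position is exclusive: positions 1 and 2 only
from_start = letters[:2]  # no start given: begin at the first item
to_end = letters[4:]      # no end given: run through the last item
last = letters[-1]        # negative numbers count from the end

print(first)       # a
print(middle)      # ['b', 'c']
print(from_start)  # ['a', 'b']
print(to_end)      # ['e', 'f']
print(last)        # f
```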
Exercise 6.1 Using foo_df, what commands would I use to get:
- The 2nd to 3rd rows?
- The last 2 rows?
- A random row in foo_df?
- From the 4th to the last row? But without hard-coding, i.e. regardless of how many rows my data frame contains
Exercise 6.2 List all the possible objects we can use inside iloc[], e.g. when can we use:
- Integers?
- Floats?
- Characters?
- A heterogeneous list?
- A homogeneous list?
Although we typically just use seq[start:end], the complete notation is seq[start:end:step]. The step tells us at what interval to sample from the sequence. We can reverse the sequence using [::-1].
range(10)[::2]
#> range(0, 10, 2)
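Slicing a range returns another range, which hides the actual values. Wrapping the result in list() makes the effect of the step visible, as in this short sketch:

```python
numbers = range(10)

every_second = list(numbers[::2])     # start to end, in steps of 2
backwards = list(numbers[::-1])       # a step of -1 reverses the sequence
every_third = list(numbers[1:8:3])    # start, end and step combined

print(every_second)  # [0, 2, 4, 6, 8]
print(backwards)     # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(every_third)   # [1, 4, 7]
```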
Exercise 6.3 (Indexing at intervals) Use indexing to obtain only the odd rows, and then only the even rows, of the foo_df data frame.
6.1 Logical Expressions
So far, so good! We saw that we can find information by name and by position. But the real power comes from using logical expressions!
Logical expressions simply mean asking and combining “yes/no” questions. If you think about it, “yes/no” questions are all that computers understand: every computation eventually boils down to them.
6.1.1 Relational Operators
Relational operators ask “yes/no” questions. In place of yes and no, Python uses the boolean type, True/False (equivalent to 1/0), to provide a positive or negative answer. There are really only 6 kinds of “yes/no” questions. The six relational operators, together with the negation operator ~, are listed in table 6.1:
Operator | Description |
---|---|
< | Less than |
<= | Less than or equal to |
> | Greater than |
>= | Greater than or equal to |
== | Exactly equal to |
!= | Not equal to, i.e. the opposite of == |
~x | Not x (element-wise logical negation of a boolean Series or array) |
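To connect the operators in the table to data frames, the sketch below applies one of them to a pandas Series. The Series and its values are made up for illustration. The comparison is evaluated element by element and yields a boolean Series, which can then be placed inside [] to keep only the True positions.

```python
import pandas as pd

# Hypothetical measurements, invented for this example
quantity = pd.Series([13, 88, 1233, 55])

mask = quantity >= 88        # one True/False answer per element
negated = ~mask              # flips every answer
selected = quantity[mask]    # keeps only the elements where mask is True

print(mask.tolist())      # [False, True, True, False]
print(negated.tolist())   # [True, False, False, True]
print(selected.tolist())  # [88, 1233]
print(True == 1)          # True
```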
6.2 Exercises for Parsing data
For the following exercises, find all rows in foo_df that contain:
Exercise 6.4 Subset for boolean data:
- Only “healthy” samples.
- Only “unhealthy” samples.
Exercise 6.5 Subset for numerical data:
- Only low quantity samples, those below 100.
- Quantity between 100 and 1000.
- Quantity below 100 and beyond 1000.
Exercise 6.6 Subset for strings:
- Only “heart” samples.
- “Heart” and “liver” samples
- Everything except “intestines”
6.3 Other common operators
There are some other typical operators that you’ll encounter, but we won’t go into more detail here.
6.3.1 Using in / not in
Operator | Meaning |
---|---|
in | True if value/variable is found in the sequence |
not in | True if value/variable is not found in the sequence |
cities = ['Munich', 'Paris', 'Amsterdam', 'Madrid', 'Istanbul']
dist = [584, 1054, 653, 2301, 2191]
'Paris' in cities
#> True
'Tehran' in cities
#> False
'Paris' not in cities
#> False
'Tehran' not in cities
#> True
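One caveat is worth a sketch when moving from lists to pandas: on a Series, in tests the index labels rather than the values. The example below uses a shortened, made-up version of the cities list; for an element-wise membership test on the values, the isin() method is one option.

```python
import pandas as pd

cities = pd.Series(["Munich", "Paris", "Amsterdam"])

in_labels = "Paris" in cities            # checks index labels 0, 1, 2
label_found = 1 in cities                # 1 is an index label
in_values = "Paris" in cities.tolist()   # checks the actual values
membership = cities.isin(["Paris", "Madrid"]).tolist()  # one answer per element

print(in_labels)    # False
print(label_found)  # True
print(in_values)    # True
print(membership)   # [False, True, False]
```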
6.3.2 Using and / or
The evaluation using the and and or operators follows these rules:
- and and or evaluate expressions from left to right.
- With and, if all values are True, the last evaluated value is returned. If any value is False, the first False value is returned.
- With or, the first True value is returned. If all are False, the last value is returned.
Operator | Meaning |
---|---|
x and y | Returns x if x is False, y otherwise |
x or y | Returns y if x is False, x otherwise |
'Paris' in cities and 'Tehran' not in cities
#> True
'Paris' in cities and 'Tehran' in cities
#> False
'Paris' in cities or 'Tehran' not in cities
#> True
'Paris' in cities or 'Tehran' in cities
#> True
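The table says that and and or return one of their operands rather than always returning True or False. This only becomes visible with non-boolean values, so here is a short sketch using numbers, strings and an empty list (all values are arbitrary):

```python
a = 0 and 5        # 0 counts as False, so and returns it without looking at 5
b = 3 and 5        # 3 counts as True, so and returns the second value
c = 0 or 5         # 0 counts as False, so or returns the second value
d = 3 or 5         # 3 counts as True, so or returns it without looking at 5
e = "" or "n/a"    # an empty string counts as False

empty = []
f = empty and empty[0]  # the right side is never evaluated, so no IndexError

print(a, b, c, d)  # 0 5 5 3
print(e)           # n/a
print(f)           # []
```

On a pandas Series, and and or raise a ValueError because the truth value of a whole Series is ambiguous. Element-wise conditions are combined with & and | instead, with each comparison wrapped in parentheses.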