Chapter 5 Parsing data

Our learning objective in this session is to understand how to access information in a data frame, both by position (indexing) and with logical expressions.

5.0.1 Indexing

Finding information, or sub-setting, is called indexing. Every item in a data frame has a position and some have names.

We select positions using [] notation after a pandas data frame.

Recall that to get a specific column, we can call it by name:

# A DataFrame, as defined previously
foo_df
# Using []
foo_df['tissue']
## 0        Liver
## 1        Brain
## 2       Testes
## 3       Muscle
## 4    Intestine
## 5        Heart
## Name: tissue, dtype: object
# using . notation
foo_df.tissue
## 0        Liver
## 1        Brain
## 2       Testes
## 3       Muscle
## 4    Intestine
## 5        Heart
## Name: tissue, dtype: object
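
To run the examples in this chapter yourself, a stand-in for foo_df can be rebuilt roughly as sketched below. The healthy and tissue columns, and four of the quantity values, are taken from the output shown in this chapter. The quantity values for rows 2 and 3 never appear here, so the numbers used for them are assumptions.

```python
import pandas as pd

# Stand-in for foo_df. The quantities for rows 2 and 3 (1233 and 55) are
# assumed placeholders; all other values match the output in this chapter.
foo_df = pd.DataFrame({
    "healthy": [True, False, False, True, True, False],
    "tissue": ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
    "quantity": [13, 88, 1233, 55, 233, 18],
})

# [] and . notation return the same Series
by_bracket = foo_df["tissue"]
by_dot = foo_df.tissue
print(by_bracket.equals(by_dot))
## True
```

The . notation only works when the column name is a valid Python name that does not clash with an existing DataFrame attribute, so [] is the safer habit.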

We can also select items by position. For example, we can select rows by their integer position with .iloc[]:

foo_df

# First row, as a Series
foo_df.iloc[0] 

# First row, as a DataFrame
foo_df.iloc[[0]] 
# a list of integers, the first two rows
foo_df.iloc[[0, 1]] 
##    healthy tissue  quantity
## 0     True  Liver        13
## 1    False  Brain        88
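
The difference between an integer and a list of integers inside .iloc[] is easy to miss, so here is a small sketch that checks the type of each result. It rebuilds a stand-in foo_df in which the quantities for rows 2 and 3 are assumed values.

```python
import pandas as pd

# Stand-in for foo_df; quantities for rows 2 and 3 are assumed placeholders
foo_df = pd.DataFrame({
    "healthy": [True, False, False, True, True, False],
    "tissue": ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
    "quantity": [13, 88, 1233, 55, 233, 18],
})

row_series = foo_df.iloc[0]    # an integer gives one row as a Series
row_frame = foo_df.iloc[[0]]   # a one-item list gives a one-row DataFrame

print(type(row_series).__name__)
## Series
print(type(row_frame).__name__)
## DataFrame
print(row_series.shape, row_frame.shape)
## (3,) (1, 3)
```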

More explicitly, we can use [ <rows> , <columns> ] notation. In this form we state both positions, using a bare : when we want all rows or all columns.

# To get all columns, use : after the comma
foo_df.iloc[0, :] 
# a list of integers, the first two rows
foo_df.iloc[[0, 1], :]
##    healthy tissue  quantity
## 0     True  Liver        13
## 1    False  Brain        88
# The first two columns, all rows
foo_df.iloc[:,:2]
##    healthy     tissue
## 0     True      Liver
## 1    False      Brain
## 2    False     Testes
## 3     True     Muscle
## 4     True  Intestine
## 5    False      Heart
# A single column, all rows
foo_df.iloc[:,1:2]
##       tissue
## 0      Liver
## 1      Brain
## 2     Testes
## 3     Muscle
## 4  Intestine
## 5      Heart
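
As a further sketch of the [ <rows> , <columns> ] notation, the example below picks out a single value and a small block of rows and columns. It again uses a stand-in foo_df whose quantities for rows 2 and 3 are assumed values.

```python
import pandas as pd

# Stand-in for foo_df; quantities for rows 2 and 3 are assumed placeholders
foo_df = pd.DataFrame({
    "healthy": [True, False, False, True, True, False],
    "tissue": ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
    "quantity": [13, 88, 1233, 55, 233, 18],
})

# A single integer on each side of the comma returns one value
print(foo_df.iloc[4, 2])
## 233

# A list on each side picks out specific rows and columns
corner = foo_df.iloc[[0, 5], [1, 2]]
print(corner.shape)
## (2, 2)

# An integer column position gives a Series; a slice gives a DataFrame
print(type(foo_df.iloc[:, 1]).__name__)
## Series
print(type(foo_df.iloc[:, 1:2]).__name__)
## DataFrame
```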

These simple examples show us some important concepts in indexing:

  • Indexing in Python begins at 0!
  • If no , is present then we retrieve rows
  • We specify a range using the : operator, as start:end. In [ <rows> , <columns> ] notation, a : on its own stands for all rows or all columns.
  • The end position is exclusive, i.e. not included in the result!
  • If no start position is given, the range begins at the first item; if no end position is given, it runs to the last item.
  • A negative number counts from the end, e.g. -1 is the last position.
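
These rules can be tried out on an ordinary Python list before applying them to a data frame. The sketch below uses the tissue names from foo_df as a plain list; tissue_df is a one-column data frame built only for this illustration.

```python
import pandas as pd

tissues = ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"]

print(tissues[0])     # counting starts at 0
## Liver
print(tissues[0:2])   # the end position, 2, is excluded
## ['Liver', 'Brain']
print(tissues[:2])    # no start: begin at the first item
## ['Liver', 'Brain']
print(tissues[1:])    # no end: run to the last item
## ['Brain', 'Testes', 'Muscle', 'Intestine', 'Heart']
print(tissues[-1])    # negative numbers count from the end
## Heart

# The same slice rules apply to rows inside .iloc[]
tissue_df = pd.DataFrame({"tissue": tissues})
print(tissue_df.iloc[0:2].equals(tissue_df.iloc[[0, 1]]))
## True
```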

Exercise 5.1 Using foo_df, what commands would I use to get:

  • The 2nd to 3rd rows?
  • The last 2 rows?
  • A random row in foo_df?
  • From the 4th to the last row? But without hard-coding, i.e. regardless of how many rows my data frame contains

Exercise 5.2 List all the possible objects we can use inside iloc[]

For example, when can we use:

  • Integers?
  • Floats?
  • Characters?
  • A heterogeneous list?
  • A homogeneous list?

Although we typically just use seq[start:end], the complete notation is seq[start:end:step]. The step sets the interval at which we sample from the sequence. A step of -1 reverses the sequence, i.e. [::-1].

range(10)[::2]
## range(0, 10, 2)
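
Slicing a range returns another range rather than the values themselves, so the effect of the step is easier to see on a list. A small sketch, where digits is just an illustrative name:

```python
digits = list(range(10))

evens = digits[::2]          # every second item, starting at position 0
print(evens)
## [0, 2, 4, 6, 8]
print(digits[1:8:3])         # from position 1 up to (not including) 8, in steps of 3
## [1, 4, 7]
print(digits[::-1])          # a step of -1 reverses the sequence
## [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

# Wrapping the sliced range in list() shows its values
print(list(range(10)[::2]))
## [0, 2, 4, 6, 8]
```

The same start:end:step slices can be used inside .iloc[].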

Exercise 5.3 (Indexing at intervals) Use indexing to obtain only the odd rows, and then only the even rows, from the foo_df data frame.

5.1 Logical Expressions

So far, so good! We saw that we can find information by name and by position. But the real power comes from using logical expressions!

Logical expressions simply mean asking and combining “yes/no” questions. If you think about it, all that computers understand are Yes/No questions. At the end of the day, all computations eventually boil down to “yes/no” questions.

5.1.1 Relational Operators

Relational operators ask “yes/no” questions. In place of yes and no, Python uses the Boolean type, with the values True and False (which behave like 1 and 0 in arithmetic), to give a positive or negative answer. There are really only 6 kinds of “yes/no” questions. The relational operators are listed in Table 5.1:

Table 5.1: A summary of relational operators.
Operator   Description
<          Less than
<=         Less than or equal to
>          Greater than
>=         Greater than or equal to
==         Exactly equal to
!=         Not equal to, i.e. the opposite of ==

5.1.2 Logical Operators

A collection of logical expressions can be combined using the following logical operators.

Table 5.2: A summary of logical operators in Python. x and y are Boolean Series, e.g. the output from relational operators.
Operator   Description
x | y      x OR y
x & y      x AND y
~x         NOT x (logical negation)
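
To see what these operators produce, the sketch below applies them to columns of a stand-in foo_df and prints the resulting True/False values. The quantities for rows 2 and 3 are assumed values, so the results for those two rows are illustrative only.

```python
import pandas as pd

# Stand-in for foo_df; quantities for rows 2 and 3 are assumed placeholders
foo_df = pd.DataFrame({
    "healthy": [True, False, False, True, True, False],
    "tissue": ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
    "quantity": [13, 88, 1233, 55, 233, 18],
})

# A relational operator compares every value and returns a Boolean Series
is_large = foo_df.quantity > 50
is_brain = foo_df.tissue == "Brain"
print(is_large.tolist())
## [False, True, True, True, True, False]

# Logical operators combine Boolean Series element by element
print((is_large & is_brain).tolist())
## [False, True, False, False, False, False]
print((is_large | is_brain).tolist())
## [False, True, True, True, True, False]
print((~is_brain).tolist())
## [True, False, True, True, True, True]

# True counts as 1, so sum() counts the "yes" answers
print(is_large.sum())
## 4
```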

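5.1.3 Conditional sub-setting

Placing a logical expression inside [] returns only the rows for which the expression is True. When combining expressions with | or &, wrap each one in parentheses:
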
foo_df[foo_df.quantity == 233]
##    healthy     tissue  quantity
## 4     True  Intestine       233
foo_df[(foo_df.tissue == "Heart") | (foo_df.quantity == 233)]
##    healthy     tissue  quantity
## 4     True  Intestine       233
## 5    False      Heart        18
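
The parentheses matter because | and & bind more tightly than == in Python. One way to sidestep the issue, sketched below, is to store each logical expression under its own name first. The names is_heart and is_233 are illustrative, and the stand-in foo_df uses assumed quantities for rows 2 and 3.

```python
import pandas as pd

# Stand-in for foo_df; quantities for rows 2 and 3 are assumed placeholders
foo_df = pd.DataFrame({
    "healthy": [True, False, False, True, True, False],
    "tissue": ["Liver", "Brain", "Testes", "Muscle", "Intestine", "Heart"],
    "quantity": [13, 88, 1233, 55, 233, 18],
})

# Each logical expression is a Boolean Series that can be stored and reused
is_heart = foo_df.tissue == "Heart"
is_233 = foo_df.quantity == 233

# Rows where at least one expression is True
either = foo_df[is_heart | is_233]
print(either.index.tolist())
## [4, 5]

# Rows where both expressions are True (none in this data)
both = foo_df[is_heart & is_233]
print(len(both))
## 0
```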

5.2 Exercises for Parsing data

For the following exercises, find all rows in foo_df that contain:

Exercise 5.4 Subset for boolean data:

  • Only “healthy” samples.
  • Only “unhealthy” samples.

Exercise 5.5 Subset for numerical data:

  • Only low quantity samples, those below 100.
  • Quantity between 100 and 1000.
  • Quantity below 100 or above 1000.

Exercise 5.6 Subset for strings:

  • Only “Heart” samples.
  • “Heart” and “Liver” samples.
  • Everything except “Intestine”.

5.3 Other common operators

There are some other typical operators that you’ll encounter, but we won’t go into more detail here.

5.3.1 Using in / not in

Operator   Meaning
in         True if value/variable is found in the sequence
not in     True if value/variable is not found in the sequence

cities = ['Munich', 'Paris', 'Amsterdam', 'Madrid', 'Istanbul']
dist = [584, 1054, 653, 2301, 2191]

'Paris' in cities
## True
'Tehran' in cities
## False
'Paris' not in cities
## False
'Tehran' not in cities
## True
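
A common use of in is to check that a value exists before looking it up. The sketch below assumes that dist holds one value per city, in the same order as cities, which the lists above suggest but do not state; distance_to is a hypothetical helper written for this illustration. The last part shows a point to watch with pandas: on a Series, in tests the index labels rather than the values.

```python
import pandas as pd

cities = ['Munich', 'Paris', 'Amsterdam', 'Madrid', 'Istanbul']
dist = [584, 1054, 653, 2301, 2191]  # assumed to line up with cities by position

def distance_to(city):
    """Hypothetical helper: the distance for a listed city, otherwise None."""
    if city not in cities:
        return None
    return dist[cities.index(city)]

print(distance_to('Paris'))
## 1054
print(distance_to('Tehran'))
## None

# On a pandas Series, `in` checks the index labels, not the values
city_series = pd.Series(cities)
print('Paris' in city_series)
## False
print(1 in city_series)
## True

# isin() tests the values instead
found = bool(city_series.isin(['Paris']).any())
print(found)
## True
```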

5.3.2 Using and / or

Evaluation using the and and or operators follows these rules:

  • and and or evaluate expressions from left to right.
  • With and, if all values are True, the last evaluated value is returned. If any value is False, the first False value is returned.
  • With or, the first True value is returned. If all values are False, the last value is returned.

Operator   Meaning
x and y    Returns x if x is False, y otherwise
x or y     Returns y if x is False, x otherwise

'Paris' in cities and 'Tehran' not in cities
## True
'Paris' in cities and 'Tehran' in cities
## False
'Paris' in cities or 'Tehran' not in cities
## True
'Paris' in cities or 'Tehran' in cities
## True
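
Because and and or return one of their operands, the result is not always True or False. The sketch below shows this with a few arbitrary values, and then shows why these operators cannot stand in for & and | on a pandas Series. The names denominator, safe, flags and outcome are illustrative only.

```python
import pandas as pd

# and / or return one of their operands, not necessarily True or False
print(3 and 5)
## 5
print(0 and 5)
## 0
print(0 or 'fallback')
## fallback
print(3 or 'fallback')
## 3

# Evaluation stops as soon as the answer is known,
# so the division below is never attempted
denominator = 0
safe = denominator != 0 and 10 / denominator > 1
print(safe)
## False

# and / or need a single yes/no answer, which a Series with several
# elements cannot give; use & and | for element-wise combinations
flags = pd.Series([True, False])
try:
    flags and flags
    outcome = "no error"
except ValueError:
    outcome = "ValueError"
print(outcome)
## ValueError
```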

5.4 Wrap-up

In this chapter we added to our data science knowledge by understanding how to parse information in a data frame. We can use [] to parse according to name, position or Boolean list.