3. DataFrame Indexing, Filtering [Pandas Beginner Tutorial 3]

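After the previous chapter, you should have a clear picture of Series and DataFrame in Pandas. In this chapter we learn how to index and filter a DataFrame.

I have uploaded the DataFrame used in this chapter to GitHub. You can load it with the following code: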

import numpy as np
import pandas as pd

csv_data_path = 'https://raw.githubusercontent.com/turingplanet/pandas-intro/main/public-datasets/country.csv'
country_info = pd.read_csv(csv_data_path)

Indexing

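Let's start with indexing. Indexing means locating specific elements of a DataFrame, such as particular rows or columns. These are the basic indexing operations:

country_info['Country'] # get the contents of the Country column
country_info[1:3] # get the 2nd and 3rd rows (positions 1 and 2)
country_info.iloc[0] # get the first row, by position
country_info.loc[0] # get the record whose index label is 0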
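
To make the difference between position and label easier to see, here is a minimal sketch on a made-up three-row table (the country names and numbers are hypothetical, not taken from country.csv). The index uses letters instead of 0 to n-1, so iloc and loc need different arguments to reach the same row:

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's country.csv
toy = pd.DataFrame(
    {"Country": ["Aland", "Borduria", "Cascadia"], "Population": [100, 200, 300]},
    index=["a", "b", "c"],  # deliberately not the default 0..n-1
)

col = toy["Country"]         # one column, returned as a Series
rows = toy[1:3]              # slice by position: the 2nd and 3rd rows
first_by_pos = toy.iloc[0]   # first row, located by position
row_by_label = toy.loc["a"]  # row whose index label is "a"

# Both lookups reach the same row, one by position and one by label
print(first_by_pos.equals(row_by_label))
```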

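Because a DataFrame's index defaults to 0 through n-1, iloc[0] and loc[0] above return the same result. To use a column as the index instead, call the set_index function:

country_info.set_index('Country') # returns a new DataFrame with Country as the index
country_info.index # RangeIndex(start=0, stop=227, step=1)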

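As you can see, even after calling set_index the index of the original DataFrame has not changed, because set_index returns a new DataFrame rather than modifying the one it is called on. To change the original, set the inplace parameter to True:

country_info.set_index('Country', inplace = True) # with inplace, the index of country_info itself is replaced
country_info.index # Index(['Afghanistan', 'Albania', ...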
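
An alternative to inplace is to keep the returned DataFrame under a new name. The sketch below uses a hypothetical two-row table to show that the original stays untouched while the returned copy carries the new index:

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's country.csv
toy = pd.DataFrame({"Country": ["Aland", "Borduria"], "Population": [100, 200]})

indexed = toy.set_index("Country")  # a new DataFrame; toy is not modified

print(list(toy.columns))    # Country is still an ordinary column in toy
print(list(indexed.index))  # the new frame is indexed by country name
```

Which style to use is mostly a matter of taste. Assigning the result keeps the original available if you need it again later.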

Now we can use the new index to locate elements:

country_info.loc[0] # raises KeyError, because the index now holds country names
country_info.loc['China'] # returns all information for China
country_info.loc['China', ['Industry', 'Climate']] # returns only China's Industry and Climate
country_info.iloc[0] # iloc still works and locates rows by position
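
The same lookups can be tried without downloading anything. This is a small sketch on invented data (the names and values are placeholders), including how the KeyError from a missing label can be caught:

```python
import pandas as pd

# Hypothetical toy data with a country-name index
toy = pd.DataFrame(
    {
        "Country": ["Aland", "Borduria"],
        "Industry": [0.2, 0.3],
        "Climate": [1, 2],
    }
).set_index("Country")

row = toy.loc["Borduria"]                           # every column for one label
subset = toy.loc["Borduria", ["Industry", "Climate"]]  # only the listed columns

missing = False
try:
    toy.loc[0]  # 0 is no longer a label in this index
except KeyError:
    missing = True

print(missing)
```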

With the new index in place, we can use it to reorder the data:

country_info.sort_index() # sort by country name in ascending order (A to Z)
country_info.sort_index(ascending = False) # sort by country name in descending order (Z to A)

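We can also use the index to delete specific elements. Note that drop returns a new DataFrame and leaves country_info itself unchanged:

country_info.drop(['Zimbabwe', 'Yemen'], axis = 0) # drop the rows labeled Zimbabwe and Yemen; axis = 0 means the labels refer to rows
country_info.drop(['Region', 'Population'], axis = 1) # drop the Region and Population columns; axis = 1 means the labels refer to columns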
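
If axis = 0 and axis = 1 are hard to remember, drop also accepts the keyword arguments index and columns. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's country.csv
toy = pd.DataFrame(
    {"Region": ["North", "South", "East"], "Population": [100, 200, 300]},
    index=["Aland", "Borduria", "Cascadia"],
)

fewer_rows = toy.drop(["Cascadia"], axis=0)  # remove a row by its label
fewer_cols = toy.drop(columns=["Region"])    # same effect as axis=1
same_rows = toy.drop(index=["Cascadia"])     # same effect as axis=0

print(toy.shape)  # the original keeps all rows and columns
```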

To reset the index back to 0 through n-1, use the reset_index function. After the reset, the old country-name index becomes a new column:

country_info.reset_index(inplace = True)
country_info.columns # Index(['Country', 'Region', 'Population', ...
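
When the old index is not worth keeping, reset_index can discard it with drop=True instead of turning it into a column. A short sketch with made-up data:

```python
import pandas as pd

# Hypothetical toy data indexed by country name
toy = pd.DataFrame(
    {"Population": [100, 200]},
    index=pd.Index(["Aland", "Borduria"], name="Country"),
)

restored = toy.reset_index()            # Country comes back as a column
discarded = toy.reset_index(drop=True)  # the old index is thrown away

print(list(restored.columns))
print(list(discarded.columns))
```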

Filtering

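With indexing covered, let's move on to filtering. Filtering means selecting the data we want by applying specific conditions:

country_info['Net migration'] == '0' # returns a Boolean Series marking the countries whose Net migration equals the string '0'
zero_migration_filter = (country_info['Net migration'] == '0')
country_info[zero_migration_filter] # get the data for all countries with zero net migration
country_info.loc[zero_migration_filter, ['Region', 'Population']] # show only the Region and Population columns for those countries

low_migration_filter = (country_info['Net migration'] < '100') # the quoted '100' makes this a string comparison, so the ordering is lexicographic, not numeric
country_info.loc[low_migration_filter, ['Region', 'Population']] # show Region and Population for countries whose Net migration sorts below '100'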
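
Comparing against a quoted value such as '100' compares text, not numbers. The sketch below shows how the two can disagree and one way to convert first. It assumes the values are stored as text with a decimal comma, and the sample values are invented for illustration:

```python
import pandas as pd

# Hypothetical values, written as text with a decimal comma (an assumption)
net_migration = pd.Series(["0", "23,06", "-4,93", "9,5"])

# Text comparison goes character by character: "9,5" < "100" is False
# because "9" sorts after "1"
as_text = net_migration < "100"

# Convert to numbers first: swap the comma for a point, then parse
as_number = pd.to_numeric(net_migration.str.replace(",", ".", regex=False))
numeric_filter = as_number < 100

print(as_text.tolist())         # [True, False, True, False]
print(numeric_filter.tolist())  # [True, True, True, True]
```

If a CSV file really does use decimal commas, passing decimal=',' to pd.read_csv is another option, since it parses such columns as numbers at load time.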

We can also combine multiple conditions in a filter:

and_filter = (country_info['Deathrate'] > '1000') & (country_info['Population'] < 10000) # AND: Deathrate greater than '1000' (a string comparison) and Population less than 10000
country_info.loc[and_filter, 'Region'] # get the Region of the countries that satisfy the AND condition
country_info.loc[(country_info['Deathrate'] > '1000') & (country_info['Population'] < 10000), 'Region'] # same as above, written inline

or_filter = (country_info['Deathrate'] > '1000') | (country_info['Population'] < 10000) # OR: True when either Deathrate is greater than '1000' or Population is less than 10000
country_info.loc[or_filter, 'Region'] # get the Region of the countries that satisfy the OR condition
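
The parentheses around each condition matter, because & and | bind more tightly than comparison operators such as > and <. Here is a minimal sketch with numeric columns on hypothetical data, so the result of each combination is easy to check by hand:

```python
import pandas as pd

# Hypothetical toy data with numeric columns
toy = pd.DataFrame(
    {
        "Region": ["North", "South", "East"],
        "Deathrate": [5.0, 12.0, 20.0],
        "Population": [5000, 20000, 8000],
    }
)

high_death = toy["Deathrate"] > 10    # [False, True, True]
small_pop = toy["Population"] < 10000  # [True, False, True]

and_filter = high_death & small_pop  # both conditions must hold
or_filter = high_death | small_pop   # at least one condition must hold

print(toy.loc[and_filter, "Region"].tolist())  # ['East']
print(toy.loc[or_filter, "Region"].tolist())   # ['North', 'South', 'East']
```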

We can also use the ~ operator for NOT. The following returns every country whose Population is not greater than 10000:

population_filter = (country_info['Population'] > 10000)
country_info.loc[~population_filter, ['Region', 'Deathrate', 'Population']]
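
One subtlety: negating "greater than" is not quite the same as "less than or equal" when values are missing. A comparison with NaN is always False, so ~ turns it into True. A small sketch with invented numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical populations, one of them missing
population = pd.Series([5000, 20000, np.nan])

not_greater = ~(population > 10000)  # the missing value ends up True
less_equal = population <= 10000     # the missing value ends up False

print(not_greater.tolist())  # [True, False, True]
print(less_equal.tolist())   # [True, False, False]
```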

We can also restrict the search to a list of values:

countries = ['China', 'Japan', 'United States', 'India']
in_filter = country_info['Country'].isin(countries) # a Boolean Series telling whether each country name is in the list
country_info.loc[in_filter, ['Country', 'Region']] # get the information for those specific countries
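
isin combines naturally with ~ to select everything outside the list, and names in the list that do not occur in the data are simply ignored. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's country.csv
toy = pd.DataFrame(
    {
        "Country": ["Aland", "Borduria", "Cascadia"],
        "Region": ["North", "South", "East"],
    }
)

wanted = ["Aland", "Cascadia", "Atlantis"]  # Atlantis is not in the data

in_filter = toy["Country"].isin(wanted)
selected = toy.loc[in_filter, ["Country", "Region"]]  # rows in the list
others = toy.loc[~in_filter, "Country"]               # rows not in the list

print(selected["Country"].tolist())  # ['Aland', 'Cascadia']
print(others.tolist())               # ['Borduria']
```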

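Pandas also supports filtering with regular expressions. By default, str.contains treats its argument as a regular expression and matches case-sensitively:

str_filter = country_info['Country'].str.contains('A')
country_info.loc[~str_filter, 'Country'] # country names that do not contain an uppercase A

str_filter2 = country_info['Country'].str.contains('A|Z')
country_info.loc[~str_filter2, 'Country'] # country names that contain neither an uppercase A nor an uppercase Z

str_filter3 = country_info['Country'].str.contains('[a-m]')
country_info.loc[~str_filter3, 'Country'] # country names that contain no lowercase letter from a to m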
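
str.contains has a few parameters that are worth knowing when a pattern does not behave as expected: case, regex, and na. The sketch below tries them on a hypothetical list of names that includes a missing value:

```python
import pandas as pd

# Hypothetical names, one of them missing
names = pd.Series(["Albania", "Brazil", "Chad", None])

# na=False treats missing names as "no match" instead of propagating them
has_upper_a = names.str.contains("A", na=False)

# case=False ignores upper/lower case
has_any_a = names.str.contains("a", case=False, na=False)

# regex=False searches for the literal text, so "." means a real dot
has_dot = names.str.contains(".", regex=False, na=False)

print(has_upper_a.tolist())  # [True, False, False, False]
print(has_any_a.tolist())    # [True, True, True, False]
print(has_dot.tolist())      # [False, False, False, False]
```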

Congratulations, that completes the basics of indexing and filtering. To finish, try the following exercise:

Homework

Filter the dataset above to get only the countries whose Population is greater than 10000 and whose Climate is greater than '3':

filter1 = (country_info['Population'] > 10000) & (country_info['Climate'] > '3')
country_info.loc[filter1, ['Country', 'Population', 'Climate']]
