Aggregating, Grouping, and Data Cleaning [Pandas Beginner Tutorial 5]

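After the previous four chapters, you should be comfortable with the basic DataFrame operations (indexing, filtering, sorting, and adding, deleting, querying, and modifying data). This chapter turns to data analysis. We first use common aggregating functions and grouping to analyze a DataFrame, and then look at how to handle missing values.

First, load the dataset used throughout this chapter: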

import numpy as np
import pandas as pd

csv_data_path = 'https://raw.githubusercontent.com/turingplanet/pandas-intro/main/public-datasets/small_survey_results.csv'
survey_df = pd.read_csv(csv_data_path) # This dataset contains 1,000 randomly sampled responses from the StackOverflow user survey (for the meaning of each column, see: https://github.com/turingplanet/pandas-intro/blob/main/public-datasets/survey_results_schema.csv)

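Aggregating Functions

An aggregating function combines the values from many rows into a single summary statistic, such as a median, a minimum, or a maximum.

Here are a few simple aggregating functions:

survey_df['ConvertedComp'].count() # Number of rows where ConvertedComp is not null
survey_df['ConvertedComp'].median() # Median salary: 48000.0
survey_df.median(numeric_only=True) # Median of every numeric column (numeric_only=True skips text columns, which recent pandas versions no longer drop automatically)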
Respondent       32890.0
Age                 29.0
CompTotal        56000.0
ConvertedComp    48000.0
WorkWeekHrs         40.0
dtype: float64

survey_df.describe() # describe returns a more detailed statistical summary (count, mean, std, min, max, percentiles)
        Respondent         Age     CompTotal   ConvertedComp  WorkWeekHrs
count   1000.00000  697.000000  5.170000e+02      517.000000   628.000000
mean   33125.63900   30.810617  3.567064e+06    85067.394584    40.547771
std    18973.33677    9.145613  5.058570e+07   142891.442494    13.103536
min       59.00000   13.000000  1.000000e+00        0.000000     3.000000
25%    16665.25000   24.000000  1.500000e+04    19800.000000    40.000000
50%    32890.00000   29.000000  5.600000e+04    48000.000000    40.000000
75%    50005.25000   36.000000  1.200000e+05    88068.000000    44.000000
max    65633.00000   67.000000  1.120000e+09  1000000.000000   160.000000
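
To see how missing values affect these functions, here is a minimal sketch on a made-up toy DataFrame (toy_df and its values are assumptions for illustration, not part of the survey data). It shows that count ignores NaN, median skips NaN, and numeric_only=True limits the calculation to numeric columns.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only (not taken from the survey)
toy_df = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Age': [25.0, 30.0, np.nan, 41.0],
    'Salary': [40000.0, np.nan, 52000.0, 90000.0],
})

non_null_ages = toy_df['Age'].count()               # NaN is not counted
median_salary = toy_df['Salary'].median()           # NaN is skipped
numeric_medians = toy_df.median(numeric_only=True)  # only Age and Salary are included

print(non_null_ages)
print(median_salary)
print(numeric_medians)
```

This also explains why the count row in the describe output differs from column to column: each column has a different number of non-null values.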

The value_counts function counts how often each distinct value occurs:

survey_df['Hobbyist'].value_counts() # Distribution of values in the Hobbyist column
Yes    786
No     213
Name: Hobbyist, dtype: int64

survey_df['OpSys'].value_counts() # Distribution of operating systems; results are sorted by count in descending order automatically
Windows        411
Linux-based    241
MacOS          199
BSD              2
Name: OpSys, dtype: int64

survey_df['OpSys'].value_counts(normalize=True) # Normalize the counts into proportions of the non-null values
Windows        0.481829
Linux-based    0.282532
MacOS          0.233294
BSD            0.002345
Name: OpSys, dtype: float64
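
By default value_counts leaves missing values out, which is likely why the Hobbyist counts above add up to 999 rather than 1,000. The sketch below, on a made-up Series (os_series is an assumption for illustration), shows the dropna=False option for including NaN in the counts.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
os_series = pd.Series(['Windows', 'Linux', 'Windows', np.nan, 'MacOS', 'Windows'])

counts = os_series.value_counts()                       # NaN is excluded by default
counts_with_nan = os_series.value_counts(dropna=False)  # NaN gets its own entry
shares = os_series.value_counts(normalize=True)         # fractions of the non-null values

print(counts)
print(counts_with_nan)
print(shares)
```

Because normalize=True divides by the number of non-null values, the proportions always sum to 1 even when the column has gaps.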

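Group By

With aggregating functions covered, let's move on to grouping. Grouping splits the rows into groups according to the values of a chosen column, which allows a finer-grained analysis.

Use groupby to dig deeper:

survey_df['Country'].value_counts() # Number of users from each country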
United States                           182
India                                   141
Germany                                  55
United Kingdom                           52
Brazil                                   36
                                       ... 
Somalia                                   1
Venezuela, Bolivarian Republic of...      1
Sierra Leone                              1
Maldives                                  1
Peru                                      1
Name: Country, Length: 96, dtype: int64

country_groups = survey_df.groupby('Country') # Group the DataFrame by Country (a plain column name lets get_group accept a plain key)
country_groups.get_group('China') # Rows for users whose Country is China
country_groups.get_group('India') # Rows for users whose Country is India
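
To make the idea of a group more concrete, here is a minimal sketch on a made-up DataFrame (toy_df and its values are assumptions for illustration). It shows the size of each group and that get_group selects the same rows a boolean filter would.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    'Country': ['China', 'India', 'China', 'Germany', 'India'],
    'OpSys': ['Windows', 'Linux', 'MacOS', 'Linux', 'Windows'],
})

groups = toy_df.groupby('Country')
group_sizes = groups.size()                       # number of rows in each group
china_rows = groups.get_group('China')            # rows belonging to the 'China' group
same_rows = toy_df[toy_df['Country'] == 'China']  # equivalent boolean filter

print(group_sizes)
print(china_rows)
```

The advantage of groupby over a filter is that it prepares all groups at once, so one aggregating call can cover every country.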

Combine groupby with aggregating functions to analyze the dataset:

survey_df.groupby(['Country']).count() # Non-null count of every column, per country
survey_df.groupby(['Country']).sum(numeric_only=True) # Sum of every numeric column, per country

Use a filter to analyze a specific country:

country_filter = (survey_df['Country'] == 'China')
survey_df.loc[country_filter, 'OpSys'].value_counts() # Operating-system distribution among users from China

country_filter = (survey_df['Country'] == 'United States')
survey_df.loc[country_filter, 'OpSys'].value_counts() # Operating-system distribution among users from the United States

survey_df.groupby(['Country'])['OpSys'].value_counts() # Operating-system distribution for every country
survey_df.groupby('Country').get_group('China')['OpSys'].value_counts() # Operating-system distribution among users from China

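Use groupby to get the information for a specific country:

country_groups['OpSys'].value_counts().head(10) # Operating-system distribution, broken down by country
country_groups['OpSys'].value_counts().loc['China'] # Operating-system distribution among users from China
country_groups['OpSys'].value_counts().loc['India'] # Operating-system distribution among users from India
country_groups['OpSys'].value_counts(normalize=True).loc['United States'] # Operating-system distribution among users from the United States (as proportions)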

Next, use an aggregating function to analyze salaries by country:

country_groups['ConvertedComp'].median() # Median salary in each country
country_groups['ConvertedComp'].median().loc['Germany'] # Median salary of users from Germany

Compute several aggregating functions at once:

country_groups['ConvertedComp'].agg(['median', 'mean']) # Median and mean salary in each country
country_groups['ConvertedComp'].agg(['median', 'mean']).loc['Japan'] # Median and mean salary of users from Japan
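
When several statistics are computed together, it can help to name the output columns explicitly. The following is a minimal sketch using keyword arguments to agg on a made-up DataFrame (toy_df and the names median_comp, mean_comp, and respondents are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    'Country': ['Japan', 'Japan', 'Germany', 'Germany', 'Germany'],
    'ConvertedComp': [50000.0, 70000.0, 60000.0, np.nan, 80000.0],
})

# Each keyword becomes a column name; each value is the aggregating function to apply
summary = toy_df.groupby('Country')['ConvertedComp'].agg(
    median_comp='median',
    mean_comp='mean',
    respondents='count',  # counts non-null salaries only
)

print(summary)
```

Including a count next to the median is useful here, because a median based on one or two respondents says little about a country.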

Real-World Application

Next, let's apply what we have learned to find out how many people in each country have used Python.

country_filter = survey_df['Country'] == 'United States' # Select the rows whose country is United States
survey_df.loc[country_filter, 'LanguageWorkedWith'].str.contains('Python') # Boolean Series showing whether each of these users has used Python
survey_df.loc[country_filter, 'LanguageWorkedWith'].str.contains('Python').sum() # Number of these users who have used Python

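Then use groupby to see how many people in each country have used Python:

country_groups = survey_df.groupby(['Country'])
country_groups['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum()) # Number of users in each country who have used Python

Next, combine several statistics to get a richer picture:

country_respondents = survey_df['Country'].value_counts().rename('Country') # Number of respondents per country (naming the Series explicitly keeps the column called 'Country'; pandas 2.x would otherwise name it 'count')
country_uses_python = country_groups['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum()) # Number of users in each country who have used Python
concated_df = pd.concat([country_respondents, country_uses_python], axis='columns', sort=False) # Combine the two statistics (concat is covered in the next chapter)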
                           Country  LanguageWorkedWith
United States                  182                  87
India                          141                  48
Germany                         55                  22
United Kingdom                  52                  25
Brazil                          36                  11
...                            ...                 ...
Armenia                          1                   0
Kuwait                           1                   0
Congo, Republic of the...        1                   0
Uruguay                          1                   1
Estonia                          1                   1

Next, rename the columns and create a new column holding the percentage of users who have used Python:

concated_df.rename(columns = {'Country': 'NumOfUsers', 'LanguageWorkedWith': 'NumOfPythonUsers'}, inplace = True) # Rename the columns
concated_df['KnowsPython%'] = (concated_df['NumOfPythonUsers'] / concated_df['NumOfUsers']) * 100 # New column: percentage of users in each country who have used Python

Then sort the DataFrame by a chosen column:

concated_df.sort_values(by = 'KnowsPython%', ascending = False) # Sort by percentage, highest first
num_filter = concated_df['NumOfPythonUsers'] > 10 # Keep only countries with more than 10 Python users
concated_df.loc[num_filter].sort_values(by = 'NumOfPythonUsers', ascending = False) # Among those countries, sort by number of Python users, highest first
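
The same statistics can also be computed in a single groupby step. The following is a minimal sketch on a made-up DataFrame (toy_df and its values are assumptions for illustration). It builds a boolean Series first and then aggregates it by country, which is an alternative to the apply and concat approach above.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    'Country': ['India', 'India', 'Brazil', 'Brazil', 'Brazil'],
    'LanguageWorkedWith': ['Python;SQL', 'Java', 'Python', np.nan, 'C#;Python'],
})

# na=False treats a missing answer as "has not used Python"
uses_python = toy_df['LanguageWorkedWith'].str.contains('Python', na=False, regex=False)

python_stats = uses_python.groupby(toy_df['Country']).agg(
    NumOfUsers='size',        # all respondents in the country, including missing answers
    NumOfPythonUsers='sum',   # True counts as 1
)
python_stats['KnowsPython%'] = python_stats['NumOfPythonUsers'] / python_stats['NumOfUsers'] * 100

print(python_stats)
```

Whether a missing answer should count as a non-user is a judgment call. Dropping those rows first would give the percentage among people who answered the question.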

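Data Cleaning

Now let's look at data cleaning. First, take a small slice of the data that contains missing values (NaN):

small_survey_df = survey_df.loc[:5, ['Respondent', 'Hobbyist', 'Age', 'CompFreq', 'CompTotal']].copy() # .copy() lets us modify this slice later without affecting survey_df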
   Respondent Hobbyist   Age CompFreq  CompTotal
0       20900      Yes   NaN  Monthly     8000.0
1       28235       No  45.0  Monthly   670000.0
2       26082      Yes  23.0   Yearly    65000.0
3       19890      Yes  61.0      NaN        NaN
4       16393      Yes  25.0   Weekly        NaN
5       46721       No  48.0   Yearly   130000.0

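Then handle the missing values with the following functions:

small_survey_df.dropna() # Drop every row that contains a NaN
small_survey_df.dropna(axis = 'index', how = 'any') # Same as above: axis='index' and how='any' are the defaults
small_survey_df.dropna(axis = 'index', how = 'all') # Drop rows in which every column is NaN
small_survey_df.dropna(axis = 'columns', how = 'any') # Drop columns that contain any NaN
small_survey_df.dropna(axis = 'columns', how = 'all') # Drop columns that are entirely NaN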

The subset parameter restricts the check to specific columns:

small_survey_df.dropna(axis = 'index', how='all', subset=['CompFreq', 'CompTotal']) # Look only at CompFreq and CompTotal; drop rows where both are NaN
small_survey_df.dropna(axis = 'index', how='any', subset=['CompFreq', 'CompTotal']) # Drop rows where either CompFreq or CompTotal is NaN
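
To compare these options side by side, here is a minimal sketch on a made-up DataFrame (toy_df and its values are assumptions for illustration). It also shows the thresh parameter of dropna, which keeps rows that have at least a given number of non-null values.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    'Age': [np.nan, 45.0, 61.0, 25.0],
    'CompFreq': ['Monthly', 'Monthly', np.nan, 'Weekly'],
    'CompTotal': [8000.0, 670000.0, np.nan, np.nan],
})

# Drop rows where either of the two columns is NaN
any_dropped = toy_df.dropna(subset=['CompFreq', 'CompTotal'], how='any')
# Drop rows only where both of the two columns are NaN
all_dropped = toy_df.dropna(subset=['CompFreq', 'CompTotal'], how='all')
# Keep rows that have at least 2 non-null values across all columns
at_least_two = toy_df.dropna(thresh=2)

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(at_least_two.index.tolist())
```

None of these calls modifies toy_df. Each returns a new DataFrame, so the result has to be assigned to a variable to be kept.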

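We can also convert common missing-value markers (such as 'NA' or 'Missing') to np.nan first and then apply the missing-value operations. In this example, 'No' stands in for such a marker:

small_survey_df.replace('No', np.nan, inplace = True) # Convert 'No' to a missing value
small_survey_df.dropna() # Drop the rows that originally contained 'No' (along with any other row containing NaN)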

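Here are some other functions for working with NaN values:

small_survey_df.isna() # Check whether each value in the DataFrame is NaN; the result contains True or False
small_survey_df.fillna('Missing') # Fill every missing value with 'Missing'
small_survey_df.fillna(0) # Fill every missing value with 0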
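
Filling every column with the same value is rarely what we want when the columns have different types. As a minimal sketch on a made-up DataFrame (toy_df, its values, and the fill value 'Unknown' are assumptions for illustration), fillna also accepts a dictionary that maps column names to fill values, and isna().sum() gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    'Age': [np.nan, 45.0, 23.0],
    'CompFreq': ['Monthly', np.nan, 'Yearly'],
})

missing_per_column = toy_df.isna().sum()  # True counts as 1, so this counts NaN per column

# Use a different fill value for each column
filled = toy_df.fillna({
    'Age': toy_df['Age'].median(),  # numeric column: fill with the median
    'CompFreq': 'Unknown',          # text column: fill with a label
})

print(missing_per_column)
print(filled)
```

Filling with the median is only one possible choice. Whether it is appropriate depends on the analysis that follows.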

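Handling missing values is an important part of data cleaning. Another important part is dealing with inconsistent data types, as in the following example:

survey_df['YearsCode'].mean() # Raises an error, because this column has the object dtype
survey_df['YearsCode'] = survey_df['YearsCode'].astype(float) # Also raises an error: the value 'Less than 1 year' cannot be converted to float
survey_df['YearsCode'].unique() # Inspect the distinct values to find the special ones
survey_df['YearsCode'] = survey_df['YearsCode'].replace('Less than 1 year', 0) # Replace the special string with a number (assigning the result back is more reliable than inplace=True on a selected column)
survey_df['YearsCode'] = survey_df['YearsCode'].astype(float) # Convert the column to float
survey_df['YearsCode'].mean() # Mean
survey_df['YearsCode'].median() # Median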
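
When a column may contain other unexpected strings, astype(float) will fail again. One alternative is pd.to_numeric with errors='coerce', which turns anything that cannot be parsed into NaN. The following is a minimal sketch on a made-up Series (the values, including 'unknown', are assumptions for illustration and are not claimed to appear in the survey).

```python
import pandas as pd

# Made-up values for illustration only
years = pd.Series(['5', 'Less than 1 year', '12', None, 'unknown'])

# Map the known special value first, so it is kept as 0 instead of becoming NaN
cleaned = years.replace('Less than 1 year', '0')

# Anything that still cannot be parsed as a number becomes NaN
numeric_years = pd.to_numeric(cleaned, errors='coerce')

print(numeric_years)
print(numeric_years.mean())
```

The trade-off is that errors='coerce' hides unexpected values silently, so it is still worth checking unique() first.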

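Homework

Finally, a comprehensive exercise ties together what we have learned so far. The task is simple to state: using the WebframeWorkedWith and Country columns, find the most popular framework in each country.

First, explore the data in WebframeWorkedWith:

survey_df['WebframeWorkedWith'].head() # Values are separated by ; (semicolons)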
0                                            NaN
1                                            NaN
2    ASP.NET Core;jQuery;Laravel;React.js;Vue.js
3                                            NaN
4                                          Flask

Next, use the split function to break this column into multiple columns:

framework_df = survey_df['WebframeWorkedWith'].str.split(';', expand = True)
               0           1         2         3       4     5     6     7   \
0            None        None      None      None    None  None  None  None   
1            None        None      None      None    None  None  None  None   
2    ASP.NET Core      jQuery   Laravel  React.js  Vue.js  None  None  None   
3            None        None      None      None    None  None  None  None   
4           Flask        None      None      None    None  None  None  None   
..            ...         ...       ...       ...     ...   ...   ...   ...   
995       Angular  Angular.js  React.js      None    None  None  None  None   
996       Angular     Express    Spring      None    None  None  None  None   
997          None        None      None      None    None  None  None  None   
998          None        None      None      None    None  None  None  None   
999       Angular  Angular.js    jQuery   Laravel    None  None  None  None

Then find the names of all distinct frameworks with the following steps. Remember to convert the missing values to the string 'None' first:

framework_df.fillna('None', inplace = True)
distinct_frameworks = np.unique(framework_df.values)
distinct_frameworks
array(['ASP.NET', 'ASP.NET Core', 'Angular', 'Angular.js', 'Django',
       'Drupal', 'Express', 'Flask', 'Gatsby', 'Laravel', 'None',
       'React.js', 'Ruby on Rails', 'Spring', 'Symfony', 'Vue.js',
       'jQuery'], dtype=object)
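
The distinct names can also be collected without the 'None' placeholder. The following is a minimal sketch on a made-up Series (web and its values are assumptions for illustration). It splits each answer into a list and uses explode to put every framework on its own row.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
web = pd.Series(['ASP.NET Core;jQuery', np.nan, 'Flask', 'jQuery;React.js'])

# Split into lists, then give each list element its own row (NaN stays NaN)
exploded = web.str.split(';').explode()

# Drop the missing answers and collect the distinct names
distinct = sorted(exploded.dropna().unique())

print(distinct)
```

Because the missing values are dropped before sorting, there is no need to mix a 'None' string into the list of frameworks.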

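Next, group the original DataFrame with groupby:

country_groups = survey_df.groupby(['Country']) # Group the DataFrame by Country

Then build one column per framework, counting how many users in each country have used it:

framework_sum_array = []
for framework in distinct_frameworks:
    # str.contains matches substrings, so 'Angular' also counts 'Angular.js' and 'ASP.NET' also counts 'ASP.NET Core'; regex=False treats '.' literally
    new_df = country_groups['WebframeWorkedWith'].apply(lambda x: x.str.contains(framework, regex=False).sum())
    new_df.name = framework
    framework_sum_array.append(new_df)

user_count = survey_df['Country'].value_counts().rename('Country') # Respondents per country; the explicit name keeps the column called 'Country' across pandas versions
concated_df = pd.concat([user_count] + framework_sum_array, axis='columns')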
                Country  ASP.NET  ASP.NET Core  Angular  Angular.js  Django  \
United States       182       41            35       48          30      12   
India               141       22            12       41          28      16   
Germany              55        4             3       13           5       3   
United Kingdom       52        6             4        5           3       2   
Brazil               36        3             1        7           2       3   
...                 ...      ...           ...      ...         ...     ...   
Tunisia               1        0             0        0           0       0   
Somalia               1        1             0        0           0       0   
Uzbekistan            1        0             0        0           0       0   
Kazakhstan            1        0             0        0           0       0   
Benin                 1        0             0        0           0       0

Finally, use idxmax to find the most widely used framework in each row:

most_popular_df = concated_df.drop(columns = ['Country', 'None']).idxmax(axis = 1) # Name of the framework with the highest count in each row
most_popular_df.name = 'most_popular_framwork'
most_popular_df
United States       jQuery
India               jQuery
Germany             jQuery
United Kingdom    React.js
Brazil              jQuery
                    ...   
Tunisia              Flask
Somalia            ASP.NET
Uzbekistan         ASP.NET
Kazakhstan          Spring
Benin              Express
Name: most_popular_framwork, Length: 96, dtype: object

final_df = concated_df.join(most_popular_df)[['Country', 'most_popular_framwork']] # Keep only the respondent count per country and the most popular framework
print(final_df)
                Country most_popular_framwork
United States       182                jQuery
India               141                jQuery
Germany              55                jQuery
United Kingdom       52              React.js
Brazil               36                jQuery
...                 ...                   ...
Tunisia               1                 Flask
Somalia               1               ASP.NET
Uzbekistan            1               ASP.NET
Kazakhstan            1                Spring
Benin                 1               Express
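
Two details of the approach above are worth keeping in mind. str.contains matches substrings, so 'Angular' is also counted for answers that contain 'Angular.js', and idxmax returns the first column when every count in a row is 0. The following is a minimal sketch of an alternative based on explode and pd.crosstab, on a made-up DataFrame (toy_df and its values are assumptions for illustration). It counts exact framework names and leaves out countries with no framework answers.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({
    'Country': ['India', 'India', 'Brazil', 'Brazil', 'Peru'],
    'WebframeWorkedWith': ['Angular.js;jQuery', 'jQuery', 'Angular;Flask', 'Angular', np.nan],
})

# One row per (respondent, framework) pair
exploded = toy_df.assign(
    Framework=toy_df['WebframeWorkedWith'].str.split(';')
).explode('Framework')

# Respondents without an answer contribute no framework rows
exploded = exploded.dropna(subset=['Framework'])

# Rows are countries, columns are exact framework names, cells are user counts
counts = pd.crosstab(exploded['Country'], exploded['Framework'])
most_popular = counts.idxmax(axis=1)

print(counts)
print(most_popular)
```

In this toy data, India has no 'Angular' users even though one answer contains 'Angular.js', and Peru does not appear in the result because its only respondent gave no answer. Ties are still resolved by column order, so countries with very few respondents should be read with care.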
