《Pandas Cookbook》第02章 DataFrame基本操作

  1. 云栖社区>
  2. 博客>
  3. 正文

《Pandas Cookbook》第02章 DataFrame基本操作

seancheney 2018-09-24 19:58:00 浏览838
展开阅读全文

第01章 Pandas基础
第02章 DataFrame运算
第03章 数据分析入门
第04章 选取数据子集
第05章 布尔索引
第06章 索引对齐
第07章 分组聚合、过滤、转换
第08章 数据清理
第09章 合并Pandas对象
第10章 时间序列分析
第11章 用Matplotlib、Pandas、Seaborn进行可视化


In[1]: import pandas as pd
       import numpy as np
       pd.options.display.max_columns = 40

1. 选取多个DataFrame列

# 用列表选取多个列
 In[2]: movie = pd.read_csv('data/movie.csv')
        movie_actor_director = movie[['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']]
        movie_actor_director.head()
Out[2]: 
img_52ad200e151bd41e084aa4933e98116e.png
# 选取单列
 In[3]: movie[['director_name']].head()
Out[3]: 
img_a5402696674b19c3f9b195b901cf32c4.png
# 错误的选取多列的方式
 In[4]: movie['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-954222273e42> in <module>()
----> 1 movie['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

更多

# 将列表赋值给一个变量,便于多选
 In[6]: cols =['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
        movie_actor_director = movie[cols]
Out[6]: float64    13
        int64       3
        object     11
        dtype: int64
# 使用select_dtypes(),选取整数列
 In[7]: movie.select_dtypes(include=['int']).head()
Out[7]: 
img_1c82728b8e5edab94a60dd683dce4ac0.png
# 选取所有的数值列
 In[8]: movie.select_dtypes(include=['number']).head()
Out[8]: 
img_dc77f22622686920304fe58a4c5cb548.png
# 通过filter()函数过滤选取多列
 In[9]: movie.filter(like='facebook').head()
Out[9]: 
img_4e6c7d5feb22efe16b48a6623936f95c.png
# 通过正则表达式选取多列
 In[10]: movie.filter(regex='\d').head()
Out[10]: 
img_06ed46fa0b4d0380f24c4440c5d77768.png
# filter()函数,传递列表到参数items,选取多列
 In[11]: movie.filter(items=['actor_1_name', 'asdf']).head()
Out[11]: 
img_9038aa4856714ffc9da24972fcedf137.png

2. 对列名进行排序

# 读取movie数据集
 In[12]: movie = pd.read_csv('data/movie.csv')
 In[13]: movie.head()
Out[13]: 
img_544660cb0450760b3c259a072265bfbf.png
# 打印列索引
 In[14]: movie.columns
Out[14]: Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')
# 将列索引按照指定的顺序排列
 In[15]: disc_core = ['movie_title','title_year', 'content_rating','genres']
         disc_people = ['director_name','actor_1_name', 'actor_2_name','actor_3_name']
         disc_other = ['color','country','language','plot_keywords','movie_imdb_link']
         cont_fb = ['director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes',
                    'actor_3_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes']
         cont_finance = ['budget','gross']
         cont_num_reviews = ['num_voted_users','num_user_for_reviews', 'num_critic_for_reviews']
         cont_other = ['imdb_score','duration', 'aspect_ratio', 'facenumber_in_poster']

 In[16]: new_col_order = disc_core + disc_people + disc_other + \
                    cont_fb + cont_finance + cont_num_reviews + cont_other
         set(movie.columns) == set(new_col_order)
Out[16]: True

 In[17]: movie2 = movie[new_col_order]
         movie2.head()
Out[17]: 
img_493512693c50813db83ea3189e0833de.png

3. 在整个DataFrame上操作

 In[18]: pd.options.display.max_rows = 8
         movie = pd.read_csv('data/movie.csv')
         # 打印行数和列数
         movie.shape 
Out[18]: (4916, 28)
# 打印数据的个数
 In[19]: movie.size
Out[19]: 137648
# 该数据集的维度
 In[20]: movie.ndim
Out[20]: 2
# 该数据集的长度
 In[21]: len(movie)
Out[21]: 4916
# 各个列的值的个数
 In[22]: movie.count()
Out[22]: color                     4897
         director_name             4814
         num_critic_for_reviews    4867
         duration                  4901
                                   ... 
         actor_2_facebook_likes    4903
         imdb_score                4916
         aspect_ratio              4590
         movie_facebook_likes      4916
         Length: 28, dtype: int64
# 各列的最小值
 In[23]: movie.min()
Out[23]: num_critic_for_reviews     1.00
         duration                   7.00
         director_facebook_likes    0.00
         actor_3_facebook_likes     0.00
                                    ... 
         actor_2_facebook_likes     0.00
         imdb_score                 1.60
         aspect_ratio               1.18
         movie_facebook_likes       0.00
         Length: 16, dtype: float64
# 打印描述信息
 In[24]: movie.describe()
Out[24]: 
img_610557ced5ec5d54feeeb34ab5bd711b.png
# 使用percentiles参数指定分位数
 In[25]: pd.options.display.max_rows = 10
 In[26]: movie.describe(percentiles=[.01, .3, .99])
Out[26]: 
img_d1cca650bbf8eadb2a42c5114f2bb905.png
# 打印各列空值的个数
 In[27]: pd.options.display.max_rows = 8
 In[28]: movie.isnull().sum()
Out[28]: color                      19
         director_name             102
         num_critic_for_reviews     49
         duration                   15
                                   ... 
         actor_2_facebook_likes     13
         imdb_score                  0
         aspect_ratio              326
         movie_facebook_likes        0
         Length: 28, dtype: int64

更多

# 设定skipna=False,没有缺失值的数值列才会计算结果
 In[29]: movie.min(skipna=False)
Out[29]: num_critic_for_reviews     NaN
         duration                   NaN
         director_facebook_likes    NaN
         actor_3_facebook_likes     NaN
                                    ... 
         actor_2_facebook_likes     NaN
         imdb_score                 1.6
         aspect_ratio               NaN
         movie_facebook_likes       0.0
         Length: 16, dtype: float64

4. 串联DataFrame方法

# 使用isnull方法将每个值转变为布尔值
 In[30]: movie = pd.read_csv('data/movie.csv')
         movie.isnull().head()
Out[30]: 
img_460585c5d9df2c44143b98f81ce94804.png
# 使用sum统计布尔值,返回的是Series
 In[31]: movie.isnull().sum().head()
Out[31]: color                       19
         director_name              102
         num_critic_for_reviews      49
         duration                    15
         director_facebook_likes    102
         dtype: int64
# 对这个Series再使用sum,返回整个DataFrame的缺失值的个数,返回值是个标量
 In[32]: movie.isnull().sum().sum()
Out[32]: 2654
# 判断整个DataFrame有没有缺失值,方法是连着使用两个any
 In[33]: movie.isnull().any().any()
Out[33]: True

原理

# isnull返回同样大小的DataFrame,但所有的值变为布尔值
 In[34]: movie.isnull().get_dtype_counts()
Out[34]: bool    28
         dtype: int64

更多

# movie数据集的对象数据包含缺失值。默认条件下,聚合方法min、max、sum,不会返回任何值。
 In[35]: movie[['color', 'movie_title', 'color']].max()
Out[35]: Series([], dtype: float64)
# 要让pandas强行返回每列的值,必须填入缺失值。下面填入的是空字符串:
 In[36]: movie.select_dtypes(['object']).fillna('').max()
Out[36]: color                                                          Color
         director_name                                          Étienne Faure
         actor_2_name                                           Zubaida Sahar
         genres                                                       Western
                                                                           ...                        
         movie_imdb_link    [http://www.imdb.com/title/tt5574490/?ref_=fn_t...](http://www.imdb.com/title/tt5574490/?ref_=fn_t...)
         language                                                        Zulu
         country                                                 West Germany
         content_rating                                                     X
         Length: 12, dtype: object</pre>

5. 在DataFrame上使用运算符

# college数据集的值既有数值也有对象,整数5不能与字符串相加
 In[37]: college = pd.read_csv('data/college.csv')
         college + 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1175             result = expressions.evaluate(op, str_rep, x, y,
-> 1176                                           raise_on_error=True, **eval_kwargs)
   1177         except TypeError:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
    210         return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
--> 211                          **eval_kwargs)
    212     return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs)
    121     if result is None:
--> 122         result = _evaluate_standard(op, op_str, a, b, raise_on_error)
    123 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1183             with np.errstate(all='ignore'):
-> 1184                 result = get_result(other)
   1185 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get_result(other)
   1152             else:
-> 1153                 result = func(values, other)
   1154 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
   1201                     with np.errstate(all='ignore'):
-> 1202                         result[mask] = op(xrav, y)
   1203             else:

TypeError: must be str, not int

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-37-4749f68a2501> in <module>()
      1 college = pd.read_csv('data/college.csv')
----> 2 college + 5

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other, axis, level, fill_value)
   1239                 self = self.fillna(fill_value)
   1240 
-> 1241             return self._combine_const(other, na_op)
   1242 
   1243     f.__name__ = name

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _combine_const(self, other, func, raise_on_error)
   3541     def _combine_const(self, other, func, raise_on_error=True):
   3542         new_data = self._data.eval(func=func, other=other,
-> 3543                                    raise_on_error=raise_on_error)
   3544         return self._constructor(new_data)
   3545 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1189             raise
   1190         except Exception as detail:
-> 1191             result = handle_error()
   1192 
   1193         # technically a broadcast error in numpy can 'work' by returning a

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in handle_error()
   1172                 # The 'detail' variable is defined in outer scope.
   1173                 raise TypeError('Could not operate %s with block values %s' %
-> 1174                                 (repr(other), str(detail)))  # noqa
   1175             else:
   1176                 # return the values

TypeError: Could not operate 5 with block values must be str, not int
# 行索引名设为INSTNM,用UGDS_过滤出本科生的种族比例
In[38]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        college_ugds_ = college.filter(like='UGDS_')
In[39]: college == 'asdf' # 这是jn上的,想要比较college和‘asdf’,没有意义,忽略
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-697c8af60bcf> in <module>()
----> 1 college == 'asdf'

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other)
   1302             # straight boolean comparisions we want to allow all columns
   1303             # (regardless of dtype to pass thru) See #4537 for discussion.
-> 1304             res = self._combine_const(other, func, raise_on_error=False)
   1305             return res.fillna(True).astype(bool)
   1306 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _combine_const(self, other, func, raise_on_error)
   3541     def _combine_const(self, other, func, raise_on_error=True):
   3542         new_data = self._data.eval(func=func, other=other,
-> 3543                                    raise_on_error=raise_on_error)
   3544         return self._constructor(new_data)
   3545 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1203 
   1204                 raise TypeError('Could not compare [%s] with block values' %
-> 1205                                 repr(other))
   1206 
   1207         # transpose if needed

TypeError: Could not compare ['asdf'] with block values
# 查看前5行
 In[40]: college_ugds_.head()
Out[40]: 
img_c873a161e2edc78521c63b54e7441141.png
# 现在都是均质数据了,可以进行数值运算
 In[41]: college_ugds_.head() + .00501
Out[41]: 
img_9012a9bbdedfa8b98d52c2ac0366da00.png
# 用底除计算百分比分数
 In[42]: (college_ugds_.head() + .00501) // .01
Out[42]: 
img_60eb0894d4298f0bfab9adffcca62691.png
# 再除以100
 In[43]: college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
         college_ugds_op_round.head()
Out[43]: 
img_8e5e5bc6dfd502407f68e6c908573f0b.png
# 保留两位小数
 In[44]: college_ugds_round = (college_ugds_ + .00001).round(2)
         college_ugds_round.head()
Out[44]: 
img_91eb32e7d7d175e0668fdaefa48dc25f.png
 In[45]: .045 + .005
Out[45]: 0.049999999999999996
 In[46]: college_ugds_op_round.equals(college_ugds_round)
Out[46]: True

更多

# DataFrame的通用函数也可以实现上述方法
 In[47]: college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)

6. 比较缺失值

# Pandas使用NumPy NaN(np.nan)对象表示缺失值。这是一个不等于自身的特殊对象:
 In[48]: np.nan == np.nan
Out[48]: False
# Python的None对象是等于自身的
 In[49]: None == None
Out[49]: True
# 所有和np.nan的比较都返回False,除了不等于:
 In[50]: 5 > np.nan
Out[50]: False

 In[51]: np.nan > 5
Out[51]: False

 In[52]: 5 != np.nan
Out[52]: True
# college_ugds_所有值和.0019比较,返回布尔值DataFrame
 In[53]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
 In[54]: college_ugds_.head() == .0019
Out[54]: 
img_be583afe44ded4547d2ef7c1967da3ae.png
# 用DataFrame和DataFrame进行比较
 In[55]: college_self_compare = college_ugds_ == college_ugds_
         college_self_compare.head()
Out[55]: 
img_61cefc4a9fb7c749d949307945b5a91d.png
# 用all()检查是否所有的值都是True;这是因为缺失值不互相等于。
 In[56]: college_self_compare.all()
Out[56]: UGDS_WHITE    False
         UGDS_BLACK    False
         UGDS_HISP     False
         UGDS_ASIAN    False
                        ...  
         UGDS_NHPI     False
         UGDS_2MOR     False
         UGDS_NRA      False
         UGDS_UNKN     False
         Length: 9, dtype: bool
# 可以用==号判断,然后求和
 In[57]: (college_ugds_ == np.nan).sum()
Out[57]: UGDS_WHITE    0
         UGDS_BLACK    0
         UGDS_HISP     0
         UGDS_ASIAN    0
                      ..
         UGDS_NHPI     0
         UGDS_2MOR     0
         UGDS_NRA      0
         UGDS_UNKN     0
         Length: 9, dtype: int64
# 统计缺失值最主要方法是使用isnull方法:
 In[58]: college_ugds_.isnull().sum()
Out[58]: UGDS_WHITE    661
         UGDS_BLACK    661
         UGDS_HISP     661
         UGDS_ASIAN    661
                       ... 
         UGDS_NHPI     661
         UGDS_2MOR     661
         UGDS_NRA      661
         UGDS_UNKN     661
         Length: 9, dtype: int64
# 比较两个DataFrame最直接的方法是使用equals()方法
 In[59]: from pandas.testing import assert_frame_equal
 In[60]: assert_frame_equal(college_ugds_, college_ugds_)
Out[60]: True

更多

# eq()方法类似于==,和前面的equals有所不同
 In[61]: college_ugds_.eq(.0019).head()
Out[61]: 
img_7131a9f36178beb351773a4059ed189d.png

7. 矩阵转置

 In[62]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
         college_ugds_.head()
Out[62]: 
img_aa668d72d995819003e75c833d7f105d.png
# count()返回非缺失值的个数
 In[63]: college_ugds_.count()
Out[63]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64
# axis默认设为0
 In[64]: college_ugds_.count(axis=0)
Out[64]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64
# 等价于axis='index'
 In[65]: college_ugds_.count(axis='index')
Out[65]: UGDS_WHITE    6874
         UGDS_BLACK    6874
         UGDS_HISP     6874
         UGDS_ASIAN    6874
                       ... 
         UGDS_NHPI     6874
         UGDS_2MOR     6874
         UGDS_NRA      6874
         UGDS_UNKN     6874
         Length: 9, dtype: int64
# 统计每行的非缺失值个数
 In[66]: college_ugds_.count(axis='columns').head()
Out[66]: INSTNM
         Alabama A & M University               9
         University of Alabama at Birmingham    9
         Amridge University                     9
         University of Alabama in Huntsville    9
         Alabama State University               9
         dtype: int64
# 除了统计每行的非缺失值个数,也可以求和加以确认
 In[67]: college_ugds_.sum(axis='columns').head()
Out[67]: INSTNM
         Alabama A & M University               1.0000
         University of Alabama at Birmingham    0.9999
         Amridge University                     1.0000
         University of Alabama in Huntsville    1.0000
         Alabama State University               1.0000
         dtype: float64
# 用中位数了解每列的分布
 In[68]: college_ugds_.median(axis='index')
Out[68]: UGDS_WHITE    0.55570
         UGDS_BLACK    0.10005
         UGDS_HISP     0.07140
         UGDS_ASIAN    0.01290
                        ...   
         UGDS_NHPI     0.00000
         UGDS_2MOR     0.01750
         UGDS_NRA      0.00000
         UGDS_UNKN     0.01430
         Length: 9, dtype: float64

更多

# 使用累积求和cumsum()可以很容易看到白人、黑人、西班牙裔的比例
 In[69]: college_ugds_cumsum = college_ugds_.cumsum(axis=1)
         college_ugds_cumsum.head()
Out[69]: 
img_fa3ed3795a4c001e1b9e22bbb06c92b2.png
# UGDS_HISP一列降序排列
 In[70]: college_ugds_cumsum.sort_values('UGDS_HISP', ascending=False)
Out[70]: 
img_c8780b08243d2951d494294ad2002f6d.png

8. 确定大学校园多样性

# US News给出的美国10所最具多样性的大学
 In[71]: pd.read_csv('data/college_diversity.csv', index_col='School')
Out[71]: 
img_927e7038a9fd63e7e84305df6605780a.png
 In[72]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college_ugds_ = college.filter(like='UGDS_')
         college_ugds_.head()
Out[72]: 
img_ca164b40cb1b940030f964cfd40137d0.png
 In[73]: college_ugds_.isnull().sum(axis=1).sort_values(ascending=False).head()
Out[73]: INSTNM
         Excel Learning Center-San Antonio South         9
         Philadelphia College of Osteopathic Medicine    9
         Assemblies of God Theological Seminary          9
         Episcopal Divinity School                       9
         Phillips Graduate Institute                     9
         dtype: int64
# 如果所有列都是缺失值,则将其去除
 In[74]: college_ugds_ = college_ugds_.dropna(how='all')
 In[75]: college_ugds_.isnull().sum()
Out[75]: UGDS_WHITE    0
         UGDS_BLACK    0
         UGDS_HISP     0
         UGDS_ASIAN    0
                       ..
         UGDS_NHPI     0
         UGDS_2MOR     0
         UGDS_NRA      0
         UGDS_UNKN     0
         Length: 9, dtype: int64
# 用大于或等于方法ge(),将DataFrame变为布尔值矩阵
 In[76]: college_ugds_.ge(.15).head()
Out[76]: 
img_f387c44f44cca30168cb443cb6bdacd5.png
# 对所有True值求和
 In[77]: diversity_metric = college_ugds_.ge(.15).sum(axis='columns')
         diversity_metric.head()
Out[77]: INSTNM
         Alabama A & M University               1
         University of Alabama at Birmingham    2
         Amridge University                     3
         University of Alabama in Huntsville    1
         Alabama State University               1
         dtype: int64
# 使用value_counts(),查看分布情况
 In[78]: diversity_metric.value_counts()
Out[78]: 1    3042
         2    2884
         3     876
         4      63
         0       7
         5       2
         dtype: int64
# 查看哪些学校种群比例超过15%的数量多
 In[79]: diversity_metric.sort_values(ascending=False).head()
Out[79]: INSTNM
         Regency Beauty Institute-Austin          5
         Central Texas Beauty College-Temple      5
         Sullivan and Cogliano Training Center    4
         Ambria College of Nursing                4
         Berkeley College-New York                4
         dtype: int64
# 用loc()方法查看对应行索引的行
 In[80]: college_ugds_.loc[['Regency Beauty Institute-Austin', 
                          'Central Texas Beauty College-Temple']]
Out[80]:
img_21b93516e86c526dfa4adca528d13c47.png
# 查看US News前五所最具多样性的大学在diversity_metric中的情况
 In[81]: us_news_top = ['Rutgers University-Newark', 
               'Andrews University', 
               'Stanford University', 
               'University of Houston',
               'University of Nevada-Las Vegas']
 In[82]: diversity_metric.loc[us_news_top]
Out[82]: INSTNM
         Rutgers University-Newark         4
         Andrews University                3
         Stanford University               3
         University of Houston             3
         University of Nevada-Las Vegas    3
         dtype: int64

更多

# 可以用最大种群比例查看哪些学校最不具有多样性
 In[83]: college_ugds_.max(axis=1).sort_values(ascending=False).head(10)
Out[83]: INSTNM
         Dewey University-Manati                               1.0
         Yeshiva and Kollel Harbotzas Torah                    1.0
         Mr Leon's School of Hair Design-Lewiston              1.0
         Dewey University-Bayamon                              1.0
                                                               ... 
         Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
         Yeshiva Shaar Hatorah                                 1.0
         Bais Medrash Elyon                                    1.0
         Yeshiva of Nitra Rabbinical College                   1.0
         Length: 10, dtype: float64
# 查看Talmudical Seminary Oholei Torah哲学学校
 In[84]: college_ugds_.loc['Talmudical Seminary Oholei Torah']
Out[84]: UGDS_WHITE    1.0
         UGDS_BLACK    0.0
         UGDS_HISP     0.0
         UGDS_ASIAN    0.0
                       ... 
         UGDS_NHPI     0.0
         UGDS_2MOR     0.0
         UGDS_NRA      0.0
         UGDS_UNKN     0.0
         Name: Talmudical Seminary Oholei Torah, Length: 9, dtype: float64
# 查看是否有学校九个种族的比例都超过了1%
 In[85]: (college_ugds_ > .01).all(axis=1).any()
Out[85]: True

第01章 Pandas基础
第02章 DataFrame运算
第03章 数据分析入门
第04章 选取数据子集
第05章 布尔索引
第06章 索引对齐
第07章 分组聚合、过滤、转换
第08章 数据清理
第09章 合并Pandas对象
第10章 时间序列分析
第11章 用Matplotlib、Pandas、Seaborn进行可视化


网友评论

登录后评论
0/500
评论
seancheney
+ 关注