Skip to content

Python | Pandas | Hierarchical Indexing

Posted on:February 18, 2019

Series可以存储一维数据,DataFrame可以存储二维数据。多维数据可以利用Pandas的层级索引(多级索引)配合多个有着不同等级(level)的一级索引来表现。这样可以将高维数组转化为类似一维Series或二维DataFrame的形式。

Table of contents

1. 多级索引的创建

1.1 直接将index参数设置为至少二维的索引数组

# index数组内部包含另外两个数组
# 第一个数组:每行的第一层索引的顺序排列
# 第二个数组:每行的第二层索引的顺序排列
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0,100,(4,2)),
                  index=[['a','a','b','b'],[1,2,1,2]],
                  columns=['data1','data2'])
df
data1 data2
a 1 44 47
2 64 67
b 1 67 9
2 83 21

1.2 将元组作为键的字典传递给Pandas

data = {('California',2000):33871648,
        ('California',2010):37253956,
        ('Texas',2000):20851820,
        ('Texas',2010):25145561,
        ('New York',2000):18976457,
        ('New York',2010):19378102}
pd.Series(data)
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

1.3 显式创建索引对象后传入index参数

1) MultiIndex对象

以下3种创建方法等价,均产生同一个多级索引对象:

# from_arrays方法
pd.MultiIndex.from_arrays([['a','a','b','b'],
                          [1,2,1,2]])
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
# from_tuples方法
pd.MultiIndex.from_tuples([('a',1),('a',2),
                          ('b',1),('b',2)])
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
# from_product方法
pd.MultiIndex.from_product([['a','b'],
                           [1,2]])
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

注意:新版本的pandas不再支持直接使用MultiIndex(levels=xxx, labels=xxx)的方式创建多级索引对象,仅支持使用3种from函数创建:*A new MultiIndex is typically constructed using one of the helper methods MultiIndex.from_arrays(), MultiIndex.from_product() and MultiIndex.from_tuples(). 故以下方法报错:

# 构造函数
pd.MultiIndex(levels=[['a','b'],[1,2]],
             labels=[[0,0,1,1],[0,1,0,1]])
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

/var/folders/6b/77kzlkh11f959wbt2yf57jhm0000gn/T/ipykernel_70439/2062238608.py in <module>
      1 # 构造函数
----> 2 pd.MultiIndex(levels=[['a','b'],[1,2]],
      3              labels=[[0,0,1,1],[0,1,0,1]])


TypeError: __new__() got an unexpected keyword argument 'labels'

2) MultiIndex名称

# 数据准备
index = pd.MultiIndex.from_product([['California', 'Texas', 'New York'],
                                   [2000,2010]])
data = pd.Series([33871648,37253956,20851820,25145561,18976457,19378102], index=index)
data
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64
# 设置多级索引中各层索引的名称
data.index.names=['state','year']
data
state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

3) 多级列索引

# 数据准备
# 设置多级行索引
index = pd.MultiIndex.from_product([[2013,2014],[1,2]])
# 设置多级列索引
columns = pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR','Temp']])
# 数据模拟
rng=np.random.RandomState(0)
data = np.round(rng.randn(4,6),1)
data[:,::2]*=10
data += 37
# 创建df
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Bob Guido Sue
HR Temp HR Temp HR Temp
2013 1 55.0 37.4 47.0 39.2 56.0 36.0
2 47.0 36.8 36.0 37.4 38.0 38.5
2014 1 45.0 37.1 41.0 37.3 52.0 36.8
2 40.0 36.1 11.0 37.7 46.0 36.3

2. 取值和切片

2.1 Series的多级索引

1) 取值

# 数据准备
index = pd.MultiIndex.from_product([['California', 'Texas', 'New York','Washington'],
                                    [2000,2010,2015]], names=['state','year'])
data = pd.Series([33871648,37253956,20851820,25145561,18976457,19378102,
                  2678192,37281963,12783921,20987648,37628192,10296478], index=index)
data
state       year
California  2000    33871648
            2010    37253956
            2015    20851820
Texas       2000    25145561
            2010    18976457
            2015    19378102
New York    2000     2678192
            2010    37281963
            2015    12783921
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64
# California州2010年数据
data['California',2010]
37253956
# California州所有年份数据
data['California']
year
2000    33871648
2010    37253956
2015    20851820
dtype: int64
# 2000年所有州的数据
# 用较低层级的索引取值时,高层索引可以用空切片,但要注意此时索引应该排序
data[:,2000]
state
California    33871648
Texas         25145561
New York       2678192
Washington    20987648
dtype: int64

但还是强烈建议直接使用索引器indexer中的loc或iloc!

# 花哨索引
data.loc[:,[2000,2010]]
state       year
California  2000    33871648
Texas       2000    25145561
New York    2000     2678192
Washington  2000    20987648
California  2010    37253956
Texas       2010    18976457
New York    2010    37281963
Washington  2010    37628192
dtype: int64

2) 切片

data
state       year
California  2000    33871648
            2010    37253956
            2015    20851820
Texas       2000    25145561
            2010    18976457
            2015    19378102
New York    2000     2678192
            2010    37281963
            2015    12783921
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64
# 直接切片会报错(UnsortedIndexError),必须首先对索引排序
# data.loc["California":"Texas"]
data_sorted = data.sort_index()
data_sorted
state       year
California  2000    33871648
            2010    37253956
            2015    20851820
New York    2000     2678192
            2010    37281963
            2015    12783921
Texas       2000    25145561
            2010    18976457
            2015    19378102
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64
data_sorted.loc["California":"Texas"]
state       year
California  2000    33871648
            2010    37253956
            2015    20851820
New York    2000     2678192
            2010    37281963
            2015    12783921
Texas       2000    25145561
            2010    18976457
            2015    19378102
dtype: int64
data_sorted.loc[:, 2010:2015]
state       year
California  2010    37253956
            2015    20851820
New York    2010    37281963
            2015    12783921
Texas       2010    18976457
            2015    19378102
Washington  2010    37628192
            2015    10296478
dtype: int64

2.2 DataFrame的多级索引

health_data
Bob Guido Sue
HR Temp HR Temp HR Temp
2013 1 55.0 37.4 47.0 39.2 56.0 36.0
2 47.0 36.8 36.0 37.4 38.0 38.5
2014 1 45.0 37.1 41.0 37.3 52.0 36.8
2 40.0 36.1 11.0 37.7 46.0 36.3
health_data['Guido','HR']
2013  1    47.0
      2    36.0
2014  1    41.0
      2    11.0
Name: (Guido, HR), dtype: float64
health_data.iloc[:3, :3]
Bob Guido
HR Temp HR
2013 1 55.0 37.4 47.0
2 47.0 36.8 36.0
2014 1 45.0 37.1 41.0
# 多层级索引取值或切片
idx = pd.IndexSlice
health_data.loc[idx[:,1], idx[:,'HR']]
Bob Guido Sue
HR HR HR
2013 1 55.0 47.0 56.0
2014 1 45.0 41.0 52.0
health_data.loc[idx[2013],idx['Guido']]
HR Temp
1 47.0 39.2
2 36.0 37.4

3. 行列转换

3.1. 有序索引和无序索引

如果MultiIndex不是有序的索引——不按字典顺序的(lexographically),大多数切片操作都会失败。

rng = np.random.RandomState(0)
# data的第一层索引无序,切片操作会出错
index = pd.MultiIndex.from_arrays([['a','a','c','c','b','b'],
                                   [1,2,1,2,1,2]])
data = pd.Series(rng.randn(6), index=index)
data
a  1    1.764052
   2    0.400157
c  1    0.978738
   2    2.240893
b  1    1.867558
   2   -0.977278
dtype: float64
# 索引排序
data = data.sort_index()
data
a  1    1.764052
   2    0.400157
b  1    1.867558
   2   -0.977278
c  1    0.978738
   2    2.240893
dtype: float64

3.2 stack与unstack

data.unstack()
1 2
a 1.764052 0.400157
b 1.867558 -0.977278
c 0.978738 2.240893
data.unstack(level=0)
a b c
1 1.764052 1.867558 0.978738
2 0.400157 -0.977278 2.240893
data.unstack(level=1)
1 2
a 1.764052 0.400157
b 1.867558 -0.977278
c 0.978738 2.240893

stackunstack的逆操作。

3.3 索引的设置与重置

reset_index():还原索引,即:将索引(即使是多层索引)还原为默认的整型索引;返回DataFrame:

# 例1-1
index = pd.MultiIndex.from_product([['California', 'Texas', 'New York','Washington'],
                                    [2000,2010,2015]], names=['state','year'])
data = pd.Series([33871648,37253956,20851820,25145561,18976457,19378102,
                  2678192,37281963,12783921,20987648,37628192,10296478], index=index)
data
state       year
California  2000    33871648
            2010    37253956
            2015    20851820
Texas       2000    25145561
            2010    18976457
            2015    19378102
New York    2000     2678192
            2010    37281963
            2015    12783921
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64
# 例1-2, name参数只对Series对象的reset_index方法
data.reset_index(name='pop')
state year pop
0 California 2000 33871648
1 California 2010 37253956
2 California 2015 20851820
3 Texas 2000 25145561
4 Texas 2010 18976457
5 Texas 2015 19378102
6 New York 2000 2678192
7 New York 2010 37281963
8 New York 2015 12783921
9 Washington 2000 20987648
10 Washington 2010 37628192
11 Washington 2015 10296478
# 例2-1
df = pd.DataFrame([('bird',    389.0),
                   ('bird',     24.0),
                   ('mammal',   80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
df
class max_speed
falcon bird 389.0
parrot bird 24.0
lion mammal 80.5
monkey mammal NaN
df.reset_index()
index class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal NaN
# 丢弃原索引列
df.reset_index(drop=True)
class max_speed
0 bird 389.0
1 bird 24.0
2 mammal 80.5
3 mammal NaN

set_index():设置索引

data = pd.DataFrame({'a':['bar','bar','foo','foo'],
                     'b':['one','two','one','two'],
                     'c':['z','y','x','w'],'d':[1.0,2,3,4]})
data
a b c d
0 bar one z 1.0
1 bar two y 2.0
2 foo one x 3.0
3 foo two w 4.0
data.set_index('a')
b c d
a
bar one z 1.0
bar two y 2.0
foo one x 3.0
foo two w 4.0
data.set_index(['a','b'])
c d
a b
bar one z 1.0
two y 2.0
foo one x 3.0
two w 4.0
data.set_index(['b','a'])
c d
b a
one bar z 1.0
two bar y 2.0
one foo x 3.0
two foo w 4.0

4. 数据累计方法

设置参数level实现对子集的累积操作;配合axis可以对列索引进行类似的累积操作。

data = health_data.copy()
data
Bob Guido Sue
HR Temp HR Temp HR Temp
2013 1 55.0 37.4 47.0 39.2 56.0 36.0
2 47.0 36.8 36.0 37.4 38.0 38.5
2014 1 45.0 37.1 41.0 37.3 52.0 36.8
2 40.0 36.1 11.0 37.7 46.0 36.3
data.index.names=['year','visit']
data.columns.names=['subject','type']
data
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 55.0 37.4 47.0 39.2 56.0 36.0
2 47.0 36.8 36.0 37.4 38.0 38.5
2014 1 45.0 37.1 41.0 37.3 52.0 36.8
2 40.0 36.1 11.0 37.7 46.0 36.3
# 此方法即将被废弃
data.mean(level='year')
/var/folders/6b/77kzlkh11f959wbt2yf57jhm0000gn/T/ipykernel_70439/1113942518.py:2: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.median(level=1) should use df.groupby(level=1).median().
  data.mean(level='year')
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year
2013 51.0 37.1 41.5 38.3 47.0 37.25
2014 42.5 36.6 26.0 37.5 49.0 36.55
# 等价方法
data.groupby(level=0).mean()
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year
2013 51.0 37.1 41.5 38.3 47.0 37.25
2014 42.5 36.6 26.0 37.5 49.0 36.55
# 此方法即将被废弃
data.mean(level='type',axis=1)
/var/folders/6b/77kzlkh11f959wbt2yf57jhm0000gn/T/ipykernel_70439/3859318972.py:2: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.median(level=1) should use df.groupby(level=1).median().
  data.mean(level='type',axis=1)
type HR Temp
year visit
2013 1 52.666667 37.533333
2 40.333333 37.566667
2014 1 46.000000 37.066667
2 32.333333 36.700000
# 等价方法
# axis=1表示聚合维度在「列」
# level=1表示聚合的字段是「type」
data.groupby(level=1, axis=1).mean()
type HR Temp
year visit
2013 1 52.666667 37.533333
2 40.333333 37.566667
2014 1 46.000000 37.066667
2 32.333333 36.700000