Python | Pandas | Hierarchical Indexing

Series可以存储一维数据，DataFrame可以存储二维数据。多维数据可以利用Pandas的层级索引（多级索引）配合多个有着不同等级（level）的一级索引来表现。这样可以将高维数组转化为类似一维Series或二维DataFrame的形式。

1. 多级索引的创建
2. 取值和切片
- 2.1 Series的多级索引
  - 1) 取值
  - 2) 切片
- 2.2 DataFrame的多级索引
3. 行列转换
4. 数据累计方法

1. 多级索引的创建

1.1 直接将index参数设置为至少二维的索引数组

# index数组内部包含另外两个数组
# 第一个数组：每行的第一层索引的顺序排列
# 第二个数组：每行的第二层索引的顺序排列
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0,100,(4,2)),
                  index=[['a','a','b','b'],[1,2,1,2]],
                  columns=['data1','data2'])
df

		data1	data2
a	1	44	47
a	2	64	67
b	1	67	9
b	2	83	21

1.2 将元组作为键的字典传递给Pandas

data = {('California',2000):33871648,
        ('California',2010):37253956,
        ('Texas',2000):20851820,
        ('Texas',2010):25145561,
        ('New York',2000):18976457,
        ('New York',2010):19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

1.3 显式创建索引对象后传入index参数

1) MultiIndex对象

以下3种创建方法等价，均产生同一个多级索引对象：

# from_arrays方法
pd.MultiIndex.from_arrays([['a','a','b','b'],
                          [1,2,1,2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

# from_tuples方法
pd.MultiIndex.from_tuples([('a',1),('a',2),
                          ('b',1),('b',2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

# from_product方法
pd.MultiIndex.from_product([['a','b'],
                           [1,2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

注意：新版本的pandas不再支持直接使用MultiIndex(levels=xxx, labels=xxx)的方式创建多级索引对象，仅支持使用3种from函数创建：*A new MultiIndex is typically constructed using one of the helper methods MultiIndex.from_arrays(), MultiIndex.from_product() and MultiIndex.from_tuples(). 故以下方法报错：

# 构造函数
pd.MultiIndex(levels=[['a','b'],[1,2]],
             labels=[[0,0,1,1],[0,1,0,1]])

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

/var/folders/6b/77kzlkh11f959wbt2yf57jhm0000gn/T/ipykernel_70439/2062238608.py in <module>
      1 # 构造函数
----> 2 pd.MultiIndex(levels=[['a','b'],[1,2]],
      3              labels=[[0,0,1,1],[0,1,0,1]])


TypeError: __new__() got an unexpected keyword argument 'labels'

2) MultiIndex名称

# 数据准备
index = pd.MultiIndex.from_product([['California', 'Texas', 'New York'],
                                   [2000,2010]])
data = pd.Series([33871648,37253956,20851820,25145561,18976457,19378102], index=index)
data

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

# 设置多级索引中各层索引的名称
data.index.names=['state','year']
data

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

3) 多级列索引

# 数据准备
# 设置多级行索引
index = pd.MultiIndex.from_product([[2013,2014],[1,2]])
# 设置多级列索引
columns = pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR','Temp']])
# 数据模拟
rng=np.random.RandomState(0)
data = np.round(rng.randn(4,6),1)
data[:,::2]*=10
data += 37
# 创建df
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

		Bob		Guido		Sue
		HR	Temp	HR	Temp	HR	Temp
2013	1	55.0	37.4	47.0	39.2	56.0	36.0
2013	2	47.0	36.8	36.0	37.4	38.0	38.5
2014	1	45.0	37.1	41.0	37.3	52.0	36.8
2014	2	40.0	36.1	11.0	37.7	46.0	36.3

2. 取值和切片

2.1 Series的多级索引

1) 取值

# 数据准备
index = pd.MultiIndex.from_product([['California', 'Texas', 'New York','Washington'],
                                    [2000,2010,2015]], names=['state','year'])
data = pd.Series([33871648,37253956,20851820,25145561,18976457,19378102,
                  2678192,37281963,12783921,20987648,37628192,10296478], index=index)
data

state       year
California  2000    33871648
            2010    37253956
            2015    20851820
Texas       2000    25145561
            2010    18976457
            2015    19378102
New York    2000     2678192
            2010    37281963
            2015    12783921
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64

# California州2010年数据
data['California',2010]

37253956

# California州所有年份数据
data['California']

year
2000    33871648
2010    37253956
2015    20851820
dtype: int64

# 2000年所有州的数据
# 用较低层级的索引取值时，高层索引可以用空切片，但要注意此时索引应该排序
data[:,2000]

state
California    33871648
Texas         25145561
New York       2678192
Washington    20987648
dtype: int64

但还是强烈建议直接使用索引器indexer中的loc或iloc！

# 花哨索引
data.loc[:,[2000,2010]]

state       year
California  2000    33871648
Texas       2000    25145561
New York    2000     2678192
Washington  2000    20987648
California  2010    37253956
Texas       2010    18976457
New York    2010    37281963
Washington  2010    37628192
dtype: int64

2) 切片

data

state       year
California  2000    33871648
            2010    37253956
            2015    20851820
Texas       2000    25145561
            2010    18976457
            2015    19378102
New York    2000     2678192
            2010    37281963
            2015    12783921
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64

# 直接切片会报错(UnsortedIndexError)，必须首先对索引排序
# data.loc["California":"Texas"]

data_sorted = data.sort_index()
data_sorted

state       year
California  2000    33871648
            2010    37253956
            2015    20851820
New York    2000     2678192
            2010    37281963
            2015    12783921
Texas       2000    25145561
            2010    18976457
            2015    19378102
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64

data_sorted.loc["California":"Texas"]

state       year
California  2000    33871648
            2010    37253956
            2015    20851820
New York    2000     2678192
            2010    37281963
            2015    12783921
Texas       2000    25145561
            2010    18976457
            2015    19378102
dtype: int64

data_sorted.loc[:, 2010:2015]

state       year
California  2010    37253956
            2015    20851820
New York    2010    37281963
            2015    12783921
Texas       2010    18976457
            2015    19378102
Washington  2010    37628192
            2015    10296478
dtype: int64

2.2 DataFrame的多级索引

health_data

		Bob		Guido		Sue
		HR	Temp	HR	Temp	HR	Temp
2013	1	55.0	37.4	47.0	39.2	56.0	36.0
2013	2	47.0	36.8	36.0	37.4	38.0	38.5
2014	1	45.0	37.1	41.0	37.3	52.0	36.8
2014	2	40.0	36.1	11.0	37.7	46.0	36.3

health_data['Guido','HR']

2013  1    47.0
      2    36.0
2014  1    41.0
      2    11.0
Name: (Guido, HR), dtype: float64

health_data.iloc[:3, :3]

		Bob		Guido
		HR	Temp	HR
2013	1	55.0	37.4	47.0
2013	2	47.0	36.8	36.0
2014	1	45.0	37.1	41.0

# 多层级索引取值或切片
idx = pd.IndexSlice
health_data.loc[idx[:,1], idx[:,'HR']]

		Bob	Guido	Sue
		HR	HR	HR
2013	1	55.0	47.0	56.0
2014	1	45.0	41.0	52.0

health_data.loc[idx[2013],idx['Guido']]

	HR	Temp
1	47.0	39.2
2	36.0	37.4

3. 行列转换

3.1. 有序索引和无序索引

如果MultiIndex不是有序的索引——不按字典顺序的（lexographically），大多数切片操作都会失败。

rng = np.random.RandomState(0)
# data的第一层索引无序，切片操作会出错
index = pd.MultiIndex.from_arrays([['a','a','c','c','b','b'],
                                   [1,2,1,2,1,2]])
data = pd.Series(rng.randn(6), index=index)
data

a  1    1.764052
   2    0.400157
c  1    0.978738
   2    2.240893
b  1    1.867558
   2   -0.977278
dtype: float64

# 索引排序
data = data.sort_index()
data

a  1    1.764052
   2    0.400157
b  1    1.867558
   2   -0.977278
c  1    0.978738
   2    2.240893
dtype: float64

3.2 stack与unstack

data.unstack()

	1	2
a	1.764052	0.400157
b	1.867558	-0.977278
c	0.978738	2.240893

data.unstack(level=0)

	a	b	c
1	1.764052	1.867558	0.978738
2	0.400157	-0.977278	2.240893

data.unstack(level=1)

	1	2
a	1.764052	0.400157
b	1.867558	-0.977278
c	0.978738	2.240893

stack是unstack的逆操作。

3.3 索引的设置与重置

reset_index()：还原索引，即：将索引（即使是多层索引）还原为默认的整型索引；返回DataFrame：

# 例1-1
index = pd.MultiIndex.from_product([['California', 'Texas', 'New York','Washington'],
                                    [2000,2010,2015]], names=['state','year'])
data = pd.Series([33871648,37253956,20851820,25145561,18976457,19378102,
                  2678192,37281963,12783921,20987648,37628192,10296478], index=index)
data

state       year
California  2000    33871648
            2010    37253956
            2015    20851820
Texas       2000    25145561
            2010    18976457
            2015    19378102
New York    2000     2678192
            2010    37281963
            2015    12783921
Washington  2000    20987648
            2010    37628192
            2015    10296478
dtype: int64

# 例1-2, name参数只对Series对象的reset_index方法
data.reset_index(name='pop')

	state	year	pop
0	California	2000	33871648
1	California	2010	37253956
2	California	2015	20851820
3	Texas	2000	25145561
4	Texas	2010	18976457
5	Texas	2015	19378102
6	New York	2000	2678192
7	New York	2010	37281963
8	New York	2015	12783921
9	Washington	2000	20987648
10	Washington	2010	37628192
11	Washington	2015	10296478

# 例2-1
df = pd.DataFrame([('bird',    389.0),
                   ('bird',     24.0),
                   ('mammal',   80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
df

	class	max_speed
falcon	bird	389.0
parrot	bird	24.0
lion	mammal	80.5
monkey	mammal	NaN

df.reset_index()

	index	class	max_speed
0	falcon	bird	389.0
1	parrot	bird	24.0
2	lion	mammal	80.5
3	monkey	mammal	NaN

# 丢弃原索引列
df.reset_index(drop=True)

	class	max_speed
0	bird	389.0
1	bird	24.0
2	mammal	80.5
3	mammal	NaN

set_index()：设置索引

data = pd.DataFrame({'a':['bar','bar','foo','foo'],
                     'b':['one','two','one','two'],
                     'c':['z','y','x','w'],'d':[1.0,2,3,4]})
data

	a	b	c	d
0	bar	one	z	1.0
1	bar	two	y	2.0
2	foo	one	x	3.0
3	foo	two	w	4.0

data.set_index('a')

	b	c	d
a
bar	one	z	1.0
bar	two	y	2.0
foo	one	x	3.0
foo	two	w	4.0

data.set_index(['a','b'])

		c	d
a	b
bar	one	z	1.0
bar	two	y	2.0
foo	one	x	3.0
foo	two	w	4.0

data.set_index(['b','a'])

		c	d
b	a
one	bar	z	1.0
two	bar	y	2.0
one	foo	x	3.0
two	foo	w	4.0

4. 数据累计方法

设置参数level实现对子集的累积操作；配合axis可以对列索引进行类似的累积操作。

data = health_data.copy()
data

		Bob		Guido		Sue
		HR	Temp	HR	Temp	HR	Temp
2013	1	55.0	37.4	47.0	39.2	56.0	36.0
2013	2	47.0	36.8	36.0	37.4	38.0	38.5
2014	1	45.0	37.1	41.0	37.3	52.0	36.8
2014	2	40.0	36.1	11.0	37.7	46.0	36.3

data.index.names=['year','visit']
data.columns.names=['subject','type']

data

	subject	Bob		Guido		Sue
	type	HR	Temp	HR	Temp	HR	Temp
year	visit
2013	1	55.0	37.4	47.0	39.2	56.0	36.0
2013	2	47.0	36.8	36.0	37.4	38.0	38.5
2014	1	45.0	37.1	41.0	37.3	52.0	36.8
2014	2	40.0	36.1	11.0	37.7	46.0	36.3

# 此方法即将被废弃
data.mean(level='year')

/var/folders/6b/77kzlkh11f959wbt2yf57jhm0000gn/T/ipykernel_70439/1113942518.py:2: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.median(level=1) should use df.groupby(level=1).median().
  data.mean(level='year')

subject	Bob		Guido		Sue
type	HR	Temp	HR	Temp	HR	Temp
year
2013	51.0	37.1	41.5	38.3	47.0	37.25
2014	42.5	36.6	26.0	37.5	49.0	36.55

# 等价方法
data.groupby(level=0).mean()

subject	Bob		Guido		Sue
type	HR	Temp	HR	Temp	HR	Temp
year
2013	51.0	37.1	41.5	38.3	47.0	37.25
2014	42.5	36.6	26.0	37.5	49.0	36.55

# 此方法即将被废弃
data.mean(level='type',axis=1)

/var/folders/6b/77kzlkh11f959wbt2yf57jhm0000gn/T/ipykernel_70439/3859318972.py:2: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.median(level=1) should use df.groupby(level=1).median().
  data.mean(level='type',axis=1)

	type	HR	Temp
year	visit
2013	1	52.666667	37.533333
2013	2	40.333333	37.566667
2014	1	46.000000	37.066667
2014	2	32.333333	36.700000

# 等价方法
# axis=1表示聚合维度在「列」
# level=1表示聚合的字段是「type」
data.groupby(level=1, axis=1).mean()

	type	HR	Temp
year	visit
2013	1	52.666667	37.533333
2013	2	40.333333	37.566667
2014	1	46.000000	37.066667
2014	2	32.333333	36.700000