Python | Pandas | Missing Data

Posted on: February 18, 2019

1. NaN

To Pandas, however, None and NaN are "basically" equivalent:

import numpy as np
import pandas as pd

data = pd.Series([1, None, np.nan])
data
0    1.0
1    NaN
2    NaN
dtype: float64
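
The word "basically" matters here: how None is stored depends on the dtype of the container. The following is a minimal sketch; the variable names are chosen only for illustration, and the object dtype is requested explicitly so the behavior does not depend on the default string dtype of the installed pandas version.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numeric = pd.Series([1, None])

# In an object Series, None is stored as-is (dtype=object is set explicitly here).
text = pd.Series(["a", None], dtype=object)

# Either way, isnull() treats both None and NaN as missing.
numeric_mask = numeric.isnull().tolist()  # [False, True]
text_mask = text.isnull().tolist()        # [False, True]

# Outside Pandas the two are different things: NaN is a float that does not
# compare equal to itself, while None is a plain Python singleton.
nan_equals_itself = (np.nan == np.nan)    # False
```

So the equivalence holds for missing-value detection, not for the underlying Python objects.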

2. Detection

isnull() and notnull() each return a Boolean mask.

# Prepare the data
np.random.seed(23)
data = pd.DataFrame(np.random.randint(0,100,(4,5)))
data[data<50] = np.nan
data
    0     1     2     3     4
0  83   NaN  73.0  54.0   NaN
1  76  91.0   NaN  90.0   NaN
2  51   NaN   NaN   NaN   NaN
3  66  75.0  85.0  69.0  64.0
data.isnull()
       0      1      2      3      4
0  False   True  False  False   True
1  False  False   True  False   True
2  False   True   True   True   True
3  False  False  False  False  False
data.notnull()
      0      1      2      3      4
0  True  False   True   True  False
1  True   True  False   True  False
2  True  False  False  False  False
3  True   True   True   True   True
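
Because the mask is an ordinary Boolean DataFrame, it can be summed to count missing values or used to select data. Here is a small sketch on a hand-built frame; the frame and the column names a and b are made up for illustration and are not from the post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# True counts as 1, so summing the mask counts the NaNs in each column.
missing_per_column = df.isnull().sum()

# any(axis=1) marks the rows that contain at least one NaN.
rows_with_missing = df[df.isnull().any(axis=1)]

# A Series mask can index the Series itself to keep only the valid values.
valid_a = df["a"][df["a"].notnull()]
```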

3. Dropping

.dropna() is straightforward for a Series, but for a DataFrame there is more to consider:

By default, dropna() drops every row that contains at least one missing value.
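
A minimal sketch of the default behavior, using a small made-up frame (the column names a, b, c are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# Series: the missing entries are simply removed.
s = df["a"].dropna()          # index labels 0 and 2 remain

# DataFrame: with no arguments, any row holding at least one NaN is dropped.
complete_rows = df.dropna()   # only row 0 remains
```

Like most Pandas methods, dropna() returns a new object here; df itself still has three rows.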

The behavior can be customized with the axis, how, and thresh parameters.
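
The sketch below shows the three parameters that most commonly tune dropna(): axis, how, and thresh. The frame is hypothetical and built so that each option gives a different result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
    "d": [np.nan, np.nan, np.nan],
})

# axis=1 drops columns instead of rows; only "c" has no NaN at all.
no_nan_columns = df.dropna(axis=1)

# how="all" drops a column only if every value in it is NaN; only "d" qualifies.
not_all_nan_columns = df.dropna(axis=1, how="all")

# thresh=2 keeps rows with at least 2 non-NaN values; row 1 has only one.
rows_with_two_values = df.dropna(thresh=2)
```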

4. Filling

.fillna() returns a copy; the original object is left unchanged.
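
A quick way to see the copy semantics, sketched on a throwaway Series (the names s and filled are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0)

# The original still holds its NaN; only the returned object is filled.
original_missing = int(s.isnull().sum())     # 1
filled_missing = int(filled.isnull().sum())  # 0

# To keep the result under the same name, assign it back.
s = s.fillna(0)
```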

# Prepare the data
np.random.seed(0)
data = pd.DataFrame(np.random.randint(0,100,(4,5)))
data = data[data>50]
data
      0     1     2     3   4
0   NaN   NaN  64.0  67.0  67
1   NaN  83.0   NaN   NaN  87
2  70.0  88.0  88.0   NaN  58
3  65.0   NaN  87.0   NaN  88
# Fill missing values with a specific value
data.fillna(-1)
      0     1     2     3   4
0  -1.0  -1.0  64.0  67.0  67
1  -1.0  83.0  -1.0  -1.0  87
2  70.0  88.0  88.0  -1.0  58
3  65.0  -1.0  87.0  -1.0  88
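# Set the method parameter to 'ffill', short for forward-fill
# Each NaN is filled with the valid value "before" it. By default (axis=0) this works down each column, so "before" means "above"
# Note: filling runs front to back. If nothing precedes the NaN, it stays NaN; if the preceding value is itself NaN, look one step further back until a valid value is found or the values run out
# (pandas 2.1 and later deprecate the method argument of fillna; data.ffill() is the equivalent call)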
data.fillna(method='ffill')
      0     1     2     3   4
0   NaN   NaN  64.0  67.0  67
1   NaN  83.0  64.0  67.0  87
2  70.0  88.0  88.0  67.0  58
3  65.0  88.0  87.0  67.0  88
# Set axis=1 to fill along each row instead:
data.fillna(method='ffill', axis=1)
      0     1     2     3     4
0   NaN   NaN  64.0  67.0  67.0
1   NaN  83.0  83.0  83.0  87.0
2  70.0  88.0  88.0  88.0  58.0
3  65.0  65.0  87.0  87.0  88.0
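
The forward-fill rules described above are easiest to check on a single Series. This sketch uses the ffill() shorthand method rather than fillna(method='ffill'). The bfill() call is not covered in the post and is included only as the mirror-image case for contrast.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward-fill: the leading NaN has nothing before it, so it stays NaN;
# the two consecutive NaNs both take the last valid value, 1.0.
forward = s.ffill()

# Backward-fill: each NaN takes the next valid value after it instead.
backward = s.bfill()
```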