将不同数据源进行合并:
Concat
:将两个数据集非常简单地拼接在一起(默认vstack式拼接)。
Merge
:像数据库语言那样连接与合并具有重叠字段的数据集。
Table of contents
1. Concat
# 定义一个能够返回指定Dataframe的函数
# 参数分别为列名与行索引
def make_df(cols, ind):
data = {c:[str(c) + str(i) for i in ind] for c in cols}
return pd.DataFrame(data, ind)
# Test
make_df('ABC', range(3))
|
A |
B |
C |
0 |
A0 |
B0 |
C0 |
1 |
A1 |
B1 |
C1 |
2 |
A2 |
B2 |
C2 |
1.1 简易合并
类似于np.concatenate([...])
:
# 合并两个Series
ser1 = pd.Series(['A','B','C'], index=[1,2,3])
ser2 = pd.Series(['D','E','F'], index=[4,5,6])
pd.concat([ser1,ser2])
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
# 合并时保留重复索引
ser1 = pd.Series(['A','B','C'], index=[1,2,3])
ser2 = pd.Series(['D','E','F'], index=[1,2,3])
pd.concat([ser1,ser2])
1 A
2 B
3 C
1 D
2 E
3 F
dtype: object
# 合并Dataframe,默认逐行进行,类似vstack
df1 = make_df('AB',[1,2])
df2 = make_df('AB',[3,4])
print(df1, end='\n============\n');
print(df2, end='\n============\n');
print(pd.concat([df1, df2]))
A B
1 A1 B1
2 A2 B2
============
A B
3 A3 B3
4 A4 B4
============
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
# 设置合并坐标轴
# axis=1等价于axis=‘col’
df1 = make_df('AB',[1,2])
df2 = make_df('CD',[1,2])
print(df1, end='\n============\n');
print(df2, end='\n============\n');
print(pd.concat([df1, df2], axis=1))
A B
1 A1 B1
2 A2 B2
============
C D
1 C1 D1
2 C2 D2
============
A B C D
1 A1 B1 C1 D1
2 A2 B2 C2 D2
1.2 索引重复
# 重复的行索引被保留
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
pd.concat([x,y])
|
A |
B |
0 |
A0 |
B0 |
1 |
A1 |
B1 |
0 |
A0 |
B0 |
1 |
A1 |
B1 |
1) verify_inyegrity:抛出异常
# 设置verify_inyegrity为True,合并时若有索引重复会触发异常
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
try:
pd.concat([x,y], verify_integrity=True)
except Exception as e:
print(e)
Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
2) 忽略索引
# 设置ignore_index为True,合并时创建新的整数索引
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
pd.concat([x,y], ignore_index=True)
|
A |
B |
0 |
A0 |
B0 |
1 |
A1 |
B1 |
2 |
A0 |
B0 |
3 |
A1 |
B1 |
3) 增加多级索引
# 设置keys参数为数据源设置多级索引标签
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
pd.concat([x,y], keys=['x','y'])
|
|
A |
B |
x |
0 |
A0 |
B0 |
1 |
A1 |
B1 |
y |
0 |
A0 |
B0 |
1 |
A1 |
B1 |
1.3 交并集
合并数据时,数据不一定具有一模一样相同的列名,只会有一些列名相同:
df1 = make_df('ABC', [1,2])
df2 = make_df('BCD', [3,4])
pd.concat([df1,df2])
|
A |
B |
C |
D |
1 |
A1 |
B1 |
C1 |
NaN |
2 |
A2 |
B2 |
C2 |
NaN |
3 |
NaN |
B3 |
C3 |
D3 |
4 |
NaN |
B4 |
C4 |
D4 |
默认的合并方式是对所有输入列进行并集合并,即join='outer'
;修改join='inner'
实现对所有输入列的交集合并:
pd.concat([df1,df2], join='inner')
|
B |
C |
1 |
B1 |
C1 |
2 |
B2 |
C2 |
3 |
B3 |
C3 |
4 |
B4 |
C4 |
1.4 append方法
- Series对象和DataFrame对象都支持
append
方法,如df1.append(df2)
,效果和pd.concat([df1,df2])
一样。
- 但Pandas的
append
方法不直接更新原有对象的值,而是为合并后的数据创建一个新的对象——有点浪费空间。
2. Merge
2.1 合并类型
三种合并类型:one-to-one, many-to-one, many-to-many
1) one-to-one: 一对一
“左边的”和“右边的”key都是各异的:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', df1, 'df2', df2)
'df1'
|
employee |
group |
0 |
Bob |
Accounting |
1 |
Jake |
Engineering |
2 |
Lisa |
Engineering |
3 |
Sue |
HR |
'df2'
|
employee |
hire_date |
0 |
Lisa |
2004 |
1 |
Bob |
2008 |
2 |
Jake |
2012 |
3 |
Sue |
2014 |
df3 = pd.merge(df1, df2)
display('df3', df3)
'df3'
|
employee |
group |
hire_date |
0 |
Bob |
Accounting |
2008 |
1 |
Jake |
Engineering |
2012 |
2 |
Lisa |
Engineering |
2004 |
3 |
Sue |
HR |
2014 |
pd.merge
函数识别出,df1与df2都含有一个名为employee
的列,因此pandas自动将其作为连接的key。
2) many-to-one: 多对一
“左边的”key列有重复值,“右边的”key列是各异的。
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', df3, 'df4', df4)
'df3'
|
employee |
group |
hire_date |
0 |
Bob |
Accounting |
2008 |
1 |
Jake |
Engineering |
2012 |
2 |
Lisa |
Engineering |
2004 |
3 |
Sue |
HR |
2014 |
'df4'
|
group |
supervisor |
0 |
Accounting |
Carly |
1 |
Engineering |
Guido |
2 |
HR |
Steve |
display('pd.merge(df3, df4)', pd.merge(df3, df4))
'pd.merge(df3, df4)'
|
employee |
group |
hire_date |
supervisor |
0 |
Bob |
Accounting |
2008 |
Carly |
1 |
Jake |
Engineering |
2012 |
Guido |
2 |
Lisa |
Engineering |
2004 |
Guido |
3 |
Sue |
HR |
2014 |
Steve |
3) many-to-many: 多对多
“左边的”key列有重复值,“右边的”key列也有重复值。
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display("df1",df1,"df5",df5)
'df1'
|
employee |
group |
0 |
Bob |
Accounting |
1 |
Jake |
Engineering |
2 |
Lisa |
Engineering |
3 |
Sue |
HR |
'df5'
|
group |
skills |
0 |
Accounting |
math |
1 |
Accounting |
spreadsheets |
2 |
Engineering |
coding |
3 |
Engineering |
linux |
4 |
HR |
spreadsheets |
5 |
HR |
organization |
display("pd.merge(df1,df5)", pd.merge(df1,df5))
'pd.merge(df1,df5)'
|
employee |
group |
skills |
0 |
Bob |
Accounting |
math |
1 |
Bob |
Accounting |
spreadsheets |
2 |
Jake |
Engineering |
coding |
3 |
Jake |
Engineering |
linux |
4 |
Lisa |
Engineering |
coding |
5 |
Lisa |
Engineering |
linux |
6 |
Sue |
HR |
spreadsheets |
7 |
Sue |
HR |
organization |
But in practice, datasets are rarely as clean as the one we’re working with here.
2.2 合并的键
pd.merge()
looks for one or more matching column names between the two inputs, and uses this as the key.
该函数自动寻找两个输入间相同的列名,并将其作为key列。但有时候,列名的匹配并不是那么完美,例如:有多个相互匹配的key列,理想的两个key列名称不同等。
需要使用merge的更多参数进行设置。
1) on
Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names.
可以使用on
参数,指定特定的列作为key。
display('df1', df1, 'df2', df2)
display("pd.merge(df1, df2, on='employee')", pd.merge(df1, df2, on='employee'))
'df1'
|
employee |
group |
0 |
Bob |
Accounting |
1 |
Jake |
Engineering |
2 |
Lisa |
Engineering |
3 |
Sue |
HR |
'df2'
|
employee |
hire_date |
0 |
Lisa |
2004 |
1 |
Bob |
2008 |
2 |
Jake |
2012 |
3 |
Sue |
2014 |
"pd.merge(df1, df2, on='employee')"
|
employee |
group |
hire_date |
0 |
Bob |
Accounting |
2008 |
1 |
Jake |
Engineering |
2012 |
2 |
Lisa |
Engineering |
2004 |
3 |
Sue |
HR |
2014 |
2) left_on & right_on
At times you may wish to merge two datasets with different column names. In this case, we can use the left_on and right_on keywords to specify the two column names.
两个key列的名称不同,使用left_on和right_on指明列名。
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
result = pd.merge(df1, df3, left_on='employee', right_on='name')
display('df1', df1, 'df3', df3,
"pd.merge(df1, df3, left_on='employee', right_on='name')", result)
'df1'
|
employee |
group |
0 |
Bob |
Accounting |
1 |
Jake |
Engineering |
2 |
Lisa |
Engineering |
3 |
Sue |
HR |
'df3'
|
name |
salary |
0 |
Bob |
70000 |
1 |
Jake |
80000 |
2 |
Lisa |
120000 |
3 |
Sue |
90000 |
"pd.merge(df1, df3, left_on='employee', right_on='name')"
|
employee |
group |
name |
salary |
0 |
Bob |
Accounting |
Bob |
70000 |
1 |
Jake |
Engineering |
Jake |
80000 |
2 |
Lisa |
Engineering |
Lisa |
120000 |
3 |
Sue |
HR |
Sue |
90000 |
3) left_index & right_index
Sometimes you would instead like to merge on an index.
有时候会直接使用索引进行合并。
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', df1a, 'df2a', df2a)
'df1a'
|
group |
employee |
|
Bob |
Accounting |
Jake |
Engineering |
Lisa |
Engineering |
Sue |
HR |
'df2a'
|
hire_date |
employee |
|
Lisa |
2004 |
Bob |
2008 |
Jake |
2012 |
Sue |
2014 |
display(pd.merge(df1a, df2a, left_index=True, right_index=True))
|
group |
hire_date |
employee |
|
|
Bob |
Accounting |
2008 |
Jake |
Engineering |
2012 |
Lisa |
Engineering |
2004 |
Sue |
HR |
2014 |
# 或者直接调用dataframe对象的join方法,默认按照索引进行合并
df1a.join(df2a)
|
group |
hire_date |
employee |
|
|
Bob |
Accounting |
2008 |
Jake |
Engineering |
2012 |
Lisa |
Engineering |
2004 |
Sue |
HR |
2014 |
If you’d like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior. 可以混合使用索引和列名。
pd.merge(df1a, df3, left_index=True, right_on='name')
|
group |
name |
salary |
0 |
Accounting |
Bob |
70000 |
1 |
Engineering |
Jake |
80000 |
2 |
Engineering |
Lisa |
120000 |
3 |
HR |
Sue |
90000 |
2.3 Specifying Set Arithmetic for Joins: 合并规则
The type of set arithmetic used in the join: this comes up when a value appears in one key column but not the other. 一个值在一侧的key列中存在,但却不存在于另一侧的key列中。
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6',df6)
display('df7',df7)
'df6'
|
name |
food |
0 |
Peter |
fish |
1 |
Paul |
beans |
2 |
Mary |
bread |
'df7'
|
name |
drink |
0 |
Mary |
wine |
1 |
Joseph |
beer |
pd.merge(df6, df7)
|
name |
food |
drink |
0 |
Mary |
bread |
wine |
1) inner join: 取交集
上述合并的例子中,df6和df7以各自的name列为key,但是它们的key列中只有一个元素是相同的。
By default, the result contains the intersection of the two sets of inputs, which is known as an inner join.
默认,取交集,即inner join。
We can specify this explicitly using the how keyword, which defaults to “inner”.
设置how参数处理这种情况,默认是inner。
pd.merge(df6, df7, how='inner')
|
name |
food |
drink |
0 |
Mary |
bread |
wine |
2) outer join: 取并集
An outer join returns a join over the union of the input columns, and fills in all missing values with NAs.
pd.merge(df6, df7, how='outer')
|
name |
food |
drink |
0 |
Peter |
fish |
NaN |
1 |
Paul |
beans |
NaN |
2 |
Mary |
bread |
wine |
3 |
Joseph |
NaN |
beer |
3) left (outer) join
The left join return joins over the left entries: The output rows now correspond to the entries in the left input.
pd.merge(df6, df7, how='left')
|
name |
food |
drink |
0 |
Peter |
fish |
NaN |
1 |
Paul |
beans |
NaN |
2 |
Mary |
bread |
wine |
pd.merge(df7, df6, how='left')
|
name |
drink |
food |
0 |
Mary |
wine |
bread |
1 |
Joseph |
beer |
NaN |
4) right (outer) join
pd.merge(df6, df7, how='right')
|
name |
food |
drink |
0 |
Mary |
bread |
wine |
1 |
Joseph |
NaN |
beer |
2.4 Overlapping Column Names: suffixes参数
Maybe your two input DataFrames have conflicting column names.
两个数据有重复的列名(除了key列外)。
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
display('df8', df8)
display('df9', df9)
'df8'
|
name |
rank |
0 |
Bob |
1 |
1 |
Jake |
2 |
2 |
Lisa |
3 |
3 |
Sue |
4 |
'df9'
|
name |
rank |
0 |
Bob |
3 |
1 |
Jake |
1 |
2 |
Lisa |
4 |
3 |
Sue |
2 |
pd.merge(df8, df9, on='name')
|
name |
rank_x |
rank_y |
0 |
Bob |
1 |
3 |
1 |
Jake |
2 |
1 |
2 |
Lisa |
3 |
4 |
3 |
Sue |
4 |
2 |
The merge function automatically appends a suffix _x or _y to make the output columns unique.
If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword:
# 自定义后缀名
pd.merge(df8, df9, on='name', suffixes=['_1','_2'])
|
name |
rank_1 |
rank_2 |
0 |
Bob |
1 |
3 |
1 |
Jake |
2 |
1 |
2 |
Lisa |
3 |
4 |
3 |
Sue |
4 |
2 |