Skip to content

Python | Pandas | Concat & Merge

Posted on:February 18, 2019

将不同数据源进行合并:

Table of contents

1. Concat

# 定义一个能够返回指定Dataframe的函数
# 参数分别为列名与行索引
def make_df(cols, ind):
    data = {c:[str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
# Test
make_df('ABC', range(3))
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2

1.1 简易合并

类似于np.concatenate([...])

# 合并两个Series
ser1 = pd.Series(['A','B','C'], index=[1,2,3])
ser2 = pd.Series(['D','E','F'], index=[4,5,6])
pd.concat([ser1,ser2])
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
# 合并时保留重复索引
ser1 = pd.Series(['A','B','C'], index=[1,2,3])
ser2 = pd.Series(['D','E','F'], index=[1,2,3])
pd.concat([ser1,ser2])
1    A
2    B
3    C
1    D
2    E
3    F
dtype: object
# 合并Dataframe,默认逐行进行,类似vstack
df1 = make_df('AB',[1,2])
df2 = make_df('AB',[3,4])
print(df1, end='\n============\n');
print(df2, end='\n============\n');
print(pd.concat([df1, df2]))
    A   B
1  A1  B1
2  A2  B2
============
    A   B
3  A3  B3
4  A4  B4
============
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
# 设置合并坐标轴
# axis=1等价于axis=‘col’
df1 = make_df('AB',[1,2])
df2 = make_df('CD',[1,2])
print(df1, end='\n============\n');
print(df2, end='\n============\n');
print(pd.concat([df1, df2], axis=1))
    A   B
1  A1  B1
2  A2  B2
============
    C   D
1  C1  D1
2  C2  D2
============
    A   B   C   D
1  A1  B1  C1  D1
2  A2  B2  C2  D2

1.2 索引重复

# 重复的行索引被保留
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
pd.concat([x,y])
A B
0 A0 B0
1 A1 B1
0 A0 B0
1 A1 B1

1) verify_inyegrity:抛出异常

# 设置verify_inyegrity为True,合并时若有索引重复会触发异常
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
try:
    pd.concat([x,y], verify_integrity=True)
except Exception as e:
    print(e)
Indexes have overlapping values: Int64Index([0, 1], dtype='int64')

2) 忽略索引

# 设置ignore_index为True,合并时创建新的整数索引
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
pd.concat([x,y], ignore_index=True)
A B
0 A0 B0
1 A1 B1
2 A0 B0
3 A1 B1

3) 增加多级索引

# 设置keys参数为数据源设置多级索引标签
x = make_df('AB',[0,1])
y = make_df('AB',[0,1])
pd.concat([x,y], keys=['x','y'])
A B
x 0 A0 B0
1 A1 B1
y 0 A0 B0
1 A1 B1

1.3 交并集

合并数据时,数据不一定具有一模一样相同的列名,只会有一些列名相同:

df1 = make_df('ABC', [1,2])
df2 = make_df('BCD', [3,4])
pd.concat([df1,df2])
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4

默认的合并方式是对所有输入列进行并集合并,即join='outer';修改join='inner'实现对所有输入列的交集合并

pd.concat([df1,df2], join='inner')
B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4

1.4 append方法

2. Merge

2.1 合并类型

三种合并类型:one-to-one, many-to-one, many-to-many

1) one-to-one: 一对一

“左边的”和“右边的”key都是各异的:

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', df1, 'df2', df2)
'df1'
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
'df2'
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
df3 = pd.merge(df1, df2)
display('df3', df3)
'df3'
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

pd.merge函数识别出,df1与df2都含有一个名为employee的列,因此pandas自动将其作为连接的key。

2) many-to-one: 多对一

“左边的”key列有重复值,“右边的”key列是各异的。

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', df3, 'df4', df4)
'df3'
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
'df4'
group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve
display('pd.merge(df3, df4)', pd.merge(df3, df4))
'pd.merge(df3, df4)'
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve

3) many-to-many: 多对多

“左边的”key列有重复值,“右边的”key列也有重复值。

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display("df1",df1,"df5",df5)
'df1'
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
'df5'
group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization
display("pd.merge(df1,df5)", pd.merge(df1,df5))
'pd.merge(df1,df5)'
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization

But in practice, datasets are rarely as clean as the one we’re working with here.

2.2 合并的键

pd.merge() looks for one or more matching column names between the two inputs, and uses this as the key.

该函数自动寻找两个输入间相同的列名,并将其作为key列。但有时候,列名的匹配并不是那么完美,例如:有多个相互匹配的key列,理想的两个key列名称不同等。

需要使用merge的更多参数进行设置。

1) on

Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names.

可以使用on参数,指定特定的列作为key。

display('df1', df1, 'df2', df2)
display("pd.merge(df1, df2, on='employee')", pd.merge(df1, df2, on='employee'))
'df1'
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
'df2'
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
"pd.merge(df1, df2, on='employee')"
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

2) left_on & right_on

At times you may wish to merge two datasets with different column names. In this case, we can use the left_on and right_on keywords to specify the two column names.

两个key列的名称不同,使用left_onright_on指明列名。

df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
result = pd.merge(df1, df3, left_on='employee', right_on='name')
display('df1', df1, 'df3', df3,
        "pd.merge(df1, df3, left_on='employee', right_on='name')", result)
'df1'
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
'df3'
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
"pd.merge(df1, df3, left_on='employee', right_on='name')"
employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000

3) left_index & right_index

Sometimes you would instead like to merge on an index.

有时候会直接使用索引进行合并。

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', df1a, 'df2a', df2a)
'df1a'
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
'df2a'
hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
display(pd.merge(df1a, df2a, left_index=True, right_index=True))
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
# 或者直接调用dataframe对象的join方法,默认按照索引进行合并
df1a.join(df2a)
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014

If you’d like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior. 可以混合使用索引和列名。

pd.merge(df1a, df3, left_index=True, right_on='name')
group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000

2.3 Specifying Set Arithmetic for Joins: 合并规则

The type of set arithmetic used in the join: this comes up when a value appears in one key column but not the other. 一个值在一侧的key列中存在,但却不存在于另一侧的key列中。

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6',df6)
display('df7',df7)
'df6'
name food
0 Peter fish
1 Paul beans
2 Mary bread
'df7'
name drink
0 Mary wine
1 Joseph beer
pd.merge(df6, df7)
name food drink
0 Mary bread wine

1) inner join: 取交集

上述合并的例子中,df6和df7以各自的name列为key,但是它们的key列中只有一个元素是相同的。

By default, the result contains the intersection of the two sets of inputs, which is known as an inner join.

默认,取交集,即inner join

We can specify this explicitly using the how keyword, which defaults to “inner”.

设置how参数处理这种情况,默认是inner

pd.merge(df6, df7, how='inner')
name food drink
0 Mary bread wine

2) outer join: 取并集

An outer join returns a join over the union of the input columns, and fills in all missing values with NAs.

pd.merge(df6, df7, how='outer')
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer

3) left (outer) join

The left join return joins over the left entries: The output rows now correspond to the entries in the left input.

pd.merge(df6, df7, how='left')
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
pd.merge(df7, df6, how='left')
name drink food
0 Mary wine bread
1 Joseph beer NaN

4) right (outer) join

pd.merge(df6, df7, how='right')
name food drink
0 Mary bread wine
1 Joseph NaN beer

2.4 Overlapping Column Names: suffixes参数

Maybe your two input DataFrames have conflicting column names.

两个数据有重复的列名(除了key列外)。

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', df8)
display('df9', df9)
'df8'
name rank
0 Bob 1
1 Jake 2
2 Lisa 3
3 Sue 4
'df9'
name rank
0 Bob 3
1 Jake 1
2 Lisa 4
3 Sue 2
pd.merge(df8, df9, on='name')
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2

The merge function automatically appends a suffix _x or _y to make the output columns unique.

If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword:

# 自定义后缀名
pd.merge(df8, df9, on='name', suffixes=['_1','_2'])
name rank_1 rank_2
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2