
In this post we will learn how to concatenate two or more string columns of a dataframe. You can use either the + operator or the str.cat() function to combine two or more column values.

We will also see how to combine columns based on conditions, and how to combine columns that contain null or NaN values.

Concatenate multiple column values into one column

Let’s create a dataframe with two text columns author and book_name

import numpy as np
import pandas as pd


data = {'book_id': [1, 2, 3, 4, 5],
         'author': ['J.R.R. Tolkien', 'J.D. Salinger', np.nan, ' F. Scott Fitzgerald', np.nan],
         'book_name': ['The Lord of the Rings', 'The Catcher in the Rye', 'To Kill a Mockingbird', np.nan, np.nan]
        }

df = pd.DataFrame(data)

df

This is what our dataframe looks like:

  book_id author book_name
0 1 J.R.R. Tolkien The Lord of the Rings
1 2 J.D. Salinger The Catcher in the Rye
2 3 NaN To Kill a Mockingbird
3 4 F. Scott Fitzgerald NaN
4 5 NaN NaN

Let’s combine the author and book_name columns of the above dataframe.

There are two ways to combine two or more text columns into a single column.

1. + operator (plus operator)

df['book_details'] = df['author'] + df['book_name']

Note: + operator performs addition in case of two number values.
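To make the note above concrete, here is a minimal sketch using a hypothetical two-column numeric frame (the names nums, a, and b are made up for this illustration). It shows that + adds numbers, and that casting to str first turns the same operator into concatenation.

```python
import pandas as pd

# Hypothetical numeric frame, used only for this illustration.
nums = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# On numeric columns, + performs arithmetic addition.
added = nums["a"] + nums["b"]

# Casting both sides to str first makes + concatenate instead.
joined = nums["a"].astype(str) + nums["b"].astype(str)

print(added.tolist())   # [11, 22]
print(joined.tolist())  # ['110', '220']
```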

2. pandas.Series.str.cat() function:

df["book_details"] = df["author"].str.cat( df["book_name"] )

Parameters of str.cat()

others: Series, Index, DataFrame, np.ndarray or list-like. When others is None, the method returns the concatenation of all strings in the calling series/index.

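sep: the separator placed between the values; the default is ‘’, i.e. no separator

na_rep: the string used in place of null values; the default is None, in which case a row with a null value in any of the combined columns produces a null result

join: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’
join accepts one of four values. It determines how the index of the calling Series/Index is aligned with any Series, Index, or DataFrame passed in others.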
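The parameters above are easier to follow on a tiny example. The sketch below uses two made-up Series, s and t, whose indexes only partly overlap, so the effect of others=None and of each join style is visible. It is an illustration under those assumptions, not part of the book dataframe.

```python
import pandas as pd

# Two small Series with partly overlapping index labels (illustrative only).
s = pd.Series(["a", "b", "c"], index=[0, 1, 2])
t = pd.Series(["x", "y", "z"], index=[1, 2, 3])

# others=None: the whole Series collapses into a single string.
collapsed = s.str.cat(sep=",")

# join='left' (the default): keep the caller's index; label 0 has no match in t.
left = s.str.cat(t, join="left")

# join='inner': keep only the labels present in both Series.
inner = s.str.cat(t, join="inner")

# join='outer': keep every label; na_rep fills the gaps on either side.
outer = s.str.cat(t, join="outer", na_rep="-")

print(collapsed)       # a,b,c
print(inner.tolist())  # ['bx', 'cy']
print(outer.tolist())  # ['a-', 'bx', 'cy', '-z']
```

With join='left' the first row comes back as NaN, because label 0 has no partner in t and no na_rep was given.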

Note: The + operator is convenient when you are working with a large dataset, whereas Series.str.cat() is typically used when the dataset is relatively small or when you need its sep, na_rep, or join options.

Both snippets above give the same output.

Here’s how the dataframe looks after concatenating the two text columns:

  book_id author book_name book_details
0 1 J.R.R. Tolkien The Lord of the Rings J.R.R. TolkienThe Lord of the Rings
1 2 J.D. Salinger The Catcher in the Rye J.D. SalingerThe Catcher in the Rye
2 3 NaN To Kill a Mockingbird NaN
3 4 F. Scott Fitzgerald NaN NaN
4 5 NaN NaN NaN

The default separator in both cases is ‘’, i.e. nothing is inserted between the values. We will learn more about separators later in this post.

Combine non-string columns

In this section we will learn how to combine string and non-string values in pandas using the astype(str) and map(str) methods.

.astype(str) converts the column values into strings, and map(str) does the same.

Null/NaN values are treated as floats, so to combine columns that contain nulls we first have to convert them to strings. Each NaN then becomes the literal text nan, as the output in this section shows.

.astype(str)

.astype() casts a pandas object to a specified datatype (dtype); .astype(str) casts every value to a string.

df['book_details'] = df['book_id'].astype(str) + '-' + df['author'].astype(str) + '-' + df['book_name'].astype(str)

.map(str)

.map() substitutes each value in a series with another value, which may be derived from a function, a dict, or a series; .map(str) applies Python’s str() to every value.

df['book_details'] = df['book_id'].map(str) + '-' + df['author'].map(str) + '-' + df['book_name'].map(str)

Both astype(str) and map(str) give the same output in our case.

book_details
1-J.R.R. Tolkien-The Lord of the Rings
2-J.D. Salinger-The Catcher in the Rye
3-nan-To Kill a Mockingbird
4- F. Scott Fitzgerald-nan
5-nan-nan

join()

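join() is Python’s string method for joining an iterable of strings with a separator, e.g. '-'.join(...). Here it is paired with pandas functions such as agg() and apply(), which call it once per row.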

.agg()

.agg() applies a function along an axis of the dataframe. With axis=1, it passes the values of each row to '-'.join, which combines them into a single string.

df['book_details'] = df[['author', 'book_name']].fillna('None').agg('-'.join, axis=1)

axis: 0 or ‘index’ (the default) applies the function to each column; 1 or ‘columns’ applies it to each row

.apply() and .join()

apply() applies a function along a given axis of the dataframe. With axis=1, the function receives one row at a time, so it can combine two or more columns into a single column.

df['book_details'] = df[['author', 'book_name']].fillna('None').apply("-".join, axis=1)

.apply() with lambda

apply() with a lambda achieves the same result and works with any column slice of your dataframe.

df['book_details'] = df[['author', 'book_name']].fillna('None').apply(lambda x: "-".join(x), axis=1)

All three approaches give the same output:

book_details
J.R.R. Tolkien-The Lord of the Rings
J.D. Salinger-The Catcher in the Rye
None-To Kill a Mockingbird
F. Scott Fitzgerald-None
None-None
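The snippets above call fillna('None') first because str.join raises a TypeError when it meets a NaN (a float). If you would rather skip missing values than print a placeholder, one possible variation is to drop them inside the lambda. This is a sketch on a small stand-in frame named books, not the original dataframe.

```python
import numpy as np
import pandas as pd

# Small stand-in frame mirroring the author/book_name columns.
books = pd.DataFrame({
    "author": ["J.R.R. Tolkien", np.nan, "F. Scott Fitzgerald", np.nan],
    "book_name": ["The Lord of the Rings", "To Kill a Mockingbird", np.nan, np.nan],
})

# Drop the missing cells in each row, then join whatever is left.
details = books[["author", "book_name"]].apply(
    lambda row: "-".join(row.dropna()), axis=1
)

print(details.tolist())
# ['J.R.R. Tolkien-The Lord of the Rings', 'To Kill a Mockingbird',
#  'F. Scott Fitzgerald', '']
```

Rows where every column is missing come back as an empty string rather than NaN.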

Combine columns with conditions

Let’s combine the columns based on a condition: we want to combine the columns only in rows where neither column has a NaN value.

We are using np.where, which returns the second argument where the condition is True and the third argument everywhere else.

df['book_details'] = np.where(
    df['author'].notnull() & df['book_name'].notnull(),      # condition: both values present
    df['author'].astype(str) + df['book_name'].astype(str),  # value when the condition is True
    np.nan                                                   # value otherwise
)

We’ve combined only the rows where both the author and book_name fields have values; otherwise book_details is NaN.

book_details
J.R.R. TolkienThe Lord of the Rings
J.D. SalingerThe Catcher in the Rye
NaN
NaN
NaN
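np.where becomes more useful when the fallback is something other than NaN. The sketch below assumes a made-up fallback label, 'incomplete', and a small stand-in frame named books. It only illustrates the pattern.

```python
import numpy as np
import pandas as pd

# Small stand-in frame mirroring the author/book_name columns.
books = pd.DataFrame({
    "author": ["J.R.R. Tolkien", np.nan, "F. Scott Fitzgerald", np.nan],
    "book_name": ["The Lord of the Rings", "To Kill a Mockingbird", np.nan, np.nan],
})

# True only where both columns hold a value.
both_present = books["author"].notna() & books["book_name"].notna()

# Combine where the condition holds; otherwise use a hypothetical fallback label.
books["book_details"] = np.where(
    both_present,
    books["author"] + ": " + books["book_name"],
    "incomplete",
)

print(books["book_details"].tolist())
# ['J.R.R. Tolkien: The Lord of the Rings', 'incomplete', 'incomplete', 'incomplete']
```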

Combine columns with nulls/nans

In this section, we will learn ways of combining string columns that contain null values.

One way is to use the na_rep parameter of str.cat(). The other is to use fillna().

Our dataframe has some missing values in the second and third columns (author and book_name).

We will cover both cases: a value combined with a null, and a null combined with a null.

Normally, when string columns are combined and either value in a row is null, the result for that row is NaN.

na_rep parameter

One way to replace null values is to use the na_rep parameter of the str.cat() function.

We set na_rep to a string; every null value is replaced by that string before the columns are combined.

df["book_details"] = df['author'].str.cat(df['book_name'], sep = '-', na_rep = 'None')

All the NaN values have been replaced by the string ‘None’ and then concatenated.

book_details
J.R.R. Tolkien-The Lord of the Rings
J.D. Salinger-The Catcher in the Rye
None-To Kill a Mockingbird
F. Scott Fitzgerald-None
None-None
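If a visible placeholder such as 'None' is not wanted, one possible variation is to pass an empty string as na_rep and then strip the leftover separator. This is a sketch on a small stand-in frame named books. It assumes that no real value starts or ends with the separator character.

```python
import numpy as np
import pandas as pd

# Small stand-in frame mirroring the author/book_name columns.
books = pd.DataFrame({
    "author": ["J.R.R. Tolkien", np.nan, "F. Scott Fitzgerald", np.nan],
    "book_name": ["The Lord of the Rings", "To Kill a Mockingbird", np.nan, np.nan],
})

# na_rep='' turns nulls into empty strings; str.strip('-') then removes
# the dangling separator left at the start or end of the result.
details = (
    books["author"]
    .str.cat(books["book_name"], sep="-", na_rep="")
    .str.strip("-")
)

print(details.tolist())
# ['J.R.R. Tolkien-The Lord of the Rings', 'To Kill a Mockingbird',
#  'F. Scott Fitzgerald', '']
```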

fillna()

fillna() replaces null values with a specified value, such as a string. It is useful when the data is large and messy.

The parameters of fillna() let you target specific columns or rows, and it offers several ways of choosing the replacement value.

DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)
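As an illustration of targeting specific columns, the value argument also accepts a dict that maps column names to replacement values. The placeholder strings below ('Unknown author', 'Untitled') and the frame name books are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Small stand-in frame mirroring the author/book_name columns.
books = pd.DataFrame({
    "author": ["J.R.R. Tolkien", np.nan, "F. Scott Fitzgerald", np.nan],
    "book_name": ["The Lord of the Rings", "To Kill a Mockingbird", np.nan, np.nan],
})

# A dict passed as value gives each column its own placeholder.
filled = books.fillna({"author": "Unknown author", "book_name": "Untitled"})

details = filled["author"] + ", " + filled["book_name"]

print(details.tolist())
# ['J.R.R. Tolkien, The Lord of the Rings', 'Unknown author, To Kill a Mockingbird',
#  'F. Scott Fitzgerald, Untitled', 'Unknown author, Untitled']
```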

fillna() - string

df["book_details"] = df["author"].fillna('None') + ", " + df["book_name"].fillna('None')
book_details
J.R.R. Tolkien, The Lord of the Rings
J.D. Salinger, The Catcher in the Rye
None, To Kill a Mockingbird
F. Scott Fitzgerald, None
None, None

fillna() - column data

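fillna() also accepts a Series, so missing values in one column can be filled with the values of another column. Here a missing book_name is filled with the author.

df["book_details"] = df["author"].fillna('None') + ", " + df["book_name"].fillna(df['author'])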
book_details
J.R.R. Tolkien, The Lord of the Rings
J.D. Salinger, The Catcher in the Rye
None, To Kill a Mockingbird
F. Scott Fitzgerald, F. Scott Fitzgerald
NaN

fillna() with method

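The method parameter offers two ways of filling values:

pad/ffill: forward fill (carry the last valid value forward) and bfill/backfill: back fill (use the next valid value)

The default is None. Note that the method parameter of fillna() is deprecated since pandas 2.1 in favor of the ffill() and bfill() methods.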

df["book_details"] = df["author"].fillna(method = 'bfill') + ", " + df["book_name"].fillna(method = 'ffill')
book_details
J.R.R. Tolkien, The Lord of the Rings
J.D. Salinger, The Catcher in the Rye
F. Scott Fitzgerald, To Kill a Mockingbird
F. Scott Fitzgerald, To Kill a Mockingbird
NaN
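On recent pandas versions, where fillna(method=...) is deprecated, the same fill can be written with the dedicated bfill() and ffill() methods. The sketch below rebuilds the two text columns in a stand-in frame named books so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same missing-value pattern as the article's data.
books = pd.DataFrame({
    "author": ["J.R.R. Tolkien", "J.D. Salinger", np.nan, "F. Scott Fitzgerald", np.nan],
    "book_name": ["The Lord of the Rings", "The Catcher in the Rye",
                  "To Kill a Mockingbird", np.nan, np.nan],
})

# bfill() takes the next valid value; ffill() carries the last valid value forward.
authors = books["author"].bfill()
titles = books["book_name"].ffill()

details = authors + ", " + titles

print(details.tolist()[:4])
# ['J.R.R. Tolkien, The Lord of the Rings', 'J.D. Salinger, The Catcher in the Rye',
#  'F. Scott Fitzgerald, To Kill a Mockingbird', 'F. Scott Fitzgerald, To Kill a Mockingbird']
```

The last row stays NaN because there is no later author to back-fill from.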

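fillna() with limit

limit caps how many null values are filled. With a method, it is the maximum number of consecutive nulls to fill; with a plain value, it is the maximum number of nulls filled along the whole axis.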

df["book_details"] = df["author"].fillna(method = 'bfill', limit = 1) + ", " + df["book_name"].fillna('None', limit = 1)
book_details
J.R.R. Tolkien, The Lord of the Rings
J.D. Salinger, The Catcher in the Rye
F. Scott Fitzgerald, To Kill a Mockingbird
F. Scott Fitzgerald, None
NaN

Combine columns with separators/delimiters

Separators, also called delimiters, are used to separate the different column values in a row.

They are always passed as strings and enclosed in quotes. By default no separator is used, i.e. the Separator parameter defaults to the empty string ‘’.

The following separators are commonly used:

  • default '' (no separator)
  • space ' '
  • comma ','
  • hyphen '-'
  • pipe '|'
  • colon ':'
  • semi-colon ';'

# using + operator
df['book_details'] = df['author'] + '-' + df['book_name']

# using str.cat()
df['book_details'] = df['author'].str.cat(df['book_name'], sep = '-')

Both snippets above give the same output.

  book_id author book_name book_details
0 1 J.R.R. Tolkien The Lord of the Rings J.R.R. Tolkien-The Lord of the Rings
1 2 J.D. Salinger The Catcher in the Rye J.D. Salinger-The Catcher in the Rye
2 3 NaN To Kill a Mockingbird NaN
3 4 F. Scott Fitzgerald NaN NaN
4 5 NaN NaN NaN

Here are a few examples with other separators:

Separator set to comma ,

# using + operator
df['book_details'] = df['author'] + ', ' + df['book_name']

# using str.cat()
df['book_details'] = df['author'].str.cat(df['book_name'], sep = ', ')
book_details
J.R.R. Tolkien, The Lord of the Rings
J.D. Salinger, The Catcher in the Rye
NaN
NaN
NaN

Separator set to semi-colon ;

# using + operator
df['book_details'] = df['author'] + '; ' + df['book_name']

# using str.cat()
df['book_details'] = df['author'].str.cat(df['book_name'], sep = '; ')
book_details
J.R.R. Tolkien; The Lord of the Rings
J.D. Salinger; The Catcher in the Rye
NaN
NaN
NaN

Separator set to space

# using + operator
df['book_details'] = df['author'] + ' ' + df['book_name']

# using str.cat()
df['book_details'] = df['author'].str.cat(df['book_name'], sep = ' ')
book_details
J.R.R. Tolkien The Lord of the Rings
J.D. Salinger The Catcher in the Rye
NaN
NaN
NaN
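The separator examples above join two columns, but str.cat() can also take a DataFrame as others, which joins several columns with one separator in a single call. The sketch below uses a three-row stand-in frame named books and a made-up '?' placeholder for missing values.

```python
import numpy as np
import pandas as pd

# Three-row stand-in frame with the article's column names.
books = pd.DataFrame({
    "book_id": [1, 2, 3],
    "author": ["J.R.R. Tolkien", "J.D. Salinger", np.nan],
    "book_name": ["The Lord of the Rings", "The Catcher in the Rye",
                  "To Kill a Mockingbird"],
})

# Cast the numeric id to str, then join it with both text columns at once.
details = books["book_id"].astype(str).str.cat(
    books[["author", "book_name"]], sep=" | ", na_rep="?"
)

print(details.tolist())
# ['1 | J.R.R. Tolkien | The Lord of the Rings',
#  '2 | J.D. Salinger | The Catcher in the Rye',
#  '3 | ? | To Kill a Mockingbird']
```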
