4 minute read

In this post we will discuss how to split a dataframe string into multiple columns and also split string with single and multiple delimiters, The most useful pandas API’s for splitting the string with delimiters into multiple columns are as follows:

  • Pandas.series.str.split()
  • Pandas.series.str.extract()
  • Pandas.series.str.partition()

We could use string or regular expression in the first two API’s to split the string and return a Series or Index unless it’s explicitly set expand=True, whereas for the str.partition() api we could use to split the string into two equal parts

Split string into columns with a single delimiter

Let’s start with splitting the string in the column by delimiter Comma(,)

df=pd.DataFrame({
                   'storage_area':
                  [
                   '123,Green Street, Cincinnati', 
                   '9854,Oak Street', 
                   '3129,Main Street, Jackson', 
                   '4213,Bridge Ave, Trenton',
                   '09865,Rose valley, Brunswick'
                  ]
                })
df

We have a data frame with one (string) column and We’d like to split it into three (string) columns, with first column header as ‘HouseNo', second column header as ‘Street' and the other as ‘City'

  storage_area
0 123,Green Street, Cincinnati
1 9854,Oak Street
2 3129,Main Street, Jackson
3 4213,Bridge Ave, Trenton
4 09865,Rose valley, Brunswick

First we will split the column into a list of strings, For a simple split over a known separator (like, splitting by dashes, or splitting by whitespace), the .str.split() method is enough. It operates on a column (Series) of strings, and returns a column (Series) of lists

df.storage_area.str.split(',',expand=False)
   
0 [123, Green Street, Cincinnati]
1 [9854, Oak Street,]
2 [3129, Main Street, Jackson]
3 [4213, Bridge Ave, Trenton]
4 [09865, Rose valley, Brunswick]

Getting a DataFrame out of splitting a column of strings is so useful that the str.split() method can do it for you with the expand=True parameter, It handles nicely by placing None in the columns for which there aren’t enough “splits”:

df.val.str.split(',',expand=True).add_prefix('Address_')
  Address_0 Address_1 Address_2
0 123 Green Street Cincinnati
1 9854 Oak Street None
2 3129 Main Street Jackson
3 4213 Bridge Ave Trenton
4 09865 Rose Valley Brunswick

String split multiple delimiters

We want to split a string by either a ‘;’ or ‘, ‘ That is, it has to be either a semicolon or a comma.

df=pd.DataFrame({
                 'storage_area':
                [
                 '123;Green Street, Cincinnati', 
                 '9854;Oak Street, Clifton', 
                 '3129;Main Street, Jackson', 
                 '4213;Bridge Ave, Trenton',
                 '09865;Rose valley, Brunswick'
                ]
                })

Let’s create a dataframe with string column and components of address separated wither by semicolon(;) or a comma(,). The semicolon or comma could have a leading or trailing whitespaces.

  storage_area
0 123;Green Street, Cincinnati
1 9854;Oak Street, Clifton
2 3129;Main Street, Jackson
3 4213;Bridge Ave, Trenton
4 09865;Rose valley, Brunswick

We can use str split with Regex OR condition to match multiple delimiters as shown below and set expand to True to expand the split strings into separate column.

We’ve added a prefix for each column

df.val.str.split(',|;',expand=True).add_prefix('address_')
  Address_0 Address_1 Address_2
0 123 Green Street Cincinnati
1 9854 Oak Street Clifton
2 3129 Main Street Jackson
3 4213 Bridge Ave Trenton
4 09865 Rose Valley Brunswick

Using Regex to split the string into columns

Here is a unique case where we would like to split on white space such that each string split into these three columns HouseNo, Street and City respectively

df=pd.DataFrame({
                 'storage_area':
                [
                 '123 Green Street Cincinnati', 
                 '9854 Oak Street Clifton', 
                 '3129 Main Street Jackson', 
                 '4213 Bridge Ave Trenton',
                 '09865 Rose valley Brunswick'
                ]
                })
df

The strings in the column are just separated by the whitespaces, we could not use the split and pass whitespace as separator because that would split into all the whitespaces, whereas we just want to split this into three columns. So what’s the solution?

  storage_area
0 123 Green Street Cincinnati
1 9854 Oak Street Clifton
2 3129 Main Street Jackson
3 4213 Bridge Ave Trenton
4 09865 Rose valley Brunswick

In such cases we could use the Regex

df.storage_area.str.
extract('(?P<HouseNo>\d{1,5}) \
         ((?P<Street>.*\s)(?P<City>[a-zA-Z ]*$))')

We’ve used Named groups that will become column names in the result.

Let’s understand this long Regex:

  1. (?P\<HouseNo>\d{1,5}) matches the any digit whose length is between 1 to 5
  2. ((?P\<Street>.\*\s) Matches anything else(.*) until a whitespace in between and names this Street
  3. (?P\<City>[a-zA-Z ]\*$) Matches any number (*) of lower and upper letters or spaces and names this City
  HouseNo 1 Street City
0 123 Green Street Cincinnati Green Street Cincinnati
1 9854 Oak Street Clifton Oak Street Clifton
2 3129 Main Street Jackson Main Street Jackson
3 4213 Bridge Ave Trenton Bridge Ave Trenton
4 09865 Rose valley Brunswick Rose valley Brunswick

Split string into two columns

Pandas str.partition() function is useful when we want to split the string at the first occurrence of separator.

Excerpt from the documentation:

This method splits the string at the first occurrence of sep, and returns 3 elements containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return 3 elements containing the string itself, followed by two empty strings.

df=pd.DataFrame({
                  'storage_area':
                 [
                  '1234 Green Street,Cincinnati', 
                  '9854 Oak Street,Clifton', 
                  '3129 Main Street,Jackson', 
                  '4213 Bridge Ave,Trenton',
                  '0986 Rose valley,Brunswick'
                 ]
                })
df
  storage_area
0 1234 Green Street,Cincinnati
1 9854 Oak Street,Clifton
2 3129 Main Street,Jackson
3 4213 Bridge Ave,Trenton
4 0986 Rose valley,Brunswick

We will split the strings on whitespace and rename the two columns

df['storage_area'].
str.partition(' ')[[0, 2]].
rename({0: 'HouseNo', 2: 'Address'}, axis=1)
  HouseNo Address
0 123 Green Street,Cincinnati
1 9854 Oak Street,Clifton
2 3129 Main Street,Jackson
3 4213 Bridge Ave,Trenton
4 09865 Rose valley,Brunswick

Tags: ,

Categories: ,

Updated: