Analysis of London spending with Python

In this post I want to discuss how you can use python to fetch data from the internet, put them in a readable format and gain some interesting insights.

This exercise is motivated by “Using SQL for Lightweight Data Analysis” by Rufus Pollock. Here, I extend Rufus’ analysis to a larger dataset and I use different analysis tools.

The data

The data come from the “London GLA spending” website, where GLA stands for Greater London Authority. Every month the GLA publishes its spending on Housing Services, Developing, Communities & Intelligence, etc. At the time of writing, the GLA webpage contains 38 csv files with inconsistent formatting, empty columns and irregularly spaced data. To complicate things, the GLA website keeps changing its address and html design, so I cannot guarantee that the code described below will work in the future.

The webpage looks like this:


where we are interested in the content of the CSV file column.

The strategy is straightforward:

  • scrape the html code of the GLA webpage;
  • extract the links to the .csv files;
  • download all the data and append the results to a pandas dataframe;
  • clean the data;
  • have some fun with the data.

The code

We need some standard libraries like numpy and matplotlib, plus some less standard ones. The most important is BeautifulSoup, which lets us parse and navigate the html code behind web pages. If you do not have BeautifulSoup installed, you can get it via pip install beautifulsoup4 (the package that provides the bs4 module imported below). The html5lib parser used below is a separate package: pip install html5lib.

from __future__ import print_function  # must come before all other imports

import re
import urllib2  # Python 2 only

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from bs4 import BeautifulSoup

%matplotlib inline

The html code from the GLA webpage can be extracted with:

wpage= ''

req = urllib2.Request(wpage)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, 'html5lib')

The csv files that we need are contained in <td> tags, which are nested inside <table> tags. Some <td> tags contain the direct link to the csv file, while other <td> tags contain a link to another webpage which contains the csv file. This sounds confusing, but it is easy to handle in python:

tables = soup.find_all('table')  # Find all tables
thelist = []

for t in tables:
    if len(t.find_all('th')) > 0:  # Only select tables with csv files
        for a in t.find_all('a', href=True):  # Find all hyperlinks in the table
            thelink = 'https:' + a['href']
            if len(thelink) < 40:
                # Short link: it points to another webpage containing the csv file,
                # so scrape that page and extract the first csv link from it
                req = urllib2.Request(thelink)
                subpage = urllib2.urlopen(req)
                subsoup = BeautifulSoup(subpage, 'html5lib')
                aa = subsoup.find_all(href=re.compile(r'\.csv'))[0]
                thelink = aa['href']
            # In both cases thelink now points to a csv file
            thelist.append(thelink)

thelist is a list which contains all the direct links to the csv files. Note that we have not downloaded the data yet.

print('the list contains', len(thelist), 'csv files')

# the list contains 38 csv files

# [u'',
#  u'',
#  u'',
#  u'',
#  u'']
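
The link extraction can be tried offline on a small snippet before pointing it at the live site. The sketch below is only an illustration under assumptions: the HTML, the example.org addresses and the names base_url, direct_links and indirect_links are made up, and it is written for Python 3. It also classifies links by their .csv extension (rather than by length, as above) and resolves them with urljoin from the standard library.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the GLA page (not the real markup)
html = """
<table>
  <tr><th>Period</th><th>CSV file</th></tr>
  <tr><td>P1</td><td><a href="//example.org/files/p1.csv">csv</a></td></tr>
  <tr><td>P2</td><td><a href="//example.org/p2-page">page</a></td></tr>
</table>
<table>
  <tr><td><a href="//example.org/other">table without header</a></td></tr>
</table>
"""

base_url = 'https://example.org/spending'  # hypothetical page address
soup = BeautifulSoup(html, 'html.parser')

direct_links, indirect_links = [], []
for table in soup.find_all('table'):
    if not table.find_all('th'):
        continue  # skip tables without a header row
    for a in table.find_all('a', href=True):
        url = urljoin(base_url, a['href'])  # resolves '//host/path' links
        if re.search(r'\.csv$', url):
            direct_links.append(url)      # already a csv file
        else:
            indirect_links.append(url)    # page that must be scraped in turn
```

Links that end up in indirect_links would still need a second request, as in the loop above.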

We can now download the data. Instead of saving every csv file to disk, we can use pandas' ability to read csv files straight from the internet. Before we do that, let’s see what the head of a csv file looks like in Excel:


We want to discard the information stored in the first few lines. The actual data start at the row holding the column names: Vendor ID, Vendor Name, Cost Element, etc.

Feature engineering

The following lines of code loop through the csv files stored in thelist, discard the data in the file head, and append the results to the dataframe df:

df = pd.DataFrame()
for i, thefile in enumerate(thelist):
    print(i, end="   ")  # Progress counter
    tmp = pd.read_csv(thefile, header=None)

    # Drop columns with fewer than 10 non-missing values
    tmp.dropna(inplace=True, axis=1, thresh=10)

    # Find the row with the column names
    ix = np.where(tmp.values == 'Vendor ID')[0][0]
    column_names = tmp.loc[ix]

    # Remove summary from file header
    tmp = tmp[(ix + 1):].copy()
    tmp.columns = column_names

    # Drop rows with all missing values, then rows with any missing value
    tmp.dropna(inplace=True, how='all', axis=0)
    tmp.dropna(inplace=True, axis=0)

    # Append results to dataframe
    df = pd.concat([df, tmp], ignore_index=True)
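
The header-detection step is easier to follow on a tiny in-memory file. This is a minimal sketch, assuming Python 3 and a made-up csv (the summary lines and the amounts are invented for illustration, and header_row and data are hypothetical names); it only shows how the row containing 'Vendor ID' is located and promoted to column names.

```python
import io

import numpy as np
import pandas as pd

# Made-up csv: three summary lines, then the real header and two data rows
raw = """Summary report,,
Period,11,
,,
Vendor ID,Vendor Name,Amount
10016524,TRANSPORT FOR LONDON,"1,000.00"
NC,DCLG,"(250.00)"
"""

tmp = pd.read_csv(io.StringIO(raw), header=None)

# Row position of the cell that holds the first column name
header_row = np.where(tmp.values == 'Vendor ID')[0][0]

# Keep only the rows below the header and use the header row as column names
data = tmp.iloc[header_row + 1:].copy()
data.columns = tmp.iloc[header_row].tolist()
data = data.reset_index(drop=True)
```
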

Here is what the data look like:

print('The dataframe has', len(df), 'rows')
nan | nan | Amount | Clearing Date | Cost Element | Directorate | Document No | Expenditure Account Code Description | Service Expenditure Analysis | Vendor ID | Vendor Name
NaN | NaN | 66,253,087.00 | 24 Feb 2016 | 544071 | RESOURCES | CHAPS649 | FUNCTIONAL BODY GRANT PAYMENT | Highways and transport services | 10016524 | TRANSPORT FOR LONDON
NaN | NaN | 20,945,312.00 | 15 Feb 2016 | 544093 | RESOURCES | CHAPS627 | NLE – GRANT PMT TO TFL | Highways and transport services | 10016524 | TRANSPORT FOR LONDON
NaN | NaN | 17,926,156.00 | 22 Feb 2016 | 544073 | RESOURCES | CHAPS643 | BUSINESS RATE RETENTION-CLG | Highways and transport services | NC | DCLG

Note that there are still some empty columns. The Amount column is actually a string that we need to convert to float. ‘Clearing Date’ is also a string, which we want to convert to datetime format. Some columns mix upper-case and lower-case strings, which is a problem when we use the groupby function, because differently cased spellings of the same name end up in separate groups. It is a good idea to make all strings upper case. Unfortunately, Directorate contains missing values for some rows and there is nothing we can do about it. Lastly, we can drop Cost Element, Vendor ID and Document No because they are irrelevant for our analysis:

df = df[df.columns[2:]] # Remove the first two empty columns

# Amount
def clean_par(text):
    # Accounting notation: '(123)' means a negative amount
    if '(' in text:
        output = '-' + re.sub('[()]', '', text)
    else:
        output = text
    return output

df['Amount'] = df['Amount'].map(lambda x: clean_par(x)) # transform '(123)' to -123
df['Amount'] = df['Amount'].map(lambda x: x.replace(',','')).astype(float) # Remove ','

# Transform to time series
df['Clearing Date'] = df['Clearing Date'].map(lambda x: pd.Timestamp(x))

# Make upper case
df['Expenditure Account Code Description'] = df['Expenditure Account Code Description'].map(lambda x: x.upper())

mask = ~df['Directorate'].isnull() # consider only non-null values
df.loc[mask,'Directorate'] = df.loc[mask,'Directorate'].map(lambda x: x.upper()) # Make upper case
df.loc[mask,'Directorate'] = df.loc[mask,'Directorate'].map(lambda x: x.replace('&','AND'))
df.loc[mask,'Directorate'] = df.loc[mask,'Directorate'].map(lambda x: x.rstrip()) # Strip white spaces
df.loc[mask,'Directorate'] = df.loc[mask,'Directorate'].map(lambda x: x.lstrip()) # Strip white spaces
df.loc[mask,'Directorate'] = df.loc[mask,'Directorate'].map(lambda x: x.replace('COMMUNIITES AND INTELLIGENCE','COMMUNITIES AND INTELLIGENCE'))

df.drop_duplicates(inplace=True) # Drop any duplicates
df.drop(['Cost Element','Vendor ID','Document No'], axis=1, inplace=True)
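
The same cleaning can be written more compactly. The sketch below is one possible variant, not the code used for the results in this post: the toy dataframe and the helper name parse_amount are made up, and it assumes a pandas version whose str.replace accepts regex=False. The .str accessor leaves missing values untouched, so no mask is needed.

```python
import numpy as np
import pandas as pd

# Toy data imitating the raw columns (values chosen for illustration)
toy = pd.DataFrame({
    'Amount': ['66,253,087.00', '(123.45)', '1,000'],
    'Directorate': [' resources ', np.nan, 'Communiites & Intelligence'],
})

def parse_amount(text):
    """Convert '1,234.50' to 1234.5 and '(1,234.50)' to -1234.5."""
    text = text.strip().replace(',', '')
    if text.startswith('(') and text.endswith(')'):
        return -float(text[1:-1])
    return float(text)

toy['Amount'] = toy['Amount'].map(parse_amount)

# Upper case, replace '&', strip spaces, fix the known misspelling
toy['Directorate'] = (toy['Directorate']
                      .str.upper()
                      .str.replace('&', 'AND', regex=False)
                      .str.strip()
                      .str.replace('COMMUNIITES', 'COMMUNITIES', regex=False))
```
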


With a clean dataset we can finally gain some insights. We can plot expenditures as a function of time:

df.sort_values('Clearing Date').plot(x = 'Clearing Date', y = 'Amount',legend=None,
    grid=True,lw=1.2, figsize=(10,6))


which shows that expenditures rose sharply after August 2015. For the record, 1e8 means 10^8 = 100,000,000 = 100 million pounds.

The largest payments all went to Transport for London:

Amount | Clearing Date | Directorate | Expenditure Account Code Description | Service Expenditure Analysis | Vendor Name
118379401.6 | 2015-09-17 | RESOURCES | FUNCTIONAL BODY GRANT PAYMENT | Highways and transport services | TRANSPORT FOR LONDON
82079402.0 | 2015-07-28 | RESOURCES | FUNCTIONAL BODY GRANT PAYMENT | Highways and transport services | TRANSPORT FOR LONDON
71226538.0 | 2016-01-21 | RESOURCES | FUNCTIONAL BODY GRANT PAYMENT | Highways and transport services | TRANSPORT FOR LONDON
66253087.0 | 2016-02-24 | RESOURCES | FUNCTIONAL BODY GRANT PAYMENT | Highways and transport services | TRANSPORT FOR LONDON
45200000.0 | 2016-01-18 | RESOURCES | FUNCTIONAL BODY GRANT PAYMENT | Highways and transport services | TRANSPORT FOR LONDON

Which Expenditure Account Code Description received the most money from the GLA (top 5 rows only)?

grouped = df.groupby('Expenditure Account Code Description')
Expenditure Account Code Description Amount

Which Vendor Name received the most money from the GLA (top 5 rows only)?

grouped = df.groupby('Vendor Name')
Vendor Name Amount
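
The aggregation behind both rankings follows the usual group, sum, sort pattern. Here is a minimal sketch on toy data; the vendor names and amounts are made up, and the exact calls used in the original notebook may differ.

```python
import pandas as pd

# Toy payments (made-up vendors and amounts)
toy = pd.DataFrame({
    'Vendor Name': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Amount': [10.0, 5.0, 20.0, 1.0, 7.0, 3.0],
})

# Total amount per vendor, largest first, keeping only the top rows
top = (toy.groupby('Vendor Name')['Amount']
          .sum()
          .sort_values(ascending=False)
          .head(2))
```

Replacing 'Vendor Name' with 'Expenditure Account Code Description' and head(2) with head(5) gives the other ranking.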

Let’s make a pie chart showing the total expenditure from 2013 to 2016 grouped by Directorate (only the top 5 are labelled):

# Filter out rows with missing 'Directorate' and group the result
grouped = df[~df['Directorate'].isnull()].groupby('Directorate')

total = grouped.sum().sort_values('Amount',ascending=False) # Sort by 'amount'

# Reformat the 'Amount' column
float_formatter = lambda x: "%.2f" % x
format_amount = (total.Amount.values/10**6)
format_amount = [ '{:0.3f}'.format(x) for x in format_amount ]
labels = total.index.values + ' ( ' + format_amount + ' m' + u'\u00A3)'

# Show only first 5 labels
labels[5:] = ''

cmap =
colors = cmap(np.linspace(0., 2., len(total)))
explode = tuple(np.linspace(0,0.8,len(total)))

patches, texts = plt.pie(total['Amount'], radius=0.9, startangle=0,
                         colors=colors, explode=explode, labels=labels)
_ = [texts[i] for i in range(0,len(total))]

plt.title('GLA Total Spendings (2013-2016)', x=1.3,y=0.8, fontsize=15)
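
The label strings can also be built with plain Python, which avoids relying on element-wise string addition between numpy arrays and lists. This is only a sketch with made-up names and amounts; n_labelled is a hypothetical parameter for how many slices get a label.

```python
# Toy totals in pounds, already sorted in descending order (made-up values)
names = ['A', 'B', 'C', 'D', 'E', 'F']
amounts = [1500000.0, 900000.0, 400000.0, 250000.0, 100000.0, 50000.0]

n_labelled = 5  # label only the largest slices

# 'NAME (x.xxx m£)' for the first n_labelled slices, empty string otherwise
labels = [
    u'{} ({:0.3f} m\u00A3)'.format(name, amount / 10**6) if i < n_labelled else ''
    for i, (name, amount) in enumerate(zip(names, amounts))
]
```
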


Amount spent by each directorate, grouped by year. Note the log scale on the x-axis:

df['year'] = df['Clearing Date'].dt.year  # Extract the year from each date
grouped = df.groupby(['Directorate','year']).agg({'Amount': 'sum'})
df2 = grouped.unstack()
df2.plot.barh(logx=True,figsize=(7, 7), stacked=False, legend=False)

labels = df2.columns.get_level_values(1).values
plt.legend(title='YEAR', loc=9, bbox_to_anchor=(1.2,1), labels=labels)

plt.xlabel(u'log (Amount) in \u00A3  ')


I do not know whether DEVELOPMENT AND ENVIRONMENT is the same as DEVELOPMENT, ENTERPRISE AND ENVIRONMENT, or whether HOUSING AND REGENERATION is a different name for HOUSING AND LAND. If so, their total Amount should be summed together. At the time of writing, we are only halfway through 2016, which explains why the expenditure in 2016 is generally lower than in 2015.
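
If two directorate names did turn out to refer to the same unit, merging them before grouping would be straightforward. The sketch below shows the idea on toy numbers; the aliases mapping is purely an assumption for illustration, since I have not verified that these names are equivalent.

```python
import pandas as pd

# Toy yearly payments (made-up amounts)
toy = pd.DataFrame({
    'Directorate': ['HOUSING AND LAND', 'HOUSING AND REGENERATION',
                    'RESOURCES', 'HOUSING AND LAND'],
    'year': [2015, 2015, 2015, 2016],
    'Amount': [100.0, 50.0, 30.0, 20.0],
})

# Assumption for illustration only: treat the two housing names as one
aliases = {'HOUSING AND REGENERATION': 'HOUSING AND LAND'}
toy['Directorate'] = toy['Directorate'].replace(aliases)

# Rows are directorates, columns are years, cells are summed amounts
totals = toy.groupby(['Directorate', 'year'])['Amount'].sum().unstack()
```
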

Wrapping up

BeautifulSoup is a simple and powerful python library to extract information from web pages (html and xml files). I have shown how to use BeautifulSoup to extract .csv files from the Greater London Authority web page and read their content using pandas, which is by far the best library for data analysis with python.

The GLA spent almost 1 billion pounds (921,503,900 pounds to be exact) on Transport for London (good, very good…) from April 2013 to March 2016, followed by housing services and the various London boroughs. About 40% of the GLA budget goes to housing-related services.

You can find the original notebook here in my GitHub account.