Interactive Visualization of Australian Wine Ratings

Source: PbPython.com

Introduction

Over on Kaggle, there is an interesting data set of over 130K wine reviews
that have been scraped and pulled together into a single file. I thought this data
set would be really useful for showing how to build an interactive visualization
using Bokeh. This article will walk through how to build a Bokeh application that has
good examples of many of its features. The app itself is really helpful and
I had a lot of fun exploring this data set using the visuals. Additionally, this
application shows the power of Bokeh and it should give you some ideas as to how
you could use it in your own projects. Let’s get started by exploring the
“rich, smokey flavors with a hint of oak, tea and maple” that are embedded in this data set.

Data Overview

I will not spend much time walking through the data but if you are interested in
learning more about the data, what it contains and how it could be a useful tool
for further building out your skills, please check out the Kaggle page.

For this analysis, I chose to focus on only Australian wines. The decision to filter the data
was somewhat arbitrary but I found that it ended up being a large enough dataset
to make it interesting but not so large that performance was a problem on my middle-of-the-road laptop.

I made some minor cleanups and edits of the data which I won’t go through here but
all the changes are available in this notebook.

Here is a snapshot of the data we will explore in the rest of the article:

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | variety_color |
| 77 | Australia | This medium-bodied Chardonnay features aromas … | Made With Organic Grapes | 86 | 18.0 | South Australia | South Australia | NaN | Joe Czerwinski | @JoeCz | Yalumba 2016 Made With Organic Grapes Chardonn… | Chardonnay | Yalumba | #440154 |
| 83 | Australia | Pale copper in hue, this wine exudes passion f… | Jester Sangiovese | 86 | 20.0 | South Australia | McLaren Vale | NaN | Joe Czerwinski | @JoeCz | Mitolo 2016 Jester Sangiovese Rosé (McLaren Vale) | Rosé | Mitolo | #450558 |
| 123 | Australia | The blend is roughly two-thirds Shiraz and one… | Parson’s Flat | 92 | 40.0 | South Australia | Padthaway | NaN | Joe Czerwinski | @JoeCz | Henry’s Drive Vignerons 2006 Parson’s Flat Shi… | Shiraz-Cabernet Sauvignon | Henry’s Drive Vignerons | #460B5E |
| 191 | Australia | From the little-known region of Padthaway, thi… | The Trial of John Montford | 87 | 30.0 | South Australia | Padthaway | NaN | Joe Czerwinski | @JoeCz | Henry’s Drive Vignerons 2006 The Trial of John… | Cabernet Sauvignon | Henry’s Drive Vignerons | #471163 |
| 232 | Australia | Lifted cedar and pine notes interspersed with … | Red Belly Black | 85 | 12.0 | South Australia | South Australia | NaN | NaN | NaN | Angove’s 2006 Red Belly Black Shiraz (South Au… | Shiraz | Angove’s | #471669 |

For this specific dataset, I approached the problem as an interested consumer, not as
a data scientist trying to build a predictive model. Basically, I want to have a
simple way to explore the data and find wines that might be interesting to purchase.
As a wine consumer, I’m mostly interested in price vs. ratings (aka points). An
interactive scatter plot should be a useful way to explore the data in more detail and
Bokeh is well suited for this kind of application.

To get your palate ready, here’s a small tasting of the app we’ll be building:

Wine Analysis

As a pun, it’s a bit on the dry side but I think it has a strong finish.

Bokeh

From the Bokeh site:

Bokeh is a Python interactive visualization library that targets modern web browsers
for presentation. Its goal is to provide elegant, concise construction of novel graphics
in the style of D3.js, and to extend this capability with high-performance interactivity
over very large or streaming datasets. Bokeh can help anyone who would like to quickly
and easily create interactive plots, dashboards, and data applications.

Bokeh has two methods for creating visualizations. The first approach is
to generate HTML documents that can be used standalone or embedded in a Jupyter
notebook. The process for creating a plot is very similar to what you would do with
matplotlib or some other Python visualization library. The key bonus with Bokeh
is that you get basic interactivity for free.

The second method for creating visualization is to build a Bokeh app that provides
more flexibility and customization options. The downside is that you do need to run
a separate application to serve the data. This works really well for individual
or small group analysis. Deploying to the world at large takes a little more effort.

I based this example on an application I am developing at work to interactively
explore price and volume relationships. I have found that the learning curve is
a little steep with the Bokeh app approach but the results have been fantastic.
The gallery examples are another rich source for understanding Bokeh’s capabilities.
By the end of this article, I hope you feel the same way I do about the possibilities
of using Bokeh for building powerful, complex, interactive visualization tools.

Building the App

If you are using Anaconda, then install bokeh with conda:

conda install bokeh

For this app, I am going to use the single file approach as described here.

The final file is stored in the GitHub repo and I will keep that updated if
people identify changes or improvements in this script. In addition, here is the
processed csv file.

The first step is to import several modules we will need to build the app:

import pandas as pd
from bokeh.plotting import figure
from bokeh.layouts import layout, widgetbox
from bokeh.models import ColumnDataSource, HoverTool, BoxZoomTool, ResetTool, PanTool
from bokeh.models.widgets import Slider, Select, TextInput, Div
from bokeh.models import WheelZoomTool, SaveTool, LassoSelectTool
from bokeh.io import curdoc
from functools import lru_cache

The next step is to create a function to load data from the csv file and return a
pandas DataFrame. I have wrapped this function with the lru_cache() decorator
in order to cache the result. This is not strictly required but is useful to minimize
those extra IO calls for loading the data from disk.

@lru_cache()
def load_data():
    df = pd.read_csv("Aussie_Wines_Plotting.csv", index_col=0)
    return df
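
To see what the cache buys you, here is a minimal standard-library sketch. The load_rows function and the read_count counter are hypothetical stand-ins for the real CSV read. Only lru_cache itself comes from the app.

```python
from functools import lru_cache

# Hypothetical counter standing in for "how many times did we hit the disk?"
read_count = 0


@lru_cache()
def load_rows():
    """Pretend to read a file; the body runs only on the first call."""
    global read_count
    read_count += 1
    return [{"title": "Example Shiraz", "price": 20.0, "points": 88}]


first = load_rows()
second = load_rows()

# Both calls return the very same cached object, so the "file" was read once.
print(read_count, first is second)  # prints: 1 True
```

One thing to keep in mind is that every caller gets the same cached object, so code that modifies it in place would affect later callers. The app only filters the DataFrame, which produces new objects, so that is not a concern here.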

In order to format the details, I am defining the ordering of the columns as
well as the list of all the provinces we may want to filter by. For this example,
I hard coded the list but in other situations you could dynamically build the list
off the data.

# Column order for displaying the details of a specific review
col_order = ["price", "points", "variety", "province", "description"]

all_provinces = [
    "All", "South Australia", "Victoria", "Western Australia",
    "Australia Other", "New South Wales", "Tasmania"
]
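
If you would rather build the province list from the data, something along these lines should work. This is only a sketch: the small DataFrame is made up for illustration, and in the app you would use the result of load_data() instead.

```python
import pandas as pd

# Tiny made-up sample; the real app would use the DataFrame from load_data().
df = pd.DataFrame({
    "province": ["Victoria", "South Australia", "Victoria", None, "Tasmania"],
})

# Drop missing values, de-duplicate, sort, and put the catch-all entry first.
provinces = ["All"] + sorted(df["province"].dropna().unique().tolist())
print(provinces)  # ['All', 'South Australia', 'Tasmania', 'Victoria']
```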

Now that some of the prep work is out of the way, I will get all of the Bokeh widgets
set up. The Select, Slider and TextInput widgets capture input from the user. The Div
widget will be used to display output based on the data being selected.

desc = Div(text="All Provinces", width=800)
province = Select(title="Province", options=all_provinces, value="All")
price_max = Slider(start=0, end=900, step=5, value=200, title="Maximum Price")
title = TextInput(title="Title Contains")
details = Div(text="Selection Details:", width=800)

Here’s what the widgets look like in the final form:

Widgets

The “secret sauce” for Bokeh is the ColumnDataSource. This object stores
the data the rest of the script will visualize. For the initial run through of
the code, I will load it with all the data. In subsequent code, we can update the
source with selected or filtered data.

source = ColumnDataSource(data=load_data())
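
To make the idea a little more concrete, here is a small sketch of what a ColumnDataSource holds once it is built from a DataFrame. The two rows are made up, and only the column names match the app.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Made-up rows with the same column names the app plots.
df = pd.DataFrame({
    "points": [86, 92],
    "price": [18.0, 40.0],
    "variety_color": ["#440154", "#460B5E"],
})

source = ColumnDataSource(data=df)

# Each DataFrame column becomes one entry in source.data, keyed by column name.
# The DataFrame index is carried along as an extra column as well.
print(sorted(source.data.keys()))
print(list(source.data["price"]))
```

Glyphs and hover tools refer to these columns by name, which is why strings such as "price" and "variety_color" are enough when building the plot.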

Every Bokeh plot supports interactive tools. Here’s what the tools look
like for this specific app:

Tool bar

The actual building of the tools is straightforward. You have the option of
defining tools as a list of strings but it is not possible to customize the tools
when you use this approach. In this application, it is useful to define the
hover tool to show the title of the wine as well as its variety. We can use
any column of data that is available to us in our DataFrame and reference it
by prefixing the column name with @.

hover = HoverTool(tooltips=[
    ("title", "@title"),
    ("variety", "@variety"),
])
TOOLS = [
    hover, BoxZoomTool(), LassoSelectTool(), WheelZoomTool(), PanTool(),
    ResetTool(), SaveTool()
]
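
The tooltip list can be extended with any other column. Here is a hedged sketch of a slightly richer hover tool. The extra price and row entries are my own additions for illustration and are not part of the app.

```python
from bokeh.models import HoverTool

# Sketch of a richer tooltip. "@name" looks up a column in the data source,
# "@{name with spaces}" handles column names containing spaces, and
# "$index" is a built-in variable for the row number of the hovered point.
hover = HoverTool(tooltips=[
    ("title", "@title"),
    ("variety", "@variety"),
    ("price", "@price"),
    ("row", "$index"),
])
```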

Bokeh uses figures as the base object for creating a visualization.
Once the figure is created, items can be placed on the figure. For this use case,
I decided to place circles on the figure based on the price and points assigned
to each wine.

p = figure(
    plot_height=600,
    plot_width=700,
    title="Australian Wine Analysis",
    tools=TOOLS,
    x_axis_label="points",
    y_axis_label="price (USD)",
    toolbar_location="above")

p.circle(
    y="price",
    x="points",
    source=source,
    color="variety_color",
    size=7,
    alpha=0.4)

Now that the basic plot is structured, we need to handle changes to the data and
make sure the appropriate updates are made to the visualization. With the addition
of a few functions, Bokeh does most of the heavy lifting to keep the visualization updated.

The first function is select_reviews. The basic purpose
of this function is to load the full dataset, apply any filtering based on user
input and return the filtered dataset as a pandas DataFrame.

In this particular example, we can filter data based on the maximum price,
province and string value in the title. The function uses standard pandas
operations to filter the data and get it down to a subset of data in the
selected DataFrame. Finally, the function updates the description
text to show what is being filtered.

def select_reviews():
    """ Use the current selections to determine which filters to apply to the
    data. Return a dataframe of the selected data
    """
    df = load_data()

    # Determine what has been selected for each widget
    max_price = price_max.value
    province_val = province.value
    title_val = title.value

    # Filter by price and province
    if province_val == "All":
        selected = df[df.price <= max_price]
    else:
        selected = df[(df.province == province_val) & (df.price <= max_price)]

    # Further filter by string in title if it is provided
    if title_val != "":
        selected = selected[selected.title.str.contains(title_val, case=False, na=False)]

    # Example showing how to update the description (the filter is inclusive)
    desc.text = "Province: {} and Price <= {}".format(province_val, max_price)
    return selected
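
The same filtering logic can be pulled out into a plain function that does not touch any widgets, which makes it easy to try in a notebook. This is a sketch and not the code the app uses: filter_reviews and the sample rows are hypothetical, and I have added regex=False so that search text such as an opening parenthesis is treated literally.

```python
import pandas as pd


def filter_reviews(df, max_price, province="All", title_text=""):
    """Return the rows matching the price, province and title filters."""
    mask = df["price"] <= max_price
    if province != "All":
        mask &= df["province"] == province
    if title_text:
        # regex=False treats the search text literally, so characters such as
        # "(" are safe; na=False makes rows with a missing title not match.
        mask &= df["title"].str.contains(
            title_text, case=False, regex=False, na=False)
    return df[mask]


sample = pd.DataFrame({
    "title": ["Yalumba 2016 Chardonnay", "Mitolo 2016 Rosé (McLaren Vale)", None],
    "province": ["South Australia", "South Australia", "Victoria"],
    "price": [18.0, 20.0, 300.0],
})

matches = filter_reviews(sample, max_price=200, title_text="(mclaren")
print(matches["title"].tolist())  # ['Mitolo 2016 Rosé (McLaren Vale)']
```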

The next helper function is used to update the ColumnDataSource we
set up earlier. This is straightforward with the exception of specifically
updating source.data versus just assigning a new source.

def update():
    """ Get the selected data and update the data in the source
    """
    df_active = select_reviews()
    source.data = ColumnDataSource(data=df_active).data
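
The reason for updating source.data is that the glyphs on the figure hold a reference to the original source object, so that object has to stay in place while its contents change. Here is a minimal sketch of the same idea using ColumnDataSource.from_df, which returns a plain dict of columns. The update_source name and the sample data are hypothetical.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame({"points": [86, 92, 87], "price": [18.0, 40.0, 30.0]})

# The source object that glyphs would be attached to.
source = ColumnDataSource(data=df)


def update_source(df_active):
    """Swap in new column data while keeping the same source object."""
    # from_df returns a plain dict of columns built from the DataFrame.
    source.data = ColumnDataSource.from_df(df_active)


update_source(df[df["price"] <= 20])
print(len(source.data["price"]))  # 1
```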

Up until now, we have focused on updating data when the user interacts
with the custom defined widgets. The other interaction we need to handle is when the
user selects a group of points via the LassoSelect tool. If a set of points
is selected, we need to get those details and display them below the graph.
In my opinion this is a really useful feature that enables some very intuitive
exploration of the data.

I will go through this function in smaller sections since there are some unique
Bokeh concepts here.

Bokeh keeps track of what has been selected as a 1d or 2d array depending on the
type of selection tool. We need to pull out the indices of all selected items
and use that to get a subset of data.

def selection_change(attrname, old, new):
    """ Function will be called when the poly select (or other selection tool)
    is used. Determine which items are selected and show the details below
    the graph
    """
    selected = source.selected["1d"]["indices"]

Now that we know what was selected, let’s get the latest dataset based on any
filtering that the user has done. If we do not do this, the indices will not
match up. Trust me, it took me a while to figure this out!

    df_active = select_reviews()
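
A tiny pandas example shows why the indices have to be applied to the filtered data. The selection indices are positions among the plotted points, so they only line up with the DataFrame that was actually sent to the plot. The data below is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
    "price": [250.0, 18.0, 300.0, 40.0],
})

# The plot only shows the filtered rows, so a selection index of 1 means
# "second plotted point", not "second row of the full file".
df_active = df[df["price"] <= 200]
selected = [1]

right = df_active.iloc[selected]["title"].tolist()
wrong = df.iloc[selected]["title"].tolist()
print(right)  # ['Wine D'] - row picked from the filtered data
print(wrong)  # ['Wine B'] - a different wine if the full data is used
```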

Now, if data is selected, let’s get that subset of data and transform it
so that it is easy to compare side by side. I used the style.render()
function to make the HTML more styled and consistent with the rest of the
app. As an aside, this new API in pandas allows for a lot more customization
of the HTML output of a DataFrame. I’m keeping it simple in this case, but
you can explore more in the pandas style docs.

    if selected:
        data = df_active.iloc[selected, :]
        temp = data.set_index("title").T.reindex(index=col_order)
        details.text = temp.style.render()
    else:
        details.text = "Selection Details"
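
Here is what the reshaping step does on its own, using two made-up wines. I use DataFrame.to_html() in this sketch to keep it dependency-free. In recent pandas releases the Styler equivalent is temp.style.to_html(), so treat the exact method name as an assumption that depends on your pandas version.

```python
import pandas as pd

col_order = ["price", "points", "variety", "province", "description"]

data = pd.DataFrame({
    "title": ["Wine A", "Wine B"],
    "variety": ["Shiraz", "Chardonnay"],
    "points": [92, 86],
    "price": [40.0, 18.0],
    "province": ["South Australia", "Victoria"],
    "description": ["Dark fruit.", "Citrus notes."],
})

# One column per selected wine, one row per attribute, in a fixed order.
temp = data.set_index("title").T.reindex(index=col_order)
print(temp.index.tolist())
print(temp.columns.tolist())

# Plain HTML table that could be assigned to the text of a Div widget.
html = temp.to_html()
```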

Here is what the selection looks like.

Tool bar

Now that the widgets and other interactive components are built and the process
for retrieving and filtering data is in place, they all need to be tied together.

For each control, register a callback so that any change calls the update
function. Bokeh passes the attribute name along with the old and new values,
which the lambda accepts and ignores.

controls = [province, price_max, title]

for control in controls:
    control.on_change("value", lambda attr, old, new: update())

If there is a selection, call the selection_change function.

source.on_change("selected", selection_change)
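
Later Bokeh releases (1.0 and up, as far as I know) expose the selection as an object with an indices property, and the callback is registered on that object instead. The sketch below assumes such a release. The seen list is a hypothetical helper used only to record what the callback receives.

```python
from bokeh.models import ColumnDataSource

source = ColumnDataSource(
    data={"points": [86, 92, 87], "price": [18.0, 40.0, 30.0]})

# Hypothetical list used here only to record what the callback received.
seen = []


def selection_change(attrname, old, new):
    """Called with the new list of selected row positions."""
    seen.append(list(new))


# Listen on the Selection object's "indices" property rather than on "selected".
source.selected.on_change("indices", selection_change)

# Set a selection from Python, standing in for a lasso selection in the browser.
source.selected.indices = [0, 2]
print(source.selected.indices)
```

Inside the callback, new already holds the selected positions, so there is no need to dig through a nested dictionary.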

The next section controls the layout. We set up the widgetbox as well as
the layout.

inputs = widgetbox(*controls, sizing_mode="fixed")
l = layout([[desc], [inputs, p], [details]], sizing_mode="fixed")
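
In more recent Bokeh releases widgetbox is no longer available, and column is the usual substitute. Here is a minimal sketch, assuming such a release. The province options are shortened and the figure is left out to keep the example small, and page is a hypothetical name.

```python
from bokeh.layouts import column, layout
from bokeh.models import Div, Select, Slider, TextInput

desc = Div(text="All Provinces", width=800)
province = Select(title="Province", options=["All", "Victoria"], value="All")
price_max = Slider(start=0, end=900, step=5, value=200, title="Maximum Price")
title = TextInput(title="Title Contains")
details = Div(text="Selection Details:", width=800)

# column() stacks the widgets vertically, playing the role of widgetbox here.
inputs = column(province, price_max, title, sizing_mode="fixed")

# In the app the figure would sit next to inputs in the middle row.
page = layout([[desc], [inputs], [details]], sizing_mode="fixed")
```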

We need to do an initial update of the data, then attach this model and its layout
to the current document. The last line adds a title for the browser window.

update()
curdoc().add_root(l)
curdoc().title = "Australian Wine Analysis"

If we want to execute the app, run this from the command line:

bokeh serve winepicker.py

Open up the browser and go to http://localhost:5006/winepicker and explore the data.

Demo

I have created a video that walks through the
interactive nature of the application. I think this brief video does a good
job of showing all the interactive options available with this approach. If you have
been interested enough to read this far, it is worth your time to
watch the video and see the app in action.

Summary

There are many options for visualizing data within the Python ecosystem. Bokeh
specializes in making visualizations that have a high degree of interactive capability out
of the box as well as the ability to customize even further with some additional coding.
In my experience, there is a bit of a learning curve to get these apps working
but they can be very useful tools for visualizing data.

I hope this article will be a useful guide for others that are interested in building
their own custom visualizations for their unique business problems. Feel free to leave
a comment if this post is helpful.

Edits

29-Jan-2018: Fixed single vs double quotes for consistency. Also made sure title search was not case sensitive.