Exploring Data Visually
In 1977, statistician John Tukey published his book Exploratory Data Analysis, which described how to analyze data through visualization and encouraged data professionals to do so. This was during a time when most analysis was performed in the context of hypothesis tests and statistical models, one computer filled a room, and graphs were typically drawn by hand. For example, in his book, Tukey provides a tip on how to draw darker symbols with a pen instead of a pencil.
Although the technology was bigger and slower back then, the driving principle is the same. You can see a lot in a picture, and what you see can lead to answers or generate more questions you otherwise never would have thought of.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” (John W. Tukey, Exploratory Data Analysis, 1977)
The public-facing side of visualization (the polished graphics that you see in the news, on websites, and in books) shows data graphics at their best, but what is the process to get to that final picture? There is an exploration phase that most people never see, but it can lead to visualization that is a level above the work of those who do not look closely at their data. The better that you understand what your data is about, the better you can communicate your findings.
Even if you don’t plan to show your results to a wide audience, visualization as an analysis tool enables you to explore data and find stories that you might not find with formal statistical tests. You just need to know what to look for and what questions to ask based on the data that you have available.
The great thing is that tools and access to data are less of a limiting factor than they were in Tukey’s time, so you aren’t stuck with just pencils, paper, and a ruler to draw thousands of dots and lines.
The specific steps you take in any analysis vary by dataset and project, but generally speaking, you should consider these four questions when you explore your data visually:
- What data do you have?
- What do you want to know about your data?
- What visualization methods should you use?
- What do you see and does it make sense?
The answer to each question depends on the answers that come before, and it’s common to jump back and forth between questions. As shown in Figure 4-1, it’s an iterative process. For example, if your dataset is only a handful of observations, this limits what you can find in your data and what visualization methods are useful, and you won’t see much.
On the other hand, if you have a lot of data, what you see when you visualize one aspect of it can lead to a curiosity about other dimensions, which in turn leads to different graphics. This is the fun part.
WHAT DATA DO YOU HAVE?
People often form a picture in their head of what a visualization should look like or find an example that they want to mimic. The excitement is great, but when it’s time to visualize, they realize they either need more data or their data doesn’t work with the chart they want to make.
The common mistake is to form a visual first and get the data later. It should be the other way around—data first and visualization follows.
Often, getting the data that you need is the hardest and most time-consuming part of the visualization process. In school, data is handed to you formatted the way you need so that you can easily load it into the software of choice, but this is hardly ever the case in practice. You might need to scrape data from a website, access an API, or derive values from existing data.
For example, you might have a list of addresses, but to map them, you need latitude and longitude coordinates. Or you have observations for individuals of a population, but you might be more interested in subpopulations.
Programming can be helpful in this case to automate parts of the process, but there are a growing number of click-and-play applications to manage data, too.
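As a minimal sketch of the second case above (individual observations rolled up into subpopulations), the following uses only the Python standard library. The age groups, the sleep values, and the function name summarize_by_group are all made up for illustration and are not from the book.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level observations: (age_group, hours_of_sleep).
# The values are invented for illustration only.
observations = [
    ("18-29", 7.5),
    ("18-29", 6.5),
    ("30-49", 7.0),
    ("30-49", 6.0),
    ("30-49", 6.5),
    ("50+", 8.0),
]

def summarize_by_group(rows):
    """Group individual rows by their first field and summarize the second."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    # One summary record per subpopulation: group size and average value.
    return {g: {"n": len(vals), "mean": mean(vals)} for g, vals in groups.items()}

summary = summarize_by_group(observations)
for group, stats in sorted(summary.items()):
    print(f"{group}: n={stats['n']}, mean={stats['mean']:.2f}")
```

The derived table has one row per subpopulation, which is often the shape you need before you can chart group-level comparisons.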
When you have data you want to explore, pause for a second to consider what the values represent, where the data is from, and how the variables were measured.
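One way to make that pause concrete is a quick profile of each column before drawing anything. The sketch below is only an illustration under stated assumptions: the CSV text, the column names, and the profile helper are hypothetical, and an empty field is assumed to mean a missing value.

```python
import csv
import io

# Hypothetical CSV contents; in practice this would be read from a file.
raw = """city,temp_f,source
Austin,95,station
Boston,,station
Chicago,71,survey
Denver,68,station
"""

def profile(text):
    """Per column, count missing values and list the distinct non-missing values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        # Assumption: an empty string marks a missing value.
        present = [v for v in values if v != ""]
        report[column] = {
            "missing": len(values) - len(present),
            "distinct": sorted(set(present)),
        }
    return report

report = profile(raw)
for column, info in report.items():
    print(column, info)
```

A listing like this surfaces gaps and mixed sources (here, "station" versus "survey" measurements) that are worth understanding before you compare values in a chart.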
WHAT VISUALIZATION METHODS SHOULD YOU USE?
As you saw in the previous chapter, there are many chart options and combinations of visual cues to choose from. It’s easy to obsess over picking the right chart for your data, but during the early stages of exploration, it’s more important to see your data from different angles and to drill down to what matters for your project.
Make multiple charts, compare all your variables, and see if there are interesting bits that are worth a closer look. Look at your data as a whole and then zoom in on categories and individual data points.
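To decide which pairs of variables to chart first, one option is to loop over every pair and compute a rough summary. The sketch below uses the Pearson correlation as that summary. The dataset and variable names are hypothetical, and correlation only captures linear relationships, so it is a way to prioritize charts and not a replacement for looking at them.

```python
from itertools import combinations
from math import sqrt

# Hypothetical dataset: each key is a variable, each list holds one value per observation.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 66, 74, 82],
    "commute": [30, 10, 25, 5, 20],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Compare every pair of variables, strongest linear relationship first.
pairs = sorted(
    ((a, b, pearson(data[a], data[b])) for a, b in combinations(data, 2)),
    key=lambda p: abs(p[2]),
    reverse=True,
)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Pairs near the top of the list are candidates for a scatter plot, and pairs near the bottom may still hide nonlinear patterns that only a picture will show.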
This is also a great time to experiment with visual forms. Try different scales, colors, shapes, sizes, and geometries, and you might stumble upon a graphic worth pursuing further.
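As a small illustration of how a change of scale can change what you see, the sketch below bins the same values two ways and prints a text histogram for each. The values and helper names are hypothetical, and the log version assumes every value is positive.

```python
from collections import Counter
from math import floor, log10

# Hypothetical skewed values (for example, page views per article).
values = [1, 2, 3, 5, 8, 12, 40, 90, 300, 1500, 9000]

def linear_bins(xs, width):
    """Count values in equal-width bins; each key is a bin's lower edge."""
    return Counter((x // width) * width for x in xs)

def log_bins(xs):
    """Count values by order of magnitude; each key is a power of ten.

    Assumes all values are positive, since log10 is undefined otherwise.
    """
    return Counter(floor(log10(x)) for x in xs)

def show(counts, label):
    """Print a simple text histogram, one row per bin."""
    print(label)
    for key in sorted(counts):
        print(f"  {key:>5} | {'#' * counts[key]}")

show(linear_bins(values, 1000), "Linear scale (bin width 1000):")
show(log_bins(values), "Log scale (power of ten):")
```

On the linear scale nearly everything lands in the first bin, while the log scale spreads the same values across several bins, so the two views invite different questions about the data.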
You don’t always need to stick to the visual cues that are the “best” at showing data most accurately and are easiest to read. When exploration is your goal, don’t let a list of best practices stop you from trying something different because complex data often requires complex visualization.
For example, Figure 4-3 shows Notabilia, an interactive exploration of article deletions on Wikipedia by Moritz Stefaner, Dario Taraborelli, and Giovanni Luca Ciampaglia. Wikipedia is a large resource of data with small and large data tables within articles, article edits over time, and users' interactions with articles and with each other. The data can be explored on many dimensions, but the narrow topic focus of Notabilia shows a clearer picture.
Each branch represents a user discussion about whether an article should be deleted, and those that curl to the right are discussions that lean strongly for deletion. A curl to the left is a discussion leaning toward keeping an article. The more prominent the curl is, the stronger the agreement between users. Although the visualization isn’t traditional, you still get something out of it.
That said, traditional visualization, such as bar graphs and line charts, can be made easily and read quickly, which makes them great tools to explore data.
As your goals shift, so do your choices of visualization. If you were to design a dashboard that provides the status of a system at a glance, you must visualize the data in a way that is straightforward to digest. On the other hand, if the goal is to encourage reflection or to evoke emotions, efficiency might not be your main concern.
Excerpted with permission from the publisher, Wiley, from Data Points: Visualization That Means Something by Nathan Yau. Copyright © 2013
Nathan Yau, author of Data Points: Visualization That Means Something, has a PhD in statistics and is a statistical consultant who helps clients make use of their data through visualization. He created the popular site FlowingData.com, and is the author of Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, also published by Wiley.