from training.npr.org: http://training.npr.org/visual/what-to-do-with-a-big-pile-of-data/
How to 'interview' a big pile of data
When confronted with a big pile of data, these tips will help you find sense in the numbers, find story ideas, and ask further questions.
Like interviewing people, these techniques won’t necessarily uncover a smoking gun. In fact, such analysis rarely leads to great insights. That’s precisely why learning how to quickly ask some basic questions of data can be so useful.
This guide assumes you have knowledge of the fundamental spreadsheet concepts of cells and formulas.
Note: All of these tips are based on the City of Chicago’s Current Employee Names, Salary, Position, and Department dataset.
Get the data about the data
Your first job is to find out everything you can about the data. In this case, clicking the “about” button on the dataset page above provides this information:
Before publishing anything using this data, you’ll need to verify the details of this data.
Based on the publication date and description (“updated quarterly), it seems the data is current as of Q4, 2014. But that’s just an assumption. You still need to do your due diligence and contact the owner to confirm details about the data. At least we know the institutional owner (Human Resources), and a specific person we can contact (Eric Phillips).
Good questions to ask when researching data:
- Who created it?
- Who is responsible for maintaining it?
- What point in time or time-range does it apply to?
- How was the data collected?
- How was the data processed after being collected?
- How was each field collected or calculated?
Get it into a spreadsheet
We’ll be using Google Sheets, which is free and useful for collaboration. Excel is also great.
If you’ve got a sensitive data set — from, say, Edward Snowden — then you probably don’t want it in Google Sheets, a system that could be compromised by a motivated government or hacker.
Clean up and format the sheet
A little formatting can go a long way.
Size your columns so that text isn’t badly truncated. Use number formatting to add commas to large numbers and set the appropriate number of decimals for small/precise numbers. Use currency formatting for money and percent for fractions.
To make browsing and sorting easier, fix the header to the top of the sheet by dragging the bottom border of the cell headers downward.
Sort it this way, sort it that way
If your data has numerical values (e.g. employee salaries), sort the numerical columns to find the maximum and minimum values. To sort, hover on a column header, click the teensy down arrow that appears, and select one of the two sort options.
The sort option names might be a little misleading. The “A → Z” sort option sorts from smallest to largest, while the “Z → A” sort option sorts from largest to smallest.
We’ll start by sorting the salary database from highest to lowest (“Z → A”):
This matches what you might expect to see — top level city employees like the police superintendent and mayor are the big earners.
Now let’s sort the other way, from lowest-to-highest:
That’s more like it! Turns out there’s a guy in the mayor’s office who is paid $0.96 a year.
A cursory Google search reveals that Steven Koch isn’t just an administrative secretary but is deputy mayor, a former Wall Street investment banker, and considered one of the most powerful people in Chicago.
Two simple sorts, and now we’ve got our first story lead. Pro-tip: As far as we can tell, nobody’s done a story about this rather influential person’s 96 cent salary. Hmm….
Group and count with pivot tables
Pivot tables allow you to shuffle the data to see if you can discover something beyond the standard organization of the numbers. Specifically, they let you count, sum, and calculate based on categories in your data.
Pivot tables get a bad rap for being complicated and hard to understand. They make more sense once you start with a question. Here are a few: How many employees does each department have? How much total salary is paid by each department? What’s the median salary for each department?
If the question sounds like “What is the <summary of variable> for each <category>?”, the answer is probably a pivot table.
Let’s do it! Create a pivot table from “data” menu, then add your <category> — in our case, department — as the rows:
Now we can count the number of employees per department using the curiously named “COUNTA” summary function:
Why did we use “COUNTA”? Some of it is practice: Experience and some searching reveals COUNTA counts the number of values. But some is experimentation. There’s no single way to analyze this data — this is a fundamentally experimental process.
Wash, rinse, repeat to play around with different types of summaries:
These summary numbers are useful for understanding scale and providing context.
In this case, we see that the Chicago Police Department spends over a billion dollars just on salaries. That’s more than double the total salary expenditure of the Fire Department, the next biggest spender.
We also see that “DoIT” (Department of Innovation and Technology, something else you’ll need to look up and confirm) employees have the highest median salary. This is another potential story or bit of context for your reporting.
Copy your pivot tables to a new sheet
Pivot tables may be powerful, but they are a real pain to sort and analyze like we did at the beginning. In fact, trying to sort can cause your pivot table to get a little funky. You can get around this by copying and pasting them to a new sheet. When you paste, you must select “paste values” or you’ll simply recreate the pivot table and the funky sorting that comes with it.
Express counts as a percentage of the whole
Police and fire salary spending seem like awfully impressive numbers, but how do they compare to the whole? Once you’ve copied your pivot table to a new sheet, some basic formulas can give insight.
Now you can format, sort, and sum your summary sheet:
This lets you include sentences like “the Fire Department’s $423 million dollar salary budget represents over 17% of salaries paid by the city”.
With independent research, we can also put the salary in the context of the larger budget. Chicago’s total budget in 2014 was about $7 billion. $2.43b / $7b = 0.347. Therefore salaries accounted for about 35% of the total city budget in 2014.
David Eads is a news application developer on the NPR Visuals team.