In August 2014, the organization Poderopedia published The Iberoamerican Data Driven Journalism Manual (in Spanish). The ebook is a compilation of articles with advice and techniques from more than 30 reporters, visualizers and programmers. I had the honor of being part of the project. Below is the English version of the article I wrote.
Interview with a database
A database is like any other source that journalists face daily. It is prone to tell lies, hide information, give us a partial picture of a phenomenon, and lead us into error. Although we expect a lot from numbers, they are fallible and do not hold the absolute truth because, simply put, databases are made by people.
They may contain involuntary or deliberate errors, and it is always advisable to keep this in mind before using them as the basis for investigative journalism.
That is why data sets should be treated with the same rigor as our human and documentary sources: validate their content and verify their authenticity with third parties.
Conventional reporting is not merely unavoidable in data journalism; it is an obligation. It prevents the publication of erroneous conclusions that could be disastrous for a journalist's career and even for the news outlet.
With these points clear, the next step before interviewing a database is to get to know it thoroughly. Investigate in depth:
- Who collected the numbers and what was their purpose?
- What was the collection methodology?
- How reliable is the person or institution that created it?
- Is the document complete, was some information excluded, or does it contain only some basic cross-tabulated data put together to meet the journalist's request?
- What interests are pursued by the person delivering the data?
- Or rather, what could the person or institution that refuses to provide it try to hide?
- Does that record of figures provide all the information needed to start a project, or should you look for other databases or even create your own?
Once those questions are answered, and if most of the answers are satisfactory, the next step is to spend a lot of time examining those spreadsheets, sometimes to the point of obsession, in Excel, Tableau, SPSS, SQL or any other program used for data analysis.
Personally, I turn to the first two because, so far, they have proven very useful in most of the data journalism projects I have done.
Only after investing a good number of hours in understanding the structure of a database will you be ready to interview it properly and draw the meaningful, juicy conclusions that become the pillars of a successful project.
Doing so is vital for noticing inconsistencies such as mistyped figures, or names that are repeated or written differently even though they belong to the same entity.
These oversights lead to underestimated totals and distort the results of the investigation.
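One way to spot names written differently for the same entity is to compare a normalized version of each name. The sketch below is only an illustration: the school names are invented, and the normalization rules (ignore case, accents, punctuation and extra spaces) are an assumption that will not catch every variant, such as abbreviations.

```python
import unicodedata
from collections import defaultdict

def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())

def group_variants(names):
    """Map each normalized key to the distinct spellings found for it."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize_name(name)].add(name)
    # Keep only the keys that were written in more than one way.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}

# Hypothetical school names, invented for illustration.
names = ["Liceo San José", "LICEO SAN JOSE", "Liceo  San José.", "Colegio Central"]
variants = group_variants(names)
for key, spellings in variants.items():
    print(key, "->", spellings)
```

Any group this prints is a candidate for review by hand, not an automatic merge: two different entities can share a name.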
When my colleague Amy Ross and I assessed a database on school dropout covering 643 high schools, one of them stood out as the school that had reduced the phenomenon the most nationwide.
The numbers entered in the official register of the Ministry of Education said that at this institution the dropout rate had gone from 68% of enrollment in 2011 to 14% in 2013, a drop of about 53 percentage points.
The change was so extreme that it aroused suspicion. When I talked with the principal of that school to check the absolute and relative figures, he reviewed his records and confirmed a typing error in the number of dropouts for the last year; the actual figure was 50% of the students.
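A simple screen can surface changes like this one before they reach a story. The following is a minimal sketch, not the method we used: the field names and the 30-point threshold are assumptions, and only the 68% and 14% figures come from the case above; the other rows are invented.

```python
def flag_extreme_changes(rows, threshold_points=30.0):
    """Return (school, change) pairs whose rate moved more than the threshold."""
    flagged = []
    for row in rows:
        change = row["rate_end"] - row["rate_start"]
        if abs(change) > threshold_points:
            flagged.append((row["school"], change))
    # Largest swings first, so they are reviewed before anything else.
    return sorted(flagged, key=lambda item: abs(item[1]), reverse=True)

# "School A" uses the rates from the case above; B and C are hypothetical.
schools = [
    {"school": "School A", "rate_start": 68.0, "rate_end": 14.0},
    {"school": "School B", "rate_start": 12.0, "rate_end": 10.5},
    {"school": "School C", "rate_start": 20.0, "rate_end": 55.0},
]
suspects = flag_extreme_changes(schools)
for school, change in suspects:
    print(f"{school}: {change:+.1f} percentage points")
```

A flagged row is not necessarily wrong. It is a reason to call the source and ask, as we did with the principal.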
Another benefit of exploring a database in depth is noting missing numbers.
Again, in the school dropout project, it caught our attention that in one of the largest institutions (with more than 1,000 students), the number of dropouts had gone from 445 in 2012 to zero in 2013.
Obviously, something was missing. Indeed, the Ministry of Education confirmed that because of "an involuntary error" the 694 students who left the school in 2013 had not been included in that report.
That number was critical; without it, we would have missed that this school is one of those where dropout is most severe.
We must pay close attention to those details. Just imagine what would happen if, when using the figures on the country's crime rate, you failed to notice that the numbers of robberies, assaults, and murders were missing for key municipalities. All your work would go down the drain because you would reach false conclusions.
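Gaps of this kind can also be searched for systematically. The sketch below assumes a hypothetical table with one row per school and flags large schools whose latest count is blank or falls to zero after a positive value. The field names, the enrollment cutoff and every row except the 445-to-zero case are assumptions made for illustration.

```python
def flag_suspicious_gaps(rows, min_enrollment=500):
    """Flag large schools whose current count is missing or drops to zero."""
    suspects = []
    for row in rows:
        previous, current = row["dropouts_prev"], row["dropouts_curr"]
        missing = current is None
        sudden_zero = current == 0 and previous is not None and previous > 0
        if row["enrollment"] >= min_enrollment and (missing or sudden_zero):
            suspects.append(row["school"])
    return suspects

# Enrollment figures are invented; 445 -> 0 mirrors the case described above.
rows = [
    {"school": "Large school", "enrollment": 1200, "dropouts_prev": 445, "dropouts_curr": 0},
    {"school": "Blank school", "enrollment": 900, "dropouts_prev": 60, "dropouts_curr": None},
    {"school": "Small school", "enrollment": 80, "dropouts_prev": 3, "dropouts_curr": 0},
    {"school": "Normal school", "enrollment": 700, "dropouts_prev": 50, "dropouts_curr": 47},
]
gaps = flag_suspicious_gaps(rows)
print(gaps)
```

The cutoff exists because a zero is plausible in a very small school but hard to believe in one with hundreds of students.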
After a thorough examination of the database, you will know exactly whether or not it is capable of resolving, partially or totally, the unknowns you have on the issue to investigate.
It is then useful to list the questions you are trying to answer when you analyze the Excel document with the help of filters and pivot tables.
If you don't know how to use these tools, I recommend these tutorials from the Centro para el Periodismo de Investigación (Center for Investigative Journalism) and the Consorcio Internacional de Periodistas de Investigación (International Consortium of Investigative Journalists).
Let's suppose that the database in question is the crime rate that I mentioned above. Going from the general to the specific, some basic queries you can include in that interview are as follows:
- What is the total number of crimes that occurred in the country during the year or years for which figures are available?
- Has crime increased or decreased?
- What are the most common types of crimes and their frequency per year? Have they risen or declined?
- Which is the municipality where crime has increased the most, overall and by type of incident?
- Conversely, which is the municipality where crime indicators have gone down?
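The same questions that filters and pivot tables answer in Excel can be sketched in a few lines of code. The example below is an illustration under stated assumptions: the records, the municipality names and the record layout are all invented, and it works with raw counts only.

```python
from collections import defaultdict

# Hypothetical records: (year, municipality, crime_type, count). All values invented.
records = [
    (2012, "North", "robbery", 120),
    (2012, "North", "assault", 40),
    (2012, "South", "robbery", 80),
    (2012, "South", "assault", 30),
    (2013, "North", "robbery", 150),
    (2013, "North", "assault", 35),
    (2013, "South", "robbery", 70),
    (2013, "South", "assault", 25),
]

def totals_by(records, field):
    """Sum counts per (year, value of the chosen field)."""
    index = {"municipality": 1, "crime_type": 2}[field]
    totals = defaultdict(int)
    for record in records:
        totals[(record[0], record[index])] += record[3]
    return dict(totals)

def change_between(totals, start_year, end_year):
    """Difference in counts between two years for every group."""
    groups = {group for (_, group) in totals}
    return {
        group: totals.get((end_year, group), 0) - totals.get((start_year, group), 0)
        for group in groups
    }

# Total crimes per year: has crime increased or decreased?
total_per_year = defaultdict(int)
for year, _, _, count in records:
    total_per_year[year] += count

# Changes by municipality and by type of crime.
by_municipality = change_between(totals_by(records, "municipality"), 2012, 2013)
by_type = change_between(totals_by(records, "crime_type"), 2012, 2013)
largest_increase = max(by_municipality, key=by_municipality.get)
largest_decrease = min(by_municipality, key=by_municipality.get)
print(dict(total_per_year), by_municipality, by_type)
```

Raw counts like these are only a first pass, because they ignore how many people live in each municipality.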
For cases like this, or for disease incidence, always remember to calculate rates per 10,000 or 100,000 inhabitants. It's the most reliable way to verify whether a phenomenon has increased or decreased over time.
To do this, you must have information on the total population of each municipality for the years of interest.
Let's suppose that last year there were 40 serious crimes in your jurisdiction and the total population is 50,000 inhabitants. The rate is the number of crimes divided by the population, multiplied by 10,000; with the figures typed in directly, the Excel formula is =(40/50000)*10000.
Using this example, we can conclude that in 2013 there were 8 crimes per 10,000 inhabitants. We may then wonder: is that amount greater or smaller than in 2004, when 25 crimes were recorded in total?
If the population in 2004 was 30,000 people, the same formula gives about 8.3 crimes per 10,000 inhabitants, so we would conclude that the crime rate has remained at roughly 8 crimes per 10,000 inhabitants.
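The comparison between the two years can be checked with a short calculation. This is a minimal sketch; the function name and its parameters are my own choices for illustration, and the figures are the ones used in the example above.

```python
def rate_per(events, population, per=10_000):
    """Events per `per` inhabitants, e.g. crimes per 10,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to keep the arithmetic as exact as possible.
    return events * per / population

rate_2013 = rate_per(40, 50_000)   # 40 crimes, 50,000 inhabitants
rate_2004 = rate_per(25, 30_000)   # 25 crimes, 30,000 inhabitants

print(f"2013: {rate_2013:.1f} per 10,000")
print(f"2004: {rate_2004:.1f} per 10,000")
```

Even though the raw number of crimes rose from 25 to 40, the rate barely moved, which is why the comparison should be made on rates and not on totals.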
With this comparative data you might question:
- How did crime behave in my municipality in each year from 2005 to 2012?
- Did the rate stay close to 8 violent acts per 10,000 inhabitants or were there variations between those years?
- Were these changes abrupt or not?
- If changes were abrupt: why is the fight against crime a seesaw year to year?
- If the rate is stable, why does it remain like that?
- How many police officers are there per 10,000 inhabitants in the city?
- What is the budget that authorities invest annually in security?
- Is the crime rate high or low in my municipality with respect to other provinces or the country as a whole?
As you can see, a database can and should be interviewed several times during the investigation, as with any other source.
Often, some of the answers will trigger new questions whose answers lie in other databases or require other documents and official spokespeople.
Finally, never forget to reflect on the most crucial question to ask a database: why is the story told by its figures important to people?