What's faster for Stata: manipulating data in a flat database (e.g. Excel) or in a relational database?

I'm an entry-level optimization analyst at a company that publishes risk ratings data for various companies. We have tons of data, to the point where our history is currently limited solely by Excel's maximum number of rows.

We currently use many .do files in Stata to perform all our manipulations and statistical analyses (the largest production run takes 9 hours, with a single insheet taking half a minute). I'm trying to convince the company to move away from a flat database to a relational database, but I've had trouble finding information online about whether flat or relational is better for Stata. So: which is better, and why?


I would suggest that you answered your own question by pointing out that the limitations of Excel prevent you from capitalising on the full potential of your data. Excel is not a proper analytical tool or data warehousing solution, and as such there is no point in using it for analytical projects involving anything more complex than basic sums for a small business or household.

To answer your question:

  1. Flat-file databases are an archaic technology dating back to the early days of computing: they were never designed to meet modern analytical needs such as working with large datasets or live data streams.

  2. Relational databases

    • help to avoid data duplication (each fact is stored once and referenced elsewhere)
    • help to avoid inconsistent records (constraints and keys are enforced by the database)
    • make it easier to change the data structure as requirements evolve
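As a practical sketch of how this fits your Stata workflow: instead of insheet-ing an entire flat file into memory, Stata can query a relational database directly through its built-in odbc command, pulling only the rows and columns a given .do file actually needs. The DSN name, table, and column names below are hypothetical placeholders, not your actual schema:

```
* Hypothetical setup: an ODBC data source named "riskdb" containing a
* "ratings" table. Load only the slice required for this analysis,
* rather than reading the whole history from a flat file.
odbc load, ///
    exec("SELECT company_id, rating, rating_date FROM ratings WHERE rating_date >= '2015-01-01'") ///
    dsn("riskdb") clear
```

Filtering at the database level like this is often where the time savings come from: the 9-hour run no longer pays to read and discard rows it never uses.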

