Free data tools to consider

YData Profiling

YData Profiling is a data profiler with a FOSS component and a paid upgrade. It is easy to use and powerful, and a solid choice if you are working with Python in the Spark ecosystem. If you can load your data into a Pandas DataFrame, you can profile it and view the results in an interactive web page. It is great for initial exploration, and you can also incorporate it into your ETL pipelines, capturing the results as JSON structures or HTML output.
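As a quick sketch of how this looks (the file name, title, and output paths are illustrative):

import pandas as pd
from ydata_profiling import ProfileReport

# Load any tabular data into a Pandas DataFrame (file name is illustrative).
df = pd.read_csv("customers.csv")

# Build the profile and write it out as an interactive HTML page.
profile = ProfileReport(df, title="Customer data profile")
profile.to_file("customer_profile.html")

# The same results can be captured as a JSON string for use in an ETL pipeline.
report_json = profile.to_json()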

The default settings are usually good, but I have found that the correlation calculations can be problematic on large datasets. Disable them, or run them against a sample of the data instead.
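A sketch of both workarounds, reusing the df from the example above (the correlations configuration keys are based on my reading of the ydata-profiling settings documentation and may vary by version):

from ydata_profiling import ProfileReport

# Option 1: turn off the expensive correlation calculations.
# (Passing minimal=True instead disables most heavy computations at once.)
profile = ProfileReport(
    df,
    title="Large dataset profile",
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
    },
)

# Option 2: keep correlations on, but profile a random sample of the data.
sample_profile = ProfileReport(df.sample(frac=0.1, random_state=42))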

Like many tools, there is a free version and an up-sell version with more features. With YData, the paid tier works well with relational databases and produces a data catalog, among other features. The free version, hosted on GitHub, is quite capable.

Great Expectations

Great Expectations is a data quality package with a FOSS component and a paid upgrade (a pattern is developing). You can use it for data profiling and exploration, although I prefer YData Profiling for that. Where Great Expectations shines is as a framework for defining data quality tests, executing them, and acting on the results. For example, you may expect that a given column is never null, or that its values are unique. GE provides a large set of expectations out of the box, and you can also code your own within the framework.

Expectations can be combined into suites and incorporated into your pipelines, and you can hang alerts off of them. You can generate web sites that summarize the data quality tests and their results from each pipeline execution, and the results of each test are stored in JSON files that are easy to manipulate. Altogether it is a strong framework for incorporating data quality testing into your ETL pipelines.
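As a sketch, here is the older pandas-dataset style of defining a couple of expectations; the exact calls differ in newer GX releases, and the file and column names are illustrative:

import pandas as pd
import great_expectations as ge

# Wrap a Pandas DataFrame so it exposes the expect_* methods.
df = ge.from_pandas(pd.read_csv("orders.csv"))

# Two common expectations: a key column is never null and its values are unique.
not_null_result = df.expect_column_values_to_not_be_null("order_id")
unique_result = df.expect_column_values_to_be_unique("order_id")

# Each result is a JSON-like structure with a success flag you can act on,
# for example by failing the pipeline step.
if not (not_null_result.success and unique_result.success):
    raise ValueError("Data quality checks failed for order_id")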

Glen Pennington

I have over 30 years of IT experience applying computer technology to many fields of business. I have deep experience developing ETL/ELT pipelines that populate data warehouses, data lakes, and data products built on traditional relational databases and on cloud-based platforms such as Snowflake and the Spark ecosystem.

As a data architect, I design and implement data products with consistent naming standards and rigorous data quality standards. These solutions are built on insights gained through data profiling and enforced through embedded data quality checks.

I have extensive experience with many languages and platforms, and with several software development methodologies.
