Free data tools to consider

YData Profiling

YData Profiling is a data profiler with a FOSS component and a paid upgrade. It is easy to use and powerful, and a solid choice if you are working with Python in the Spark ecosystem. If you can load your data into a Pandas DataFrame, you can profile it and view the results in an interactive web page. It is great for initial exploration, and you can also incorporate it into your ETL pipelines, capturing the results as JSON structures or HTML output.
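As a quick sketch of how this looks (the file name, title, and output paths are illustrative):

import pandas as pd
from ydata_profiling import ProfileReport

# Load any tabular data into a Pandas DataFrame (file name is illustrative).
df = pd.read_csv("customers.csv")

# Build the profile and write it out as an interactive HTML page.
profile = ProfileReport(df, title="Customer data profile")
profile.to_file("customer_profile.html")

# The same results can be captured as a JSON string for use in an ETL pipeline.
report_json = profile.to_json()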

The default settings are usually good, but I have found that the correlation calculations can be problematic on large datasets. Disable them, or run them against a sample of the data instead.
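A sketch of both workarounds, reusing the df from the example above (the correlations configuration keys are based on my reading of the ydata-profiling settings documentation and may vary by version):

from ydata_profiling import ProfileReport

# Option 1: turn off the expensive correlation calculations.
# (Passing minimal=True instead disables most heavy computations at once.)
profile = ProfileReport(
    df,
    title="Large dataset profile",
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
    },
)

# Option 2: keep correlations on, but profile a random sample of the data.
sample_profile = ProfileReport(df.sample(frac=0.1, random_state=42))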

Like many tools, there is a free version and an up-sell version with more features. With YData, the paid tier works well with relational databases and produces a data catalog, among other features. The free version, hosted on GitHub, is quite capable.

Great Expectations

Great Expectations is a data quality package with a FOSS component and a paid upgrade (a pattern is developing). You can use it for data profiling and exploration, although I prefer YData Profiling for that. Where Great Expectations shines is as a framework for defining data quality tests, executing them, and acting on the results. For example, you may expect that a given column is never null, or that its values are unique. GE provides a large set of expectations out of the box, and you can also code your own within the framework.

Expectations can be combined into suites and incorporated into your pipelines, and you can hang alerts off of them. You can generate web sites that summarize the data quality tests and their results from each pipeline execution, and the results of each test are stored in JSON files that are easy to manipulate. Altogether it is a strong framework for incorporating data quality testing into your ETL pipelines.
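As a sketch, here is the older pandas-dataset style of defining a couple of expectations; the exact calls differ in newer GX releases, and the file and column names are illustrative:

import pandas as pd
import great_expectations as ge

# Wrap a Pandas DataFrame so it exposes the expect_* methods.
df = ge.from_pandas(pd.read_csv("orders.csv"))

# Two common expectations: a key column is never null and its values are unique.
not_null_result = df.expect_column_values_to_not_be_null("order_id")
unique_result = df.expect_column_values_to_be_unique("order_id")

# Each result is a JSON-like structure with a success flag you can act on,
# for example by failing the pipeline step.
if not (not_null_result.success and unique_result.success):
    raise ValueError("Data quality checks failed for order_id")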

Glen Pennington

I have over 30 years of IT experience applying computer technology to many fields of business. I have deep experience developing ETL/ELT pipelines that populate data warehouses, data lakes, and data products built on traditional relational databases and on cloud-based platforms such as Snowflake and the Spark ecosystem.

As a data architect, I design and implement data products with consistent naming standards and rigorous data quality standards. These solutions are built on insights gained through data profiling and enforced through embedded data quality checks.

I have extensive experience with many languages and platforms, and with several software development methodologies.
