Notable Open Source Tools
This list contains tools I have used before but do not use on a daily basis. You may want to check out my Daily Software.
Data Prep
Tabula
Tabula is a handy tool to extract data from PDF files.
If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is: you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface. - website
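Once Tabula has exported a CSV, the rows can be read with Python's standard library. The sketch below is only an illustration: the helper names load_rows and to_number are my own, and it assumes the export has a header row and UTF-8 encoding, which may not hold for every PDF.

```python
import csv


def load_rows(path):
    """Read a CSV file (for example, one exported from Tabula) into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        # DictReader treats the first row as the column names (an assumption
        # about the export; tables without a header need csv.reader instead).
        return [dict(row) for row in csv.DictReader(handle)]


def to_number(text):
    """Convert strings such as '1,234.50' to a float; return None if not numeric."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None
```

Numbers pulled out of PDFs often arrive as text with thousands separators, which is why the second helper strips commas before converting.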
Reporting
Tableau Public
Tableau, in my experience, is probably the fastest way to either map geographical data or make a beautiful interactive dashboard. Just import a file or hook up a database connection and start dragging and dropping data points onto plots.
Platforms
KNIME
I have used KNIME to do basic data transformation, automate ETL pipelines, and build full reporting web apps. Unfortunately, you would need to pay to host the KNIME server somewhere to take full advantage of web access to workflows, but you can still do everything else with the desktop version.
Programming
The software above is mostly basic drag-and-drop. To really break into the data science world, I recommend learning a programming language, and Python in particular. See my notes on it below, and pick up some skills on a site like Codecademy.
Anaconda
Want to start with Python? Just download Anaconda.
With over 4.5 million users, Anaconda is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy and SciPy that form the foundation of modern data science. Anaconda’s flagship product, Anaconda Enterprise, allows organizations to secure, govern, scale and extend Anaconda to deliver actionable insights that drive businesses and industries forward. - website
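After installing, a quick way to confirm the environment is usable is to check that the core libraries can be found. This is a minimal sketch using only the standard library; the package list is my own assumption about what you might want, not an official list of what Anaconda ships.

```python
import importlib.util

# Assumed list of packages to check; adjust to whatever you plan to use.
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib"]


def installed_packages(names):
    """Map each top-level package name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in installed_packages(PACKAGES).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

The check only locates each package without importing it, so it runs quickly even for large libraries.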
Jupyter
If you installed Anaconda, you have Jupyter. Now just fire up a notebook from the command line with jupyter notebook, navigate to localhost:8888, and start programming, visualizing, and taking notes (e.g. Markdown, HTML, LaTeX, etc.). For more, see Jupyter Quickstart. Note that you can program in languages other than Python by installing different kernels.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. - website
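To make the "live code plus narrative text" idea concrete, here is a rough sketch of what a notebook file looks like on disk: a JSON document holding a list of cells. The function names are hypothetical, and the field layout reflects my understanding of the version 4 notebook format, so treat it as an illustration rather than a reference.

```python
import json


def build_notebook(note, code):
    """Return a minimal notebook dict with one markdown cell and one code cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            # Narrative text lives in markdown cells.
            {"cell_type": "markdown", "metadata": {}, "source": note},
            # Code cells also carry their outputs and an execution counter.
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,  # the cell has not been run yet
                "outputs": [],
                "source": code,
            },
        ],
    }


def save_notebook(notebook, path):
    """Write the notebook dict to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)
```

For example, save_notebook(build_notebook("# My notes", "print('hello')"), "first_notebook.ipynb") writes a file (the name is arbitrary) that should then show up in the Jupyter file browser.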
Rodeo
Rodeo is a nice Python IDE, basically RStudio for Python.
Rodeo is a development environment that’s lightweight and intuitive, yet customizable to its core - your own personal home base for exploring and interpreting data. - website