Kaggle Pandas Cheat Sheet



This post updates the very popular earlier post 50+ Data Science, Machine Learning Cheat Sheets by Bhavya Geethika. If we missed any popular cheat sheets, please add them in the comments below.

My Python Pandas Cheat Sheet: the pandas functions I use every day as a data scientist and software engineer. Download the Anime recommendation dataset from Kaggle. Abhishek Thakur, a 4x Kaggle Grandmaster, says that he frequently finds himself using TensorFlow for NLP problems and PyTorch for computer vision problems. Among Python libraries, he singles out scikit-learn for praise because it provides many of the components needed to put a model into production.

Cheatsheets on Python, R and Numpy, Scipy, Pandas

Data science is a multi-disciplinary field, with thousands of packages and hundreds of programming functions to choose from. An aspiring data enthusiast need not know them all. A cheat sheet or reference card is a compilation of the most commonly used commands that helps you learn a language's syntax faster. Here are the most important ones, each captured in a few compact pages.

Mastering data science involves an understanding of statistics and mathematics, plus programming knowledge, especially in R, Python, and SQL. You then combine all of these with business understanding and human instinct to derive the insights that drive decisions.

Here are the cheat sheets by category:

Cheat sheets for Python:

Python is a popular choice for beginners, yet still powerful enough to back some of the world’s most popular products and applications. Its design makes the programming experience feel almost as natural as writing in English. The Python basics and Python Debugger cheat sheets for beginners cover the important syntax to get started. Community-provided libraries such as NumPy, SciPy, scikit-learn, and pandas are heavily relied on, and the NumPy/SciPy/Pandas Cheat Sheet provides a quick refresher on these.

  1. Python Cheat Sheet by DaveChild via cheatography.com
  2. Python Basics Reference sheet via cogsci.rpi.edu
  3. OverAPI.com Python cheatsheet
  4. Python 3 Cheat Sheet by Laurent Pointal

Cheat sheets for R:

R's ecosystem has expanded so much that a lot of reference material is needed. The R Reference Card covers most of the R world in a few pages. RStudio has also published a series of cheat sheets to make things easier for the R community. The data visualization with ggplot2 sheet seems to be a favorite, as it helps when you are creating graphs of your results.

At cran.r-project.org:

At Rstudio.com:

  1. R markdown cheatsheet, part 2

Others:

  1. DataCamp’s Data Analysis the data.table way

Cheat sheets for MySQL & SQL:

For a data scientist, the basics of SQL are as important as any other language. Both Pig and the Hive Query Language are closely associated with SQL, the original Structured Query Language. SQL cheatsheets provide a 5-minute quick guide to learning it, after which you can explore Hive and MySQL!

  1. SQL for dummies cheat sheet
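
As a taste of what such a quick guide covers, here is a minimal sketch that runs a few basic statements through Python's built-in sqlite3 module. The `pets` table and its rows are made up for illustration, and SQLite's dialect differs in places from MySQL and HiveQL, so treat it as a starting point only.

```python
import sqlite3

# In-memory database; the table name and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, kind TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [("Rex", "dog", 3), ("Tom", "cat", 5), ("Fido", "dog", 7)],
)

# SELECT ... WHERE: filter rows, ordered so the result is stable.
older = conn.execute(
    "SELECT name FROM pets WHERE age > 4 ORDER BY name"
).fetchall()

# GROUP BY with an aggregate: count rows per kind.
counts = conn.execute(
    "SELECT kind, COUNT(*) FROM pets GROUP BY kind ORDER BY kind"
).fetchall()

print(older)
print(counts)
conn.close()
```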

Cheat sheets for Spark, Scala, Java:

Apache Spark is an engine for large-scale data processing. For certain applications, such as iterative machine learning, Spark can be up to 100x faster than Hadoop (using MapReduce). The essentials of Apache Spark cheatsheet explains its place in the big data ecosystem, walks through setup and creation of a basic Spark application, and explains commonly used actions and operations.

  1. Dzone.com’s Apache Spark reference card
  2. DZone.com’s Scala reference card
  3. Openkd.info’s Scala on Spark cheat sheet
  4. Java cheat sheet at MIT.edu
  5. Cheat Sheets for Java at Princeton.edu

Cheat sheets for Hadoop & Hive:

Hadoop emerged as an unconventional tool for solving what was thought to be unsolvable, providing an open source software framework for the parallel processing of massive amounts of data. Explore the Hadoop cheatsheets to find useful commands for working with Hadoop on the command line. A combination of SQL and Hive functions is another one to check out.

Cheat sheets for web application framework Django:

Django is a free and open source web application framework written in Python. If you are new to Django, you can go over these cheatsheets to pick up the key concepts quickly, then dive deeper into each one.

  1. Django cheat sheet part 1, part 2, part 3, part 4

Cheat sheets for Machine learning:

We often find ourselves spending time wondering which algorithm is best, and then going back to our big books for reference! These cheat sheets ask about both the nature of your data and the problem you're working to address, and then suggest an algorithm for you to try.

  1. Machine Learning cheat sheet at scikit-learn.org
  2. Scikit-Learn Cheat Sheet: Python Machine Learning from yhat (added by GP)
  3. Patterns for Predictive Learning cheat sheet at Dzone.com
  4. Equations and tricks Machine Learning cheat sheet at Github.com
  5. Supervised learning superstitions cheatsheet at Github.com
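
The kind of decision logic these sheets encode can be sketched as a small function. This is a heavily simplified illustration, loosely modeled on the top-level branches of a flowchart-style cheat sheet; the function name, the `goal` labels, and the sample-size threshold are assumptions made for this sketch, and a real cheat sheet goes on to name specific algorithms.

```python
def suggest_family(n_samples, goal, has_labels=False):
    """Return a rough algorithm family for a problem.

    Hypothetical helper: `goal` is one of "category", "quantity",
    or "explore" (labels chosen for this sketch).
    """
    if n_samples <= 50:
        # Too little data to say much; the threshold is illustrative.
        return "get more data"
    if goal == "category":
        # Labeled examples allow supervised learning; otherwise group similar rows.
        return "classification" if has_labels else "clustering"
    if goal == "quantity":
        return "regression"
    if goal == "explore":
        return "dimensionality reduction"
    raise ValueError(f"unknown goal: {goal!r}")


print(suggest_family(1000, "category", has_labels=True))
print(suggest_family(1000, "category"))
print(suggest_family(20, "quantity"))
```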

Cheat sheets for Matlab/Octave

MATLAB (MATrix LABoratory) was developed by MathWorks in 1984. MATLAB has long been the most popular language for numerical computation in academia. With its highly optimized toolboxes, it is suited to a wide range of science and engineering tasks. MATLAB is not open source; however, GNU Octave is a free alternative that follows the same syntactic rules, so most code is compatible with MATLAB.

Cheat sheets for Cross Reference between languages

Related:

Sponsored by Datadog: pythonbytes.fm/datadog

Special guest: Vicki Boykis: @vboykis

Michael #1: clize: Turn functions into command-line interfaces

  • via Marcelo
  • Follow up from Typer on episode 164.
  • Features
    • Create command-line interfaces by creating functions and passing them to [clize.run](https://clize.readthedocs.io/en/stable/api.html#clize.run).
    • Enjoy a CLI automatically created from your functions’ parameters.
    • Bring your users familiar --help messages generated from your docstrings.
    • Reuse functionality across multiple commands using decorators.
    • Extend Clize with new parameter behavior.
  • I love how this is pure Python without its own API for the default case
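
The core idea in the feature list, deriving a CLI from a function's parameters, can be sketched with nothing but the standard library. To be clear, this is not clize's API or its implementation; `run_as_cli` is a hypothetical helper built on `inspect` and `argparse`, and it only handles simple cases (defaults of type `str` or `int`, for example).

```python
import argparse
import inspect


def run_as_cli(func, argv=None):
    """Build an argparse CLI from func's signature and call it.

    Hypothetical helper for illustration: parameters without a default
    become positional arguments, parameters with a default become --options.
    """
    parser = argparse.ArgumentParser(description=inspect.getdoc(func))
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            parser.add_argument(name)
        else:
            # Reuse the default's type to convert the command-line string.
            parser.add_argument(
                f"--{name}", type=type(param.default), default=param.default
            )
    args = parser.parse_args(argv)
    return func(**vars(args))


def greet(name, greeting="Hello"):
    """Greet someone by name."""
    return f"{greeting}, {name}!"


# Passing argv explicitly; omit it to read from the real command line.
print(run_as_cli(greet, ["World", "--greeting", "Hi"]))
```

The function itself stays plain Python, which is the appeal noted above: the docstring feeds the --help text and the signature defines the arguments.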

Vicki #2: How to cheat at Kaggle AI contests

  • Kaggle is a platform, now owned by Google, that allows data scientists to find data sets, learn data science, and participate in competitions
  • Many people participate in Kaggle competitions to sharpen their data science/modeling skills
  • Recently, a competition that was related to analyzing pet shelter data resulted in a huge controversy
  • Petfinder.my is a platform that helps people in Malaysia find pets to rescue from shelters. In 2019, they announced a collaboration with Kaggle to build a machine learning model predicting which pets (worldwide) were more likely to be adopted, based on the metadata of the descriptions on the site.
  • The total prize offered was $25,000
  • After several months, a contestant, who was already a Kaggle grandmaster, won and took home $10k.
  • A volunteer, Benjamin Minixhofer, offered to put the winning algorithm into production, and when he did, he found a huge discrepancy between the first- and second-place results
  • Technical Aspects of the controversy:
    • Contestants were asked to predict the speed at which a pet would be adopted, on a scale from 1-5, using input features like type of animal, breed, coloration, whether the animal was vaccinated, and adoption fee
    • The initial training set had 15k animals; after a couple of months, the teams were given 4k animals that their algorithms had not seen before, as a test of how accurate they were (a common machine learning best practice).
    • In a Jupyter notebook Kernel on Kaggle, Minixhofer explains how the winning team cheated
    • First, they individually scraped Petfinder.my to find the answers for the 4k test data
    • Using md5, they created a hash for each unique pet, and looked up the score for each hash from the external dataset - there were 3500 overlaps
    • They used Pandas column manipulation to get at the hidden prediction variable for every 10th pet, replacing the prediction that should have been generated by the algorithm with the actual value
    • The code mostly relied on obfuscated functions, Pandas, dictionaries, and MD5 hashes
  • Fallout:
    • He was fired from H2O.ai
    • Kaggle issued an apology
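
The mechanism described under the technical aspects can be sketched with the standard library. This illustrates the general technique only (hash each record, look the hash up in an externally obtained answer table, overwrite a fraction of the model's predictions); it is not the contestant's actual code, and the field names, sample rows, and function names are all hypothetical.

```python
import hashlib


def pet_hash(record):
    """MD5 of a pet's identifying fields, used as a lookup key."""
    key = "|".join(str(record[f]) for f in ("name", "breed", "description"))
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def leak_predictions(test_rows, model_preds, scraped_rows, every=10):
    """Overwrite every `every`-th matched prediction with the scraped answer."""
    # External answers keyed by hash, so matching is a dictionary lookup.
    answers = {pet_hash(r): r["adoption_speed"] for r in scraped_rows}
    out = list(model_preds)
    matched = 0
    for i, row in enumerate(test_rows):
        h = pet_hash(row)
        if h in answers:
            matched += 1
            if matched % every == 0:
                out[i] = answers[h]
    return out


# Made-up data: 20 scraped pets whose true adoption speed is 1.
scraped = [
    {"name": f"pet{i}", "breed": "mixed", "description": "friendly",
     "adoption_speed": 1}
    for i in range(20)
]
# The test set has the same pets, minus the answer column.
test_rows = [
    {k: v for k, v in r.items() if k != "adoption_speed"} for r in scraped
]
model_preds = [3] * 20  # what an honest model might have predicted

leaked = leak_predictions(test_rows, model_preds, scraped)
print(leaked)
```

In this toy run only the 10th and 20th predictions change, which shows how a leak of this shape can hide among otherwise genuine model output.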

Michael #3: Configuring uWSGI for Production Deployment

  • We run a lot of uWSGI-backed services. I covered this in depth on Talk Python 215: The software powering Talk Python courses and podcast.
  • This is guidance from Bloomberg Engineering’s Structured Products Applications group
  • We chose uWSGI as our host because of its performance and feature set. But, while powerful, uWSGI’s defaults are driven by backward compatibility and are not ideal for new deployments.
  • There is also an official Things to Know doc.
  • Unbit, the developer of uWSGI, has “decided to fix all of the bad defaults (especially for the Python plugin) in the 2.1 branch.” The 2.1 branch is not released yet.
  • Warning, I had trouble with die-on-term and systemctl
  • Settings I’m using:

Vicki #4: Thinc: A functional take on deep learning, compatible with TensorFlow, PyTorch, and MXNet

  • A deep learning library from Explosion that abstracts away some TensorFlow and PyTorch boilerplate
  • Already runs under the covers in spaCy, an NLP library used for deep learning
  • Type checking, particularly helpful for tensors: PyTorchWrapper and TensorFlowWrapper classes and the intermingling of both
  • Deep support for numpy structures and semantics
  • Assumes you’re going to be using stochastic gradient descent
  • And operates in batches
  • Also cleans up the configuration and hyperparameters
  • Mainly hopes to make it easier and more flexible to do matrix manipulations, using a codebase that already existed but was not customer-facing.
  • Examples and code are all available in notebooks in the GitHub repo

Michael #5: pandas-vet

  • via Jacob Deppen
  • A plugin for Flake8 that checks pandas code
  • Starting with pandas can be daunting.
  • The usual internet help sites are littered with different ways to do the same thing and some features that the pandas docs themselves discourage live on in the API.
  • Makes pandas a little more friendly for newcomers by taking some opinionated stances about pandas best practices.
  • The idea to create a linter was sparked by Ania Kapuścińska's talk at PyCascades 2019, 'Lint your code responsibly!'
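
To give a feel for how an opinionated lint rule of this kind might work, here is a minimal sketch built on the standard library `ast` module. It is not pandas-vet's implementation or the Flake8 plugin interface; `find_inplace_calls` is a hypothetical function, and flagging `inplace=True` is used here simply as an example of a pattern such a linter could discourage.

```python
import ast


def find_inplace_calls(source):
    """Return the line numbers of calls that pass inplace=True."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (
                    kw.arg == "inplace"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True
                ):
                    hits.append(node.lineno)
    return sorted(hits)


# The sample is only parsed, never executed, so pandas is not required.
sample = (
    "import pandas as pd\n"
    "df = pd.DataFrame({'a': [1, None]})\n"
    "df.dropna(inplace=True)\n"
    "df = df.fillna(0)\n"
)
print(find_inplace_calls(sample))
```

Working on the syntax tree rather than on raw text means the check ignores comments and strings that merely mention `inplace=True`.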

Vicki #6: NumPy beginner documentation

  • NumPy is the backbone of numerical computing in Python: Pandas (which I mentioned before), scikit-learn, TensorFlow, and PyTorch all lean heavily on its core concepts, if they don't depend on it directly. Those concepts include matrix operations through a data structure known as a NumPy array, or ndarray, which is different from a Python list.
  • Anne Bonner wrote up new documentation for NumPy that introduces these fundamental concepts to beginners coming to both Python and scientific computing
  • Before, you went directly to the section about arrays and had to search through it to find what you wanted. The new guide, which is very nice, includes a step-by-step on how arrays work, how to reshape them, and illustrated guides on basic array operations.
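
A few of the basics mentioned above, how an array differs from a list, reshaping, and a simple array operation, look roughly like this. It is a minimal sketch that assumes NumPy is installed; the sample values are arbitrary.

```python
import numpy as np

values = [1, 2, 3, 4, 5, 6]
arr = np.array(values)

# The same operator means different things for a list and an array.
doubled_list = values * 2   # list repetition: 12 elements
doubled_arr = arr * 2       # elementwise multiplication

# Reshape the six elements into 2 rows and 3 columns.
grid = arr.reshape(2, 3)

# Sum down each column (axis 0 runs along the rows).
col_sums = grid.sum(axis=0)

print(len(doubled_list), doubled_arr)
print(grid.shape, col_sums)
```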


Extras:

Vicki:


  • I write a newsletter, Normcore Tech, about all things tech that I’m not seeing covered in the mainstream tech media. I’ve written before about machine learning, data for NLP, Elon Musk memes, and Nginx.
  • There’s a free version that goes out once a week and paid subscribers get access to one more newsletter per week, but really it’s more about the idea of supporting in-depth writing about tech. vicki.substack.com

Michael:


  • pip 20.0 Released - Default to doing a user install (as if --user was passed) when the main site-packages directory is not writeable and user site-packages are enabled, cache wheels built from Git requirements, and more.
  • Homebrew: brew install python@3.8

Joke:

An SEO expert walks into a bar, bars, pub, public house, Irish pub, tavern, bartender, beer, liquor, wine, alcohol, spirits...