Gentle introduction to SASpy

An initial overview of SASpy

Alberto Negron true
12-05-2018

"SASPy is an interface to SAS that automatically generates SAS code from your Python 3.x scripts. You can import the SASPy module into Jupyter Notebook for SAS University Edition, create a connection to the SAS server, and then use Python objects and syntax to access SAS data and your licensed SAS software. You can also exchange data using the pandas software library.

When you submit your Python code, you can view your SAS results in your Jupyter notebook. You can also choose to view the SAS code that is automatically generated".(Institute, n.d.)

The above is how SAS defined saspy there isn’t much too comment about to be honest so let’s get our hands dirty.

We import saspy in the usual way. Notice that I am using SAS Studio (University Edition) so saspy comes pre-installed. If you need to install it then you can do pip install saspy (haven’t tried though).


import saspy
import pandas as pd

Create a SAS Session

It is as simple as follow.


sas = saspy.SASsession()

Using SAS Config named: default
SAS Connection established. Subprocess id is 19082

We will use sas to link subsequent commands to the SASsession().

Load a SAS dataset into a python object

We can read any SAS dataset in any library so long it is available to your SAS profile. So far I have only played with base SAS datasets I wonder if works fine with let’s say an Oracle libname for example. I will cover this in a different post.

We use sasdata method to pass the data set name as first parameter and then the libname. I would have done the other way around, libname first and then the table name to be consistent with libname.sas-data-set-name format. Not a deal breaker anyway.


cars = sas.sasdata('cars','sashelp')
print(type(cars))

<class 'saspy.sasbase.SASdata'>

SAS returns its own python object saspy.sasbase.SASdata but it can be converted to a pandas dataframe using to_df() if this is more convenient.


cars2 =cars.to_df()
print(type(cars2))

<class 'pandas.core.frame.DataFrame'>

SASpy and pandas share common methods that differentiate cosmetically only. Here is an example with describe()


cars.describe() #SAS
Variable Label N NMiss Median Mean StdDev Min P25 P50 P75 Max
0 MSRP 428 0 27635.0 32774.855140 19431.716674 10280.0 20329.50 27635.0 39215.0 192465.0
1 Invoice 428 0 25294.5 30014.700935 17642.117750 9875.0 18851.00 25294.5 35732.5 173560.0
2 EngineSize Engine Size (L) 428 0 3.0 3.196729 1.108595 1.3 2.35 3.0 3.9 8.3
3 Cylinders 426 2 6.0 5.807512 1.558443 3.0 4.00 6.0 6.0 12.0
4 Horsepower 428 0 210.0 215.885514 71.836032 73.0 165.00 210.0 255.0 500.0
5 MPG_City MPG (City) 428 0 19.0 20.060748 5.238218 10.0 17.00 19.0 21.5 60.0
6 MPG_Highway MPG (Highway) 428 0 26.0 26.843458 5.741201 12.0 24.00 26.0 29.0 66.0
7 Weight Weight (LBS) 428 0 3474.5 3577.953271 758.983215 1850.0 3103.00 3474.5 3978.5 7190.0
8 Wheelbase Wheelbase (IN) 428 0 107.0 108.154206 8.311813 89.0 103.00 107.0 112.0 144.0
9 Length Length (IN) 428 0 187.0 186.362150 14.357991 143.0 178.00 187.0 194.0 238.0

cars2.describe() #Pandas
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428.000000 428.000000 428.000000 426.000000 428.000000 428.000000 428.000000 428.000000 428.000000 428.000000
mean 32774.855140 30014.700935 3.196729 5.807512 215.885514 20.060748 26.843458 3577.953271 108.154206 186.362150
std 19431.716674 17642.117750 1.108595 1.558443 71.836032 5.238218 5.741201 758.983215 8.311813 14.357991
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
25% 20334.250000 18866.000000 2.375000 4.000000 165.000000 17.000000 24.000000 3104.000000 103.000000 178.000000
50% 27635.000000 25294.500000 3.000000 6.000000 210.000000 19.000000 26.000000 3474.500000 107.000000 187.000000
75% 39205.000000 35710.250000 3.900000 6.000000 255.000000 21.250000 29.000000 3977.750000 112.000000 194.000000
max 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000

Let’s see how many methods are available for SAS datasets:


sas_method_available_to_ds =len([method_name for method_name in dir(cars)])
print(sas_method_available_to_ds)

61

and now, let’s do the same check for pandas dataframes:


pandas_method_available_to_ds =len([method_name for method_name in dir(cars2)])
print(pandas_method_available_to_ds)

469

Pandas has way more methods to deal with dataframes than SAS! But I don’t think SAS strategy is to compete with Pandas. Instead SAS uses pandas as leverage for interfacing other environments. For example, there is no straight forward way to export data from SAS to MongoDB. I’ve seen some macros that are complex and rely on the OS that export the SAS data to a CSV file, then make calls to the mongoimport command to import the csv file in MongoDB.

In my view, a more elegant solutions would have been to convert the SAS data set to a Pandas data frame (either using saspy or read_sas()) convert the data frame to json and copy the data to mongoDB using pymongo. This is indeed a larger solution in terms of lines of code but definitely a more robust version which is OS independent.

Let’s see now some of the methods availble for SAS datasets in saspy:


print([method_name for method_name in dir(cars)])

['HTML', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_checkLogForError', '_dsopts', '_is_valid', '_returnPD', 'assessModel', 'bar', 'columnInfo', 'contents', 'describe', 'dsopts', 'head', 'heatmap', 'hist', 'impute', 'info', 'libref', 'logger', 'means', 'partition', 'results', 'sas', 'scatter', 'score', 'series', 'set_results', 'sort', 'table', 'tail', 'to_csv', 'to_df', 'to_df_CSV', 'to_frame', 'top', 'where']

top give the top n (default = 10) most common records of any given variable.


cars.top(var='make')
Make COUNT PERCENT
0 Toyota 28 6.542056
1 Chevrolet 27 6.308411
2 Mercedes-Benz 26 6.074766
3 Ford 23 5.373832
4 BMW 20 4.672897
5 Audi 19 4.439252
6 Honda 17 3.971963
7 Nissan 17 3.971963
8 Chrysler 15 3.504673
9 Volkswagen 15 3.504673

columnInfo() provides information about columns held in the dataset in standard SAS template.


cars.columnInfo()
Member Num Variable Type Len Pos Format Label
0 SASHELP.CARS 9 Cylinders Num 8 24
1 SASHELP.CARS 5 DriveTrain Char 5 147
2 SASHELP.CARS 8 EngineSize Num 8 16 Engine Size (L)
3 SASHELP.CARS 10 Horsepower Num 8 32
4 SASHELP.CARS 7 Invoice Num 8 8 DOLLAR8.
5 SASHELP.CARS 15 Length Num 8 72 Length (IN)
6 SASHELP.CARS 11 MPG_City Num 8 40 MPG (City)
7 SASHELP.CARS 12 MPG_Highway Num 8 48 MPG (Highway)
8 SASHELP.CARS 6 MSRP Num 8 0 DOLLAR8.
9 SASHELP.CARS 1 Make Char 13 80
10 SASHELP.CARS 2 Model Char 40 93
11 SASHELP.CARS 4 Origin Char 6 141
12 SASHELP.CARS 3 Type Char 8 133
13 SASHELP.CARS 13 Weight Num 8 56 Weight (LBS)
14 SASHELP.CARS 14 Wheelbase Num 8 64 Wheelbase (IN)

I wonder that python modules come shipped with SAS Studio:


help('modules')

Please wait a moment while I gather a list of all available modules…


/usr/lib64/python3.5/site-packages/IPython/kernel/__init__.py:13: ShimWarning: The `IPython.kernel` package has been deprecated since IPython 4.0.You should import from ipykernel or jupyter_client instead.
      "You should import from ipykernel or jupyter_client instead.", ShimWarning)


    /usr/lib64/python3.5/site-packages/sas_kernel/data
    CDROM               asyncore            itertools           select
    DLFCN               atexit              jinja2              selectors
    IN                  audioop             json                setuptools
    IPython             autoreload          jsonschema          shelve
    TYPES               backports           jupyter             shlex
    __future__          base64              jupyter_client      shutil
    _ast                bdb                 jupyter_core        signal
    _bisect             binascii            keyword             simplegeneric
    _bootlocale         binhex              lib2to3             site
    _bz2                bisect              linecache           six
    _codecs             builtins            locale              smtpd
    _codecs_cn          bz2                 logging             smtplib
    _codecs_hk          cProfile            lzma                sndhdr
    _codecs_iso2022     calendar            macpath             socket
    _codecs_jp          cgi                 macurl2path         socketserver
    _codecs_kr          cgitb               mailbox             spwd
    _codecs_tw          chunk               mailcap             sqlite3
    _collections        cmath               markupsafe          sre_compile
    _collections_abc    cmd                 marshal             sre_constants
    _compat_pickle      code                math                sre_parse
    _compression        codecs              metakernel          ssl
    _crypt              codeop              mimetypes           stat
    _csv                collections         mistune             statistics
    _ctypes             colorsys            mmap                storemagic
    _ctypes_test        compileall          modulefinder        string
    _datetime           concurrent          multiprocessing     stringprep
    _dbm                configparser        nbconvert           struct
    _decimal            contextlib          nbformat            subprocess
    _dummy_thread       copy                netrc               sunau
    _functools          copyreg             nis                 symbol
    _gdbm               crypt               nntplib             sympyprinting
    _hashlib            csv                 notebook            symtable
    _heapq              ctypes              ntpath              sys
    _imp                curses              nturl2path          sysconfig
    _io                 cythonmagic         numbers             syslog
    _json               datetime            numpy               tabnanny
    _locale             dateutil            opcode              tarfile
    _lsprof             dbm                 operator            telnetlib
    _lzma               decimal             optparse            tempfile
    _markupbase         decorator           os                  termios
    _md5                difflib             ossaudiodev         test
    _multibytecodec     dis                 pandas              testpath
    _multiprocessing    distutils           pandocfilters       textwrap
    _opcode             doctest             parser              this
    _operator           dummy_threading     pathlib             threading
    _osx_support        easy_install        pdb                 time
    _pickle             email               pexpect             timeit
    _posixsubprocess    encodings           pickle              tkinter
    _pydecimal          ensurepip           pickleshare         token
    _pyio               entrypoints         pickletools         tokenize
    _random             enum                pipes               tornado
    _sha1               errno               pkg_resources       trace
    _sha256             faulthandler        pkgutil             traceback
    _sha512             fcntl               platform            tracemalloc
    _signal             filecmp             plistlib            traitlets
    _sitebuiltins       fileinput           poplib              tty
    _socket             fnmatch             posix               turtle
    _sqlite3            formatter           posixpath           types
    _sre                fractions           pprint              typing
    _ssl                ftplib              profile             unicodedata
    _stat               functools           prompt_toolkit      unittest
    _string             gc                  pstats              urllib
    _strptime           genericpath         pty                 uu
    _struct             getopt              ptyprocess          uuid
    _symtable           getpass             pwd                 venv
    _sysconfigdata      gettext             py_compile          warnings
    _testbuffer         glob                pyclbr              wave
    _testcapi           grp                 pydoc               wcwidth
    _testimportmultiple gzip                pydoc_data          weakref
    _testmultiphase     hashlib             pyexpat             webbrowser
    _thread             heapq               pygments            widgetsnbextension
    _threading_local    hmac                pytz                wsgiref
    _tkinter            html                queue               xdrlib
    _tracemalloc        http                quopri              xml
    _warnings           imaplib             random              xmlrpc
    _weakref            imghdr              re                  xxlimited
    _weakrefset         imp                 readline            xxsubtype
    abc                 importlib           reprlib             zipapp
    aifc                inspect             resource            zipfile
    antigravity         io                  rlcompleter         zipimport
    argparse            ipaddress           rmagic              zlib
    array               ipykernel           runpy               zmq
    ast                 ipykernel_launcher  sas_kernel          
    asynchat            ipython_genutils    saspy               
    asyncio             ipywidgets          sched               
    
    Enter any module name to get more help.  Or, type "modules spam" to search
    for modules whose name or summary contain the string "spam".

There are a few “key” packages missing in action for data science such as NLTK, scikit_learn, statsmodels among others.

One really neat feature of saspy is the ability to send SAS code and get feedback. This alone is a huge advantage for mixing up SAS and Python and open up a wide variety of scenarios where such mix can be a win-win combination!


c = sas.submit("""
proc contents data=sashelp.class;run;
""")

The above code, send to the SASsession() a proc contents statement and save results in the python variable c


type(c)

dict

As you can see, the variable c is a standard python dictionary and inspecting its keys we can see:


c.keys()

that the SAS log and SAS List are available. Please note that LST will only make sense for procedures that produce any kind of output such as proc print, proc means, proc tabulate, etc. In the case of data steps LST is empty.

dict_keys([‘LOG’, ‘LST’])

Let’s have a look at the output of the proc contents:


print(c['LST'][1:1000])

!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<meta content="SAS 9.4" name="generator"/>
<title>SAS Output</title>
<style>
/*<![CDATA[*/
.body.c > table, .body.c > pre, .body.c div > table,
.body.c div > pre, .body.c > table, .body.c > pre,
.body.j > table, .body.j > pre, .body.j div > table,
.body.j div > pre, .body.j > table, .body.j > pre,
.body.c p.note, .body.c p.warning, .body.c p.error, .body.c p.fatal,
.body.j p.note, .body.j p.warning, .body.j p.error, .body.j p.fatal,
.body.c > table.layoutcontainer, .body.j > table.layoutcontainer { margin-left: auto; margin-right: auto }
.layoutregion.l table, .layoutregion.l pre, .layoutregion.l p.note,
.layoutregion.l p.warning, .layoutregion.l p.error, .layoutregion.l p.fatal { margin-left: 0 }
.layoutregion.c table, .layoutregion.c pre, .layoutregion.c p.note,
.layoutregion.c p.warning, .layoutregion.c p.error, .layoutregion.c p.fatal { margin-left: auto; margin-right:

It looks awful!!! but there is a quick fix for this:


from IPython.display import HTML

HTML(c['LST'])
<!DOCTYPE html> SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name SASHELP.CLASS Observations 19
Member Type DATA Variables 5
Engine V9 Indexes 0
Created 08/10/2017 01:58:13 Observation Length 40
Last Modified 08/10/2017 01:58:13 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label Student Data    
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding us-ascii ASCII (ANSI)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 1632
Obs in First Data Page 19
Number of Data Set Repairs 0
Filename /opt/sasinside/SASHome/SASFoundation/9.4/sashelp/class.sas7bdat
Release Created 9.0401M5
Host Created Linux
Inode Number 2438
Access Permission rw-r–r–
Owner Name sas
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len
3 Age Num 8
4 Height Num 8
1 Name Char 8
2 Sex Char 1
5 Weight Num 8

Now you can see the output as you would see it in SAS.

That’s all for now folks but keep in tune as I will be adding more content about saspy.

See you later!

Institute, SAS. n.d. “What Is Saspy?” SAS Institute. https://support.sas.com/software/products/university-edition/faq/jn_saspy_whatis.htm.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Negron (2018, Dec. 5). Data Addict's Secret Diary: Gentle introduction to SASpy. Retrieved from http://www.dataaddict.me/posts/2018-12-05-gentle-introduction-to-saspy/

BibTeX citation

@misc{negron2018gentle,
  author = {Negron, Alberto},
  title = {Data Addict's Secret Diary: Gentle introduction to SASpy},
  url = {http://www.dataaddict.me/posts/2018-12-05-gentle-introduction-to-saspy/},
  year = {2018}
}