Reading large SAS datasets in Python

Read large datasets in chunks

Alberto Negron
12-04-2018

Generally speaking, there are two approaches to reading large SAS datasets in Python (that I know of!!!). In this post I will cover the easy one, which is reading the big fat dataset in chunks.

Imagine you have a SAS dataset with 100 million rows that is 12GB in size. Let's also say the RAM available to you is 16GB (think cluster, laptop, etc.). Loading the whole dataset would leave only about 4GB for processing it (and for any other resources that may be up & running), which can be a cumbersome situation: performance issues can make run times much longer, and code can fail entirely due to insufficient memory.

One way to avoid this conundrum is to break the dataset into smaller parts that are faster to read. Luckily, pandas' read_sas function comes with an option to break the dataset into chunks of a given number of records.

In the example below I have broken down heart.sas7bdat (available in the sashelp library) into chunks of 500 records each. Bear in mind this is a really small dataset that I am using for educational purposes only.

The strategy I use to decide on the number of records is fairly simple - create chunks that are around 100 megabytes in size. The reason behind this is that performance is rarely a problem for data frames under 100MB. For the 12GB dataset above, that works out to around 120 chunks to process.
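As a back-of-the-envelope calculation (the figures below refer to the hypothetical 12GB, 100-million-row dataset above, not to the heart data), you can estimate a chunksize from the file size and row count:


file_size_bytes = 12 * 1024**3                 # ~12GB on disk
total_rows = 100_000_000
bytes_per_row = file_size_bytes / total_rows   # ~129 bytes per row

target_chunk_bytes = 100 * 1024**2             # aim for ~100MB per chunk
chunksize = int(target_chunk_bytes / bytes_per_row)
print(chunksize)                               # ~814,000 rows per chunk, i.e. roughly 120-125 chunks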


import pandas as pd

sasds = pd.read_sas('/Users/anegron/projects/SASstudio/myfolders/output/heart2.sas7bdat',
                    encoding='utf-8',
                    chunksize=500,iterator=True)
type(sasds)

pandas.io.sas.sas7bdat.SAS7BDATReader

As you can see, with the options chunksize and iterator we do not get back a pandas DataFrame but a SAS7BDATReader, which is a Python iterator. In Python, iterators have exactly one job: return the “next” item in our iterable object. I do not want to go down the rabbit hole of explaining Python's iterables, but if you need to take one thing with you, it would be this: iterators save resources.
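If you want to see the iterator at work before looping, here is a quick, purely illustrative check. Calling next() pulls just one 500-row chunk into memory; if you run it, re-create the reader afterwards so the loop below still sees every chunk:


first_chunk = next(sasds)    # reads only the next 500 rows into memory
print(first_chunk.shape)     # (500, 17)

# re-create the reader so the loop below starts from the first chunk again
sasds = pd.read_sas('/Users/anegron/projects/SASstudio/myfolders/output/heart2.sas7bdat',
                    encoding='utf-8',
                    chunksize=500, iterator=True)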


dfs = [] # holds data chunks
for chunk in sasds:
    dfs.append(chunk)

Each chunk can be released from memory once the loop moves on to the next one, as long as you do not keep a reference to it. In the above code I am just appending each chunk to a list. In a real application you would not use the dfs list, as that means holding the whole dataset (split into chunks) in memory. I am using it here for (again) educational purposes only.

The list contains 11 chunks:


print(len(dfs))

11

Adding up all records across chunks gives us the total number of records in the heart dataset:


count_all_records = sum([len(x) for x in dfs])
print(count_all_records)

5209
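If you are curious about how those 5,209 rows are split across the 11 chunks, you can list each chunk's length: the first ten chunks hold 500 rows each and the last one picks up the remaining 209.


print([len(x) for x in dfs])

[500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 209]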

The important point to notice here is that every chunk is indeed a proper pandas DataFrame.


type(dfs[0])

pandas.core.frame.DataFrame

This means we can apply everything we know about DataFrames to each chunk.


print(dfs[0].columns)

Index(['Status', 'DeathCause', 'AgeCHDdiag', 'Sex', 'AgeAtStart', 'Height',
       'Weight', 'Diastolic', 'Systolic', 'MRW', 'Smoking', 'AgeAtDeath',
       'Cholesterol', 'Chol_Status', 'BP_Status', 'Weight_Status',
       'Smoking_Status'],
      dtype='object')

print(dfs[0].head(5))

  Status DeathCause  AgeCHDdiag     Sex  AgeAtStart  Height  Weight  \
0   Dead      Other         NaN  Female        29.0   62.50   140.0   
1   Dead     Cancer         NaN  Female        41.0   59.75   194.0   
2  Alive        NaN         NaN  Female        57.0   62.25   132.0   
3  Alive        NaN         NaN  Female        39.0   65.75   158.0   
4  Alive        NaN         NaN    Male        42.0   66.00   156.0   

   Diastolic  Systolic    MRW  Smoking  AgeAtDeath  Cholesterol Chol_Status  \
0       78.0     124.0  121.0      0.0        55.0          NaN         NaN   
1       92.0     144.0  183.0      0.0        57.0        181.0   Desirable   
2       90.0     170.0  114.0     10.0         NaN        250.0        High   
3       80.0     128.0  123.0      0.0         NaN        242.0        High   
4       76.0     110.0  116.0     20.0         NaN        281.0        High   

  BP_Status Weight_Status   Smoking_Status  
0    Normal    Overweight       Non-smoker  
1      High    Overweight       Non-smoker  
2      High    Overweight  Moderate (6-15)  
3    Normal    Overweight       Non-smoker  
4   Optimal    Overweight    Heavy (16-25)  

You can process every chunk in the same way you would have processed the original dataset and then output the results to a database (using SQL or SQLAlchemy) or append them to a CSV file, as sketched below.
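Here is a minimal sketch of that pattern, assuming a hypothetical output file heart_alive.csv and using a simple per-chunk filter as a stand-in for whatever processing you need:


import pandas as pd

reader = pd.read_sas('/Users/anegron/projects/SASstudio/myfolders/output/heart2.sas7bdat',
                     encoding='utf-8',
                     chunksize=500, iterator=True)

for i, chunk in enumerate(reader):
    processed = chunk[chunk['Status'] == 'Alive']   # any per-chunk processing goes here
    # append each processed chunk to the CSV, writing the header only once
    processed.to_csv('heart_alive.csv', mode='a', header=(i == 0), index=False)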

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Negron (2018, Dec. 4). Data Addict's Secret Diary: Reading large SAS datasets in Python. Retrieved from http://www.dataaddict.me/posts/2018-12-04-reading-large-sas-datasets-in-python/

BibTeX citation

@misc{negron2018reading,
  author = {Negron, Alberto},
  title = {Data Addict's Secret Diary: Reading large SAS datasets in Python},
  url = {http://www.dataaddict.me/posts/2018-12-04-reading-large-sas-datasets-in-python/},
  year = {2018}
}