Oh wow. I can't believe I haven't posted at all in 2021! It's been quite a ride. Ever since the pandemic started, time has seemed to stand still for me. Though I've been out a bit more than in 2020, my mind still goes back to 2019 when I talk about my last trip (well, technically it should be October, when I visited Seattle. Perfect timing, as I haven't gotten to experience fall since relocating to Texas). I just wrapped up another project with an intern this semester, which was related to transfer learning on BERT. Though I didn't write the code myself, I always feel that I learn more when mentoring an intern (partly because, in my role as a resource provider, I ought to know more). There will be another post about BERT and what we learned, but this one will be short & sweet. (AWS is GREAT 😀)
If you're using the public version of AWS with no complicated corporate setup, then this post is not for you. But for us, since we take customer data seriously, there are a lot of restrictions. I feel that my hands are tied by the limited access my role provides (AccessDenied, yeah, it's you again). On top of that, due to the internal proxy setup, we can't run even simple requests such as:
nltk.download('punkt')
Bribing IT and asking the admins to reconfigure my network setup might be a solution, but again, that violates company rules :) Since we had to use basic NLP packages to process text on our customized AWS setup, I finally found a way (using NLTK as an example):
- Download the NLTK data from the company-approved mirror site (if it's not available in your internal repositories, talk to IT/the open-source team again).
- Upload the package (in my case, it's gh-pages.zip) to S3.
- In your SageMaker notebook instance, download the package from S3 (make sure that the IAM policy is properly set up; in other words, SageMaker can connect to S3):
# source: https://gist.github.com/mikhailklassen/de3da3584c45cedb5b0df7feaead6b1f#file-download_file_s3_sagemaker-py
# AWS Python SDK
import boto3
# When running on SageMaker, need execution role
from sagemaker import get_execution_role
role = get_execution_role()
# Declare bucket name, remote file, and destination
my_bucket = 's3-bucket-name'
orig_file = 'full/path/to/file.p'
dest_file = 'local/path/to/file.p'
# Connect to S3 bucket and download file
s3 = boto3.resource('s3')
s3.Bucket(my_bucket).download_file(orig_file, dest_file)
- Once the file has been transferred to your SageMaker notebook container, unzip it:
unzip gh-pages.zip -d /path/to/directory
- Point NLTK to the unzipped data manually:
import nltk
nltk.data.path.append("/path/to/directory")
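punkt should then be usable. Be careful with the path you append: NLTK looks for punkt under tokenizers/punkt relative to that path, so it has to be the directory that contains the tokenizers folder, not the punkt folder itself. The package should also already be unzipped.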
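The unzip and path steps can also be done from Python, which makes it easier to check that the appended path is the right one. The sketch below is only an illustration under a few assumptions: the helper names (extract_archive, find_nltk_data_root, register_with_nltk) are made up for this post, and it assumes the archive unpacks to a folder tree that has tokenizers/punkt (or a still-zipped tokenizers/punkt.zip) somewhere inside.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Unzip zip_path into dest_dir, like `unzip zip_path -d dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


def find_nltk_data_root(extracted_dir, resource="tokenizers/punkt"):
    """Return the directory to append to nltk.data.path, or None.

    NLTK resolves a resource as <path entry>/<resource>, so the entry must be
    the folder that contains tokenizers/, not the punkt folder itself.
    """
    category, name = resource.split("/", 1)
    for candidate in sorted(Path(extracted_dir).rglob(category)):
        if not candidate.is_dir():
            continue
        if (candidate / name).is_dir():
            return candidate.parent
        # Assumption: the resource may still be zipped inside the archive.
        zipped = candidate / (name + ".zip")
        if zipped.is_file():
            extract_archive(zipped, candidate)
            if (candidate / name).is_dir():
                return candidate.parent
    return None


def register_with_nltk(data_root, resource="tokenizers/punkt"):
    """Append data_root to NLTK's search path and confirm the resource loads."""
    import nltk  # imported here so the helpers above work without NLTK

    nltk.data.path.append(str(data_root))
    nltk.data.find(resource)  # raises LookupError if NLTK still cannot find it
```

Usage would look something like root = find_nltk_data_root(extract_archive("gh-pages.zip", "/path/to/directory")), followed by register_with_nltk(root). If find_nltk_data_root returns None, the archive probably doesn't contain punkt where expected.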
I doubt the trick I've described here will ever be useful to anyone on the public web. The lesson I learned: to test things in a constrained SageMaker notebook container, upload the files to S3 first and then download them from there. Uploading directly to the SageMaker notebook instance failed miserably for me (only partial data was transmitted, and the speed was way too slow).
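As a rough illustration of that lesson, the S3 round trip can be wrapped in two small helpers that also compare file sizes, so a partial transfer gets noticed right away. This is a sketch, not our production code: the function names and the s3_client parameter are hypothetical, the bucket and key are placeholders you would fill in, and it relies on the boto3 S3 client methods upload_file, download_file, and head_object.

```python
import os


def _get_client(s3_client=None):
    """Use the client passed in, or create a default boto3 S3 client."""
    if s3_client is not None:
        return s3_client
    import boto3  # imported lazily so a stub client can be used in tests

    return boto3.client("s3")


def upload_via_s3(local_path, bucket, key, s3_client=None):
    """Upload a local file to S3 and check that the stored size matches."""
    s3 = _get_client(s3_client)
    s3.upload_file(str(local_path), bucket, key)
    remote_size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    local_size = os.path.getsize(local_path)
    if remote_size != local_size:
        raise IOError(f"upload incomplete: {remote_size} of {local_size} bytes")
    return remote_size


def download_from_s3(bucket, key, local_path, s3_client=None):
    """Download an S3 object and check that the local size matches."""
    s3 = _get_client(s3_client)
    expected = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    s3.download_file(bucket, key, str(local_path))
    actual = os.path.getsize(local_path)
    if actual != expected:
        raise IOError(f"download incomplete: {actual} of {expected} bytes")
    return actual
```

The upload half would run wherever the file lives, and the download half inside the SageMaker notebook, for example download_from_s3("s3-bucket-name", "full/path/to/file.p", "local/path/to/file.p"). A size check is a cheap sanity test only; it won't catch corrupted content of the same length.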