<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Moon river]]></title><description><![CDATA[My huckleberry friend, moon river, and me.]]></description><link>https://blog.ruosilin.com/</link><image><url>https://blog.ruosilin.com/favicon.png</url><title>Moon river</title><link>https://blog.ruosilin.com/</link></image><generator>Ghost 5.26</generator><lastBuildDate>Sat, 04 Jan 2025 02:18:27 GMT</lastBuildDate><atom:link href="https://blog.ruosilin.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[New Domain!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I love my .me domain, but it&apos;s getting expensive and I received so many unwanted emails. So effective today, my new domain for the blog would be <a href="blog.ruosilin.com">blog.ruosilin.com</a>! It&apos;s been a while since I set up a new domain, but it&apos;s not</p>]]></description><link>https://blog.ruosilin.com/new-domain/</link><guid isPermaLink="false">6445e9490572a05f49758f8b</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Mon, 24 Apr 2023 02:31:19 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>I love my .me domain, but it&apos;s getting expensive and I received so many unwanted emails. So effective today, my new domain for the blog would be <a href="blog.ruosilin.com">blog.ruosilin.com</a>! It&apos;s been a while since I set up a new domain, but it&apos;s not hard - I have Cloudflare to take care of DNS, certbot for HTTPS redirect, and (of course! always fun) Ghost set up to point to the new subdomain. In fact I didn&apos;t anticipate a .com domain, my original plan was to get a .ds but it&apos;s handshake only (in other words: active in the crypto world). 
Either way I&apos;m glad with the new domain and would love to share more here!!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hello from Ghost 5.x]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>It took me ~5 hours to finally take a screenshot of this</p>
<p><img src="https://blog.ruosilin.com/content/images/2022/12/Capture-1.PNG" alt="Capture-1" loading="lazy"></p>
<p>A couple of notes (mainly for myself to refer to in the future):</p>
<ul>
<li>It looks like Ghost prefers to have node upgraded through NodeSource Node.js Binary Distributions (<a href="https://ghost.org/docs/faq/node-versions/">source1</a>, <a href="https://sandervoogt.com/how-to-fix-ghost-start-not-working/">source2</a>), but <code>nvm</code> works. Just make sure you get</li></ul>]]></description><link>https://blog.ruosilin.com/hello-from-ghost-5-x/</link><guid isPermaLink="false">63a72cff79c18864caa271f5</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Sat, 24 Dec 2022 17:09:15 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>It took me ~5 hours to finally take a screenshot of this</p>
<p><img src="https://blog.ruosilin.com/content/images/2022/12/Capture-1.PNG" alt="Capture-1" loading="lazy"></p>
<p>A couple of notes (mainly for myself to refer to in the future):</p>
<ul>
<li>It looks like Ghost prefers to have node upgraded through the NodeSource Node.js Binary Distributions (<a href="https://ghost.org/docs/faq/node-versions/">source1</a>, <a href="https://sandervoogt.com/how-to-fix-ghost-start-not-working/">source2</a>), but <code>nvm</code> works too. Just make sure you get <code>16.13.0</code> (I thought any <code>16.x</code> would work, but <code>16.0.0</code> failed, so I had to pin to the exact version).</li>
<li>Make sure that <code>ExecStart</code> in your systemd service file points to the current node version (which can be found through <code>whereis node</code>). Also, after updating, <code>cp</code> the file to the systemd folder (or create a symlink... I have no idea why the one under my <code>&lt;blog root&gt;/system/files</code> is not symlinked to the one under <code>/etc/systemd/system/</code>) and run <code>systemctl daemon-reload</code> so that the changes take effect.</li>
<li>Somehow my production config had <code>database = mysql2</code>... if you run into errors you&apos;ll need to manually run <code>ghost config set database.client mysql</code> (this command will be suggested if you have a config error when running <code>ghost start</code>)</li>
<li>It&apos;s possible (so far) to upgrade to MySQL 8 on Ubuntu 16.04, just remember to back up all the data. <code>mysqldump -u root -p --all-databases &gt; alldb.sql</code> (<a href="https://stackoverflow.com/a/26096339">source</a>)</li>
<li>If <code>ghost start</code> keeps exiting with errors, try running <code>ghost run</code> first and examine the console output. You may see errors like <code>WARN Can&apos;t connect to the bootstrap socket (localhost 8000) ECONNREFUSED</code>, but that&apos;s okay - the CLI will attempt 3 times and eventually give up. When starting normally, the production config file <code>config.production.json</code> should NOT contain any bootstrap socket related info.</li>
<li>Reading the error logs definitely helps... <code>journalctl -u your_ghost_service -n 50</code> and fix accordingly...</li>
</ul>
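<p>For future reference, the <code>ExecStart</code> bullet above boils down to a unit file like the sketch below - the service name, user, and paths here are illustrative placeholders, not my actual setup (check <code>whereis node</code> and your own Ghost install directory for the real values):</p>

```ini
# /etc/systemd/system/ghost_blog.service (illustrative name and paths)
[Service]
User=ghost-mgr
Environment="NODE_ENV=production"
WorkingDirectory=/var/www/ghost
# Must point at the node binary currently in use,
# e.g. the output of `whereis node` after an nvm upgrade:
ExecStart=/home/ghost-mgr/.nvm/versions/node/v16.13.0/bin/node /usr/bin/ghost run
```

<p>After editing, copy the file into <code>/etc/systemd/system/</code> and run <code>systemctl daemon-reload</code>.</p>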
<p>I don&apos;t anticipate another major upgrade in the short run, unless another security vulnerability is disclosed. Or maybe, as <a href="https://wzyboy.im/post/1092.html">wzyboy</a> has suggested... time to move to Lektor?</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Migration from Twitter to Fediverse]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Well, it&apos;s been a year since I last posted here. I feel that I&apos;ve easily lost track of time as I grow older, and by googling I found a <a href="https://studyfinds.org/why-time-flies-as-we-get-older/#:~:text=Now%2C%20a%20fascinating%20study%20offers,of%20time%20to%20speed%20up.">plausible explanation</a>. ChatGPT also provides another perspective, which is not entirely based on science but does offer</p>]]></description><link>https://blog.ruosilin.com/migration-to-fediverse/</link><guid isPermaLink="false">63a6048318f35e0e609aad42</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Fri, 23 Dec 2022 20:39:46 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Well, it&apos;s been a year since I last posted here. I feel that I&apos;ve easily lost track of time as I grow older, and by googling I found a <a href="https://studyfinds.org/why-time-flies-as-we-get-older/#:~:text=Now%2C%20a%20fascinating%20study%20offers,of%20time%20to%20speed%20up.">plausible explanation</a>. ChatGPT also provides another perspective, which is not entirely based on science but does offer a sense of comfort as well:</p>
<p><img src="https://blog.ruosilin.com/content/images/2022/12/Capture.PNG" alt="Capture" loading="lazy"></p>
<p>Back to the topic - though the whole year felt like a blink of an eye, a lot of things happened. Not only to me (on a personal level there are indeed some achievements &amp; lessons I took note of - maybe I&apos;ll share them in another post?) but also to the world, and an important one is the recent <a href="https://en.wikipedia.org/wiki/Acquisition_of_Twitter_by_Elon_Musk">Twitter fiasco</a>. I first joined Twitter in October 2008 (fun fact: President Obama was one of my followers), and have been an active user since 2011 (one of the DAU for sure). To me, Twitter is not only a platform for information but also a place full of my digital life and memories. There I met a handful of amazing people, and we even had some offline face time together, which led to lasting relationships in the real world. I was able to reflect, relax and rewind. I picked up trendy topics and had in-depth conversations with avid practitioners in the same space, which would otherwise be impossible given the physical distance between continents. So Twitter has been, and always will be, special in my mind. As a regular user, I don&apos;t want to talk too much about the recent policy changes since the company went private. What motivated me to leave, though, is the following (it looks like this policy has since been reversed, but the scar is still fresh in my mind):</p>
<p><img src="https://blog.ruosilin.com/content/images/2022/12/FkR1CyoX0AANLMc.jpg" alt="FkR1CyoX0AANLMc" loading="lazy"></p>
<p>Since the <a href="https://www.reuters.com/technology/twitter-exec-says-50-employees-lost-jobs-following-acquisition-2022-11-04/">bloody layoff</a> in early November, I&apos;ve noticed <code>Mastodon</code> being mentioned a lot on my timeline. A quick search educates me that it is a decentralized, open-sourced social network that belongs to the greater Fediverse framework. I&apos;d like to avoid using <code>Mastodon</code> to represent <code>Fediverse</code> though, just like you can&apos;t use Toyota as the synonym for the whole consumer automobile industry. Mastodon, though, is the most known, biggest Fediverse software implementation out there. I also have an account under the <code>mastodon.world</code> server, which I have no plan to stay forever as eventually I&apos;ll self-host an instance powered by <a href="https://pleroma.social/">Pleroma</a>. For those who are still hesitant to make the move from Twitter to Fediverse, though, I hope to provide some tricks and observations so that your transition will be smooth as well.</p>
<h2 id="useful-tools">Useful Tools</h2>
<ul>
<li><a href="https://instances.social/">Mastodon instances</a>: this site helps me quickly navigate and narrow down a list of candidate Mastodon servers to join. I like the <code>Advanced list</code> which allows me to rank servers by uptime, obs, and users. Rule of thumb: join the ones that are large enough with high uptime, so that you&apos;ll have the best user experience. Or maybe join your friends&apos; servers, as word of mouth is the most basic but reliable source.</li>
<li><a href="https://fedifinder.glitch.me/">Fedifinder</a>: this tool can scrape any Fediverse account published on Twitter from your following list. I didn&apos;t read the source code so unsure how it works, but I suspect it scans through your following&apos;s bio, user name and possibly the top 5 tweets to look for any known Fediverse instance link. We don&apos;t always have to use machine learning, simple pattern matching has an edge too :) I was able to follow ~85 friends in batch thanks to this tool.</li>
<li><a href="https://github.com/klausi/mastodon-twitter-sync">Mastodon Twitter Sync</a>: this package can automatically sync your toots to Twitter, and vice versa. Note: quote retweets on Twitter will be synchronized to Fediverse despite <code>sync_retweets</code> = False, that&apos;s because quote retweets != retweets. Also when synchronizing reblogs to Twitter, the original author name does not contain any server information. But no other complaints since it&apos;s a great free tool! It can also link your toots/tweets with your self-replies. Setting up using the Docker build is easy, but remember to take out the <code>-t</code> flag when running it in your crontab... my mailbox was bombarded with the error <code>the input device is not a TTY</code> when first set to run every 5 minutes...</li>
<li><a href="https://tooot.app/">Tooot: a Mastodon mobile client</a>: it simply works with an elegant design. Though the main focus is for Chinese users, browsing toots in other languages (en &amp; CJK based) is seamless. Sometimes, though, fetching the updates can be slowed. I wonder if it&apos;s an issue related to my server, though.</li>
</ul>
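<p>For instance, my note on the <code>-t</code> flag translates into a crontab entry like this sketch - the image name, paths, and schedule here are assumptions for illustration, not my exact setup:</p>

```shell
# Run the Docker build of the sync tool every 5 minutes.
# Use -i only: cron attaches no TTY, so adding -t fails with
# "the input device is not a TTY".
*/5 * * * * docker run --rm -i -v /home/me/sync-config:/data mastodon-twitter-sync >> /var/log/mastodon-sync.log 2>&1
```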
<h2 id="observations">Observations</h2>
<ul>
<li>You don&apos;t have to follow users from the same server; as long as your server is not on the blocklist of another server, you should be able to search for the other user&apos;s handle (e.g. <code>@chrisalbon@sigmoid.social</code>) and follow him/her. If this is the first time the other user is being followed by someone on your server, it will take time to fetch his/her timeline. Otherwise you should see a couple of toots already present on your side.</li>
<li>Refreshing timelines can be a frustrating experience compared with Twitter, since my server seems to be popular but has only limited resources. So support the server owner by whatever means, if you can! The latency is a bit high but bearable. I plan to self-host <a href="https://pleroma.social/">Pleroma</a>, mainly because it is lightweight and I&apos;d like to have more control over my data - just as I decided to run a self-hosted blog like this one.</li>
<li>The 140-character limit no longer applies here! However, I&apos;ve lost the ability to express long, coherent thoughts on a social platform... I&apos;ll pick it up gradually :)</li>
<li>You cannot quote &amp; reply to a toot on Mastodon. It looks like this is <a href="https://fedi.tips/how-to-use-mastodon-and-the-fediverse-basic-tips/#WhyCantIQuoteOtherPostsInMastodon">a deliberate design choice</a>, but I would much appreciate having it. There is a workaround, though: you can always manually copy the link to a toot and paste it at the end of your own toot, just like you&apos;d comment on any link on the web. It&apos;s just... plain inconvenient :)</li>
</ul>
<p>I probably will come back to this post to add more observations/technical details as my time on Mastodon/in Fediverse progresses. So stay tuned! In the meantime, happy holidays! Can&apos;t wait to see how 2023 will unfold for us.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Quick Note on AWS Package Installation under Company Proxy]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Oh wow. I can&apos;t believe I haven&apos;t posted in the entire 2021! It&apos;s been quite a ride. Ever since the pandemic started, time seems to be stagnant for me. Though I&apos;ve been out a bit more than in 2020, my mind was still</p>]]></description><link>https://blog.ruosilin.com/what-a-year/</link><guid isPermaLink="false">61b4093e02ad290af34272d2</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Sat, 11 Dec 2021 02:37:06 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Oh wow. I can&apos;t believe I haven&apos;t posted in the entire 2021! It&apos;s been quite a ride. Ever since the pandemic started, time seems to be stagnant for me. Though I&apos;ve been out a bit more than in 2020, my mind was still in 2019 when talking about the last trip (well, technically it was in October, when I visited Seattle - perfect timing, as I hadn&apos;t gotten to experience fall since relocating to Texas). I just wrapped up another project with an intern this semester, which was related to transfer learning on BERT. Though I wasn&apos;t writing the code myself, I always feel that I learn more when mentoring an intern (partly because, in my role as a resource provider, I ought to know more). There will be another post about BERT and what we learned, but this one will be short &amp; sweet. (AWS is GREAT &#x1F600;)</p>
<p>If you&apos;re using the public version of AWS with no complicated corporate setup, then this post is not for you. But for us, since we take customer data seriously, there are a lot of restrictions. I feel that my hands are tied due to the limited access my role provides (<code>AccessDenied</code>, yeah it&apos;s you again). On top of that, due to the internal proxy setup, we are not allowed to run simple requests such as:</p>
<blockquote>
<p><code>nltk.download(&apos;punkt&apos;)</code></p>
</blockquote>
<p>Bribing IT and asking the admins to reconfigure my network setup might be a solution, but again that violates company rules :) Since we had to use basic NLP packages to further process text on our customized AWS, I finally found a way (using NLTK as an example):</p>
<ol>
<li>Download <a href="https://www.nltk.org/data.html">the NLTK data</a> from the company-approved mirror site (If it&apos;s not available in your internal repositories, talk to IT/the open-source team again).</li>
<li>Upload the package (in my case, it&apos;s <code>gh-pages.zip</code>) to S3.</li>
<li>In your SageMaker notebook instance, download the package from S3 (make sure that the IAM policy was properly set up - in other words, SageMaker can connect to S3):</li>
</ol>
<pre><code># source: https://gist.github.com/mikhailklassen/de3da3584c45cedb5b0df7feaead6b1f#file-download_file_s3_sagemaker-py
# AWS Python SDK
import boto3

# When running on SageMaker, need execution role
from sagemaker import get_execution_role
role = get_execution_role()

# Declare bucket name, remote file, and destination
my_bucket = &apos;s3-bucket-name&apos;
orig_file = &apos;full/path/to/file.p&apos;
dest_file = &apos;local/path/to/file.p&apos;

# Connect to S3 bucket and download file
s3 = boto3.resource(&apos;s3&apos;)
s3.Bucket(my_bucket).download_file(orig_file, dest_file)
</code></pre>
<ol start="4">
<li>Once the file is transferred to your SageMaker notebook container, unzip it: <code>unzip gh-pages.zip -d /path/to/directory</code></li>
<li>Load the file manually:</li>
</ol>
<pre><code>import nltk
nltk.data.path.append(&quot;/path/to/directory&quot;)
</code></pre>
<p><code>punkt</code> should then be ready to use. Be careful with the appended path: it should be the directory that contains the <code>tokenizers</code> folder (i.e. <code>tokenizers/punkt</code> sits beneath it), and the package should already be unzipped.</p>
<p>I doubt the trick I provided here will ever be useful for anyone on the public web. My lesson learned: to test things in a constrained SageMaker notebook container, it&apos;s better to upload the files to S3 first and then download them from there. Direct upload to the SageMaker notebook instance will fail miserably (partial data transmitted, plus the speed is way too slow).</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Causal Inference: Basics & My Thoughts]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Well, technically I didn&apos;t apply the technique to solve a problem myself - as I served as a mentor in this case. Long story short, almost a year ago I was assigned to a project which aimed to identify top attributes associated with a target. Eventually, we found</p>]]></description><link>https://blog.ruosilin.com/my-experience-with-causal-inference/</link><guid isPermaLink="false">5fb46a9ce4681b0a8762d458</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Wed, 02 Dec 2020 02:05:15 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Well, technically I didn&apos;t apply the technique to solve a problem myself - as I served as a mentor in this case. Long story short, almost a year ago I was assigned to a project which aimed to identify top attributes associated with a target. Eventually, we found a handful of variables, and our client wanted to drill down specifically on one of them. Based on the sample data, I checked the association between that variable and the target, and the results were inconclusive. The project was then put on hold, but it lingered in my mind. I then learned that there is an active research field called <a href="https://en.wikipedia.org/wiki/Causal_inference">Causal Inference</a> which seems to provide systematic ways to answer the core question that my client has: that is, if we intervene on variable <code>X</code>, will it cause changes in the outcome <code>Y</code> (whether positive or negative)? Causal inference sounds like a perfect tool to address this kind of question, so I framed a research question and requested an intern to work on it. It turned out to be a bit more challenging than I originally thought, though.
I&apos;ll try to summarize the basics we&apos;ve covered for this project, which I believe are the stepping stones for advanced topics in this field. Hopefully, I&apos;ll be able to finish grasping the main ideas from <a href="https://mitpress.mit.edu/books/elements-causal-inference">ECI</a> then!</p>
<h2 id="sowhatiscausalinference">So, What Is Causal Inference?</h2>
<p>Per <a href="https://en.wikipedia.org/wiki/Causal_inference">Wikipedia</a>, causal inference...</p>
<blockquote>
<p>is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect.</p>
</blockquote>
<p>It could be described in many different ways under various contexts, but to me, it means: can we say that treatment causes a change in the outcome?</p>
<ul>
<li>Treatment: the intervention that we take.
<ul>
<li>For instance, whether or not to arrange a heart transplant for a patient; whether or not to change the wording of a homepage. These treatments are binary.</li></ul>]]></description><link>https://blog.ruosilin.com/hello-from-ghost-5-x/</link>
<li>If there are multiple treatments within a timeframe (e.g. yearly follow-ups with AIDS patients over 5 years, each time providing a different dose of medicine), those will be considered time-varying treatments, to which I have had no exposure so far (the <a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/">Causal Inference book</a> dedicates a whole section to them, and I found it quite difficult to follow).</li>
<li>I wonder if it&apos;s possible to have <em>continuous</em> treatments... it would be quite hard to do, especially in an experimental setting. Multi-level discrete treatments are more likely; fortunately, for my first project involving causal inference, I only need to deal with a binary treatment.</li>
</ul>
</li>
<li>Outcome: the end result of a study/an experiment/something that we are measuring.
<ul>
<li>For the 2 examples that I&apos;ve shown above, their outcomes could both be &quot;success&quot; (whether a patient survived or not; increase in revenue/CTR).</li>
<li>Outcome could be discrete (Yes/No) or continuous (e.g. change in revenue).</li>
</ul>
</li>
</ul>
<h2 id="randomizedexperimentsvsobservationalstudy">Randomized Experiments vs. Observational Study</h2>
<p>If a causal analysis can be performed in a <em>controlled experimental setting</em>, that would be fantastic. I&apos;m not an expert in experimental design, but randomized experiments really help (and make it easier to check the 3 identifiability conditions below). In reality, though, it could be that we are simply collecting a snapshot of data retrospectively. These data would be considered <em>observational data</em>, as they were passively captured. My impression is that if one would like to draw a conclusion from an observational study, he/she should gather a large enough dataset (here comes another question: what do I mean by &quot;large enough&quot;? Well, it depends) with a handful of measured variables. Even then, causal inference is built upon a series of assumptions, so clearly stating the assumptions and limitations of the result is crucial.</p>
<h2 id="identifiabilityconditions">Identifiability Conditions</h2>
<p>The following 3 conditions should be checked prior to conducting a causal analysis:</p>
<ul>
<li>Exchangeability: the subpopulations within different strata should be exchangeable. For instance, had I prescribed the drug to the patients who currently do not receive it, their responses should have been the same as those of the patients who currently get the drug.</li>
<li>Positivity: every treatment level should have a nonzero probability within each covariate subpopulation - in a finite sample, each subpopulation should have at least 1 instance of every treatment. Say that I don&apos;t have any male samples that live in Boston, age &gt; 50, with &gt; 10K annual income and 2 dogs; my study would fail to meet the positivity assumption.</li>
<li>Consistency: the observed outcome should be the same as the counterfactual outcome for every treated/untreated sample. Let&apos;s assume you notice that taking aspirin reduces the risk of heart attack by 5%. This should be consistent across the board - regardless of the brand.
<ul>
<li>I personally find the idea of consistency a bit abstract to understand/explain. The &quot;Causal Inference: What If&quot; book mainly refers to different versions of treatment (e.g. physicians have their own preferences on conducting the same operation), but I still feel that something is missing, at least in my understanding. If you could find a better example to illustrate, let me know!</li>
</ul>
</li>
</ul>
<p>When dealing with observational data, I doubt that all 3 conditions can be satisfied flawlessly. My approach, for now, is to consider them first when tackling a causal problem and attempt to remedy any missing pieces; failing that, I will highlight the unmet conditions and clearly emphasize them as part of the limitations of the work.</p>
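<p>As a small illustration, the positivity condition can at least be screened for in a finite sample with a few lines of Python - the data and stratum labels below are made up:</p>

```python
from collections import defaultdict

def positivity_violations(samples, treatment_levels=(0, 1)):
    """Flag covariate strata that never receive some treatment level.

    samples: iterable of (stratum, treatment) pairs, where stratum is
    any hashable summary of the covariates L (e.g. a tuple).
    Returns a list of (stratum, missing_treatment_level) pairs.
    """
    seen = defaultdict(set)
    for stratum, treatment in samples:
        seen[stratum].add(treatment)
    return [
        (stratum, level)
        for stratum, treatments in seen.items()
        for level in treatment_levels
        if level not in treatments
    ]

# Toy data: the ("male", "Boston") stratum has no treated (A=1) samples.
data = [
    (("male", "Boston"), 0),
    (("male", "Boston"), 0),
    (("female", "Boston"), 0),
    (("female", "Boston"), 1),
]
print(positivity_violations(data))  # [(('male', 'Boston'), 1)]
```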
<h2 id="causaldiagram">Causal Diagram</h2>
<p>It typically looks like this:</p>
<p><img src="https://blog.ruosilin.com/content/images/2020/11/Capture.PNG" alt="Causal_diagram" loading="lazy"><br>
<em>Image source: Causal Inference: What If, Miguel A. Hern&#xE1;n, James M. Robins, February 21, 2020, pp. 69</em></p>
<p>Per <a href="https://en.wikipedia.org/wiki/Causal_model">Wikipedia</a>, &quot;A causal diagram is a directed graph that displays causal relationships between variables in a causal model&quot;. I find it useful to sort out variable relationships, especially when you have a handful of them to analyze. The one displays above shows a simple scenario: that is, we have a set of confounders <code>L</code> that affect both the treatment <code>A</code> and the outcome <code>Y</code>. <code>Y</code> is also impacted by <code>A</code>. Causal diagrams are capable of conveying more information, such as blocking, colliders, etc.</p>
<p>I didn&apos;t get to play with the <a href="https://github.com/microsoft/dowhy">dowhy</a> package from Microsoft for this project, which requires a causal diagram as part of the input. I prefer to translate the problem into several causal graphs after verifying the identifiability conditions and finishing exploratory data analysis on the core variables.</p>
<h2 id="confounderseffectmodifiers">Confounders &amp; Effect Modifiers</h2>
<p>The two concepts seem similar but indeed they&apos;re not. My take on both:</p>
<ul>
<li>Confounders are <u>variables that affect both the treatment and the outcome</u>. Let&apos;s say that I&apos;m interested in this newly approved medicine that claims to reduce pain by 25%. One measured factor could be pregnancy (0/1). If a patient is pregnant, I might play it safe and adjust the treatment dose compared with non-pregnant subjects; pregnancy might also impact the pain level that a patient can tolerate, thus affecting the outcome too.
<ul>
<li>Confounders need to be adjusted for so that we can achieve marginal exchangeability: that is, switching subjects in each subpopulation should yield the same result. If we do not adjust, the conclusion will be biased, as we can no longer say that the treatment is the only variable that causes changes in the outcome (in the graphical representation, that would be a &quot;backdoor path&quot;).</li>
</ul>
</li>
<li>The formal definition of effect modifiers is: <u>an effect modifier <code>V</code> modifies the effect of the treatment <code>A</code> on the outcome <code>Y</code>, when the average causal effect of <code>A</code> on <code>Y</code> varies across levels of <code>V</code></u>. If we find that a heart transplant improves survival more for women than for men, then gender would be considered an effect modifier.
<ul>
<li>I had an in-depth discussion under the comment section of <a href="https://www.youtube.com/watch?v=3L468QctWO4">this video</a>. Back then my mindset was wrong - I thought that effect modifiers == variables that only impact the outcome. That is not correct as I didn&apos;t take the treatment into consideration. Variables that only impact the outcome but not the treatment could be excluded from causal diagrams too, as they won&apos;t create loops.</li>
</ul>
</li>
<li>Effect modifiers could be confounders; back to the previous example - if we took gender into consideration when assigning hearts to patients (e.g. females are 30% more likely than males to receive a heart), then it has an impact on the treatment too, and therefore it will be a confounder. They could be non-confounders as well, if the treatment procedure does not involve them.</li>
<li>Confounders do not have to be effect modifiers. Say that in the heart transplant experiment, patients who are over age 50 are 30% more likely to receive hearts than those under age 50 (i.e. age &gt; 50 affecting the treatment); elderly people may have a shorter life expectancy (i.e. age &gt; 50 affecting the outcome). However, in this study, we do not see a difference in average causal effects between the over- and under-50 age groups. If so, age &gt; 50 will NOT be an effect modifier, but will still be a confounder.</li>
<li>Rule of thumb: adjust for confounders whenever you see one (whether they are effect modifiers or not). Effect modifiers, if they exist, are worth reporting too, but you may need to take some time to understand their meaning before sharing them with key stakeholders.</li>
</ul>
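<p>To make the distinction concrete, here is a toy Python sketch (invented data, additive scale only) that computes the stratum-specific outcome differences whose variation across levels of <code>V</code> signals effect modification:</p>

```python
def stratum_effects(rows):
    """Average outcome difference (A=1 minus A=0) within each level of V.

    rows: list of (v, a, y) triples. Returns {v: mean difference}.
    On the additive scale, V looks like an effect modifier if these
    differences vary across its levels.
    """
    sums = {}  # v -> {a: [total_y, count]}
    for v, a, y in rows:
        sums.setdefault(v, {}).setdefault(a, [0.0, 0])
        sums[v][a][0] += y
        sums[v][a][1] += 1
    effects = {}
    for v, by_a in sums.items():
        mean = {a: total / n for a, (total, n) in by_a.items()}
        effects[v] = mean[1] - mean[0]
    return effects

# Toy heart-transplant data: (gender, treated, survived).
rows = [
    ("F", 1, 1), ("F", 1, 1), ("F", 0, 0), ("F", 0, 1),
    ("M", 1, 1), ("M", 1, 0), ("M", 0, 1), ("M", 0, 0),
]
print(stratum_effects(rows))  # {'F': 0.5, 'M': 0.0}
```

<p>Here gender modifies the causal effect (0.5 for women vs. 0.0 for men); whether it is also a confounder depends on whether it influenced treatment assignment.</p>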
<h2 id="gmethods">G-methods</h2>
<p>G-methods are a collection of techniques for understanding generalized treatment contrasts involving treatments that vary over time. They include IP weighting, standardization, and g-estimation. For this work, we applied <em>IP weighting &amp; standardization</em> to a treatment variable that does not evolve over time (i.e. the treatment was given once and done). I will summarize my understanding and the key steps for both algorithms below, but they may be incomplete for the time-varying treatment case.</p>
<h3 id="ipweighting">IP Weighting</h3>
<p>IP weighting is a mechanism that removes the dependency of the treatment <code>A</code> on the covariates <code>L</code>. It does so by creating &quot;pseudo-populations&quot;: for each combination of covariate group and treatment, there will be some responses in the outcome group, and IP weighting essentially takes that binary tree and splits it into two pseudo-populations, one with everyone untreated and the other with everyone treated. The distributions of <code>L</code> are thus the same in both groups. I find this concept hard to grasp. If you are dealing with a binary, one-time <code>A</code> and a set of <code>L</code>, one parametric way to achieve IP weighting is:</p>
<ol>
<li>Estimate the propensity score model, i.e. <code>P(A=1|L)</code>. My suggestion is to start with well-understood linear models first, e.g. logistic regression.</li>
<li>Estimate weights using output probabilities from the propensity score model. For samples whose <code>A</code> = 1, their weights would be <code>1/p</code>. Those that have <code>A</code> = 0 would have their weights equal to <code>1/(1-p)</code>.</li>
<li>Use the weights above to estimate the outcome model, i.e. <code>P(Y|A)</code>. If using generalized linear models (GLM), you may perform weighted least squares and interpret the coefficient associated with the treatment variable.</li>
</ol>
<p><a href="https://www.stat.berkeley.edu/~census/weight.pdf">This paper</a> suggests that one may try truncating weights. The &quot;truncation&quot; does not mean throwing away data; rather, it&apos;s a method to adjust for weird cases, such as samples that are supposed to have a high likelihood of receiving treatment (<code>P(A=1|L)</code> is high) but ended up not getting treatment (<code>A</code>=0), and vice versa. Or you may compare the weighted average responses over the treatment and the control group (should be equivalent to the coefficient obtained from 3. above). I personally value IP weighting over <a href="https://dimewiki.worldbank.org/wiki/Propensity_Score_Matching">Propensity Score Matching</a>.</p>
<h3 id="standardization">Standardization</h3>
<p>Unlike IP weighting, which approaches from the treatment side, standardization approaches from the outcome <code>Y</code> side. The formula looks like this:</p>
<p><img src="https://blog.ruosilin.com/content/images/2020/11/Capture-1.PNG" alt="standardization" loading="lazy"><br>
<em>Image source: Causal Inference: What If, Miguel A. Hern&#xE1;n, James M. Robins, February 21, 2020, pp. 162</em></p>
<p>I like to think of it as a form of weighted average: we obtain conditional expectations for each treatment (<code>A</code> = 0 or 1) &amp; covariate (<code>L</code>) combination, then marginalize over <code>L</code> (i.e. &quot;standardizing&quot; the expectations). We will be comparing two expectation values here (for binary treatments), one with <code>A</code> = 1 and the other with <code>A</code> = 0. The causal effect estimate is the difference of the two standardized means, <code>&#x3A3;_l E(Y|A=1, L=l)P(L=l) - &#x3A3;_l E(Y|A=0, L=l)P(L=l)</code>.</p>
<p>As for implementation, the <a href="https://github.com/jrfiedler/causal_inference_python_code/blob/master/chapter13.ipynb">code</a> accompanying Hern&#xE1;n &amp; Robins&apos;s book summarizes it nicely:</p>
<blockquote>
<ol>
<li>outcome modeling, on the original data</li>
<li>prediction on the expanded dataset</li>
<li>standardization by averaging</li>
</ol>
</blockquote>
<p>Standardization and IP weighting are mathematically equivalent, so they should produce results that agree with each other. If not, something is off (so-called model misspecification).</p>
<h2 id="thingsiwishtoknowmore">Things I Wish to Know More</h2>
<p>I rushed through the first two parts of <a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/">Causal Inference: What If</a> in less than 3 weeks, so as to finish the project scoping on time. There are terms that I ran into but didn&apos;t dig into. Some of those include <em>doubly robust methods</em> (parametric estimators that are said to remain valid when either the treatment or the outcome model is incorrectly specified), <em>sufficient causes</em> (how does their graphical representation relate to causal diagrams?), and <em>instrumental variable estimation</em> (the authors do not sound like fans of this concept, but I wonder how I might find an ideal set of instrumental variables to estimate from another angle). Moreover, I&apos;m interested in incorporating machine learning (e.g. tree-based ensemble models) into causal inference (such as using a random forest to estimate the conditional mean, <code>E(Y|A)</code>). I believe this is technically doable, just unsure of the implications and interpretations (time to revisit <a href="https://mitpress.mit.edu/books/elements-causal-inference">ECI</a>?). Another general item is to build a knowledge map - at least in my mind. Right now my knowledge feels scattered; though I&apos;m aware of certain terms and concepts, how are they connected? Therefore, I&apos;m taking a step back to read Pearl&apos;s <a href="https://www.wiley.com/en-us/Causal+Inference+in+Statistics%3A+A+Primer-p-9781119186847">classic primer</a>. Brady Neal has put together a fascinating <a href="https://www.bradyneal.com/which-causal-inference-book">flowchart</a> on which book to read, so if you&apos;re new to the field too, check it out!</p>
<h2 id="end">End</h2>
<p>Ah, it took me more than a week to finish this blog post! I didn&apos;t proofread though, so please bear with any grammatical mistakes. For conceptual misalignments, leave me a message &amp; I&apos;ll be more than willing to investigate.</p>
<p>I normally write technical notes after going through major challenges, and this blog post is no exception. My original goal was to lower the entry barrier to this field - it was painful to go through the literature as a complete novice. If anyone finds this interesting/helpful, that&apos;s a great comfort to me. I might write another blog post to share my journey in causal inference (if only... I have the time and am not procrastinating), so stay tuned :)</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[A (series of) Mysterious Flask Error(s)]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The company-wide hack day for 2020 is tomorrow. It&apos;s my first time participating, so I&apos;m a bit excited. We have an awesome idea to implement, which requires a simple Flask web app. Flashback to late 2018, I did write a simple Flask app by following <a href="https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world">this</a></p>]]></description><link>https://blog.ruosilin.com/a-mysterious-flask/</link><guid isPermaLink="false">5f0df7a558c8c90b0b109965</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Tue, 14 Jul 2020 18:25:23 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The company-wide hack day for 2020 is tomorrow. It&apos;s my first time participating, so I&apos;m a bit excited. We have an awesome idea to implement, which requires a simple Flask web app. Flashing back to late 2018: I did write a simple Flask app by following <a href="https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world">this tutorial</a> so that my hubby (or other interested parties) could upload cat photos to my external storage on Cloudinary. It worked, but uploading multiple photos always timed out. (Heroku seems to allow a maximum of 30 seconds of idle time.) Now that I contemplate this, I realize that I could have handled it through a JS <code>async</code> call. But that&apos;s another story. In short, I&apos;d like to reuse some legacy Flask code that I somehow wrote in Python 2.7 (weird, right? Already late 2018, and the tutorial is based on Python 3, but I was still attached to Python 2?) as a boilerplate for my new project. I didn&apos;t imagine it would be that hard, though.</p>
<h2 id="importwithinthesamefoldernolongerworks">Import within the same folder no longer works?</h2>
<p>Below is a snapshot of my folder structure:</p>
<pre><code>+-- app
|   +-- __init__.py
|   +-- config.py
|   +-- ... (Other py files)
+-- migrations
|   +-- (Auto migration files created by SQLAlchemy)
+-- myapp.py
+-- requirements.txt
</code></pre>
<p>Within <code>__init__.py</code>, I have <code>from config import Config</code>. <code>config.py</code> is the configuration file with a <code>Config</code> class object. That&apos;s what the tutorial has, and it worked perfectly fine with Python 2.7.</p>
<p>After installing all the required packages, I specified the Flask app by setting <code>export FLASK_APP=myapp.py</code>, followed by <code>flask run</code>. The following error occurred:</p>
<pre><code>File &quot;.../app/__init__.py&quot;, line 7, in &lt;module&gt;
  from config import Config
ModuleNotFoundError: No module named &apos;config&apos;
</code></pre>
<p>That&apos;s weird. I have a <code>config.py</code> residing in the same folder as <code>__init__.py</code>! How come it&apos;s not found?</p>
<h2 id="googledsolution1outofsightoutofmind">Googled Solution 1: Out of sight, out of mind</h2>
<p>The first search term that came to my mind was &quot;relative import&quot;. (Spoiler alert: this turned out to be NOT related to relative imports.) I quickly googled and found a <a href="https://www.digitalocean.com/community/questions/python-can-t-find-local-files-modules">solution</a>:</p>
<blockquote>
<p>If your config.py file and init.py file on the same path, move config.py file outside of the path and place the same in root.<br>
for example if both are in &#x201C;APP&#x201D; directory, move the config.py and place it inside the root director of &#x201C;APP&#x201D; directory</p>
</blockquote>
<p>Wow, sounds like exactly what I&apos;ve been going through so far! Can&apos;t wait to try it!</p>
<p>After moving <code>config.py</code> to the parent folder (i.e. not within the <code>app</code> folder), I tried to invoke <code>flask run</code> again. This time, I got similar errors when <code>__init__.py</code> imported other Python files within the same <code>app</code> folder. There is no way I could move all files to the parent folder! Something had to be wrong in this case.</p>
<h2 id="googledsolution2addsystempath">Googled Solution 2: Add system path</h2>
<p>I dug a little deeper into Python 3 relative imports and believed that <a href="https://stackoverflow.com/a/44230992">this</a> should be my cure:</p>
<blockquote>
<p>I had a similar problem, I solved it by explicitly adding the file&apos;s directory to the path list:</p>
</blockquote>
<pre><code>import os
import sys

file_dir = os.path.dirname(__file__)
sys.path.append(file_dir)
</code></pre>
<blockquote>
<p>After that, I had no problem importing from the same directory.</p>
</blockquote>
<p>I copied and pasted the code to <code>__init__.py</code> and voila! Not only could <code>config.py</code> be seen, but so could the other files under the same <code>app</code> folder!</p>
<h2 id="butwhataboutthelocaldatabase">... But what about the local database?</h2>
<p>The app has a SQLite local database that contains only one table to store registered user information. I was still not able to run the app due to the following error:</p>
<pre><code>sqlalchemy.exc.InvalidRequestError: Table &apos;user&apos; is already defined for this MetaData instance.  Specify &apos;extend_existing=True&apos; to redefine options and columns on an existing Table object.
</code></pre>
<p>Per <a href="https://github.com/pallets/flask-sqlalchemy/issues/672#issuecomment-478195961">this suggestion</a> I deleted all legacy <code>.pyc</code> files inherited from the repository. The error persisted. I tried <a href="https://stackoverflow.com/a/56100377">this solution</a> and it did an amazing job clearing the path for me. <strong>Until I found out that I was not able to create a new user anymore.</strong> In other words, the local database was corrupted, and I could not upgrade/downgrade it anymore.</p>
<h2 id="correctwaystillaroundfileimports">Correct way: still around file imports.</h2>
<p>I almost gave up after googling and trying different things for about an hour. Intensive coding is fun, but I would like to get the ball rolling ASAP; otherwise it would be better to start from scratch.</p>
<p>I prefer to edit code directly in text editors such as Atom/Notepad++, but I do have IDEs such as PyCharm, which I love to work with when navigating projects with complex code structures. Somehow I noticed that PyCharm highlighted the line <code>from config import Config</code> with red squiggly lines. I removed the whole <code>sys.path.append(file_dir)...</code> block from <code>__init__.py</code> and tried one last thing: <code>from app.config import Config</code>. And... the red squiggly lines went away? Inspired by this finding, I immediately fixed all imports within the <code>app</code> folder by adding &quot;app&quot; in front of the filename. The problem was finally solved!</p>
<h2 id="afterthoughts">Afterthoughts</h2>
<p>I wonder if it is because the <code>app</code> folder contains <code>__init__.py</code>, so Python 3.7 considered it a package, and therefore the &quot;app&quot; prefix is required to import the remaining files within the same folder. I vaguely recall reading a <a href="http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html">note</a> on how frustrating the Python import system can be. I probably should read it again.</p>
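<p>To test this theory, here is a small, self-contained reproduction of the trap (the folder and class names mirror my project but are otherwise made up):</p>

```python
import importlib
import os
import sys
import tempfile

# Recreate the structure: a project root containing an "app" package.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "app"))
with open(os.path.join(root, "app", "config.py"), "w") as f:
    f.write("class Config:\n    SECRET_KEY = 'dev'\n")
# The absolute import that fixed my app; a bare "from config import Config"
# here would raise ModuleNotFoundError under Python 3.
with open(os.path.join(root, "app", "__init__.py"), "w") as f:
    f.write("from app.config import Config\n")

sys.path.insert(0, root)            # make the project root importable
app_pkg = importlib.import_module("app")
print(app_pkg.Config.SECRET_KEY)    # the package-prefixed import works
```
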
<h2 id="tldr">TL;DR</h2>
<ul>
<li>A legacy Flask app could not run after migrating to another machine. It first complained that modules within the same folder could not be found. Using a Googled trick, I messed up the local SQLite database.</li>
<li>It&apos;s an issue with the Python import system.</li>
<li>I hate database migrations. (One main issue the team always ran into when I was a back-end intern for a web app)</li>
<li>Googled solutions do not always work; try at your own risk, but they may inspire you to find the right way.</li>
</ul>
<p>And I hope I could bring some good news next time! Looks like only after dealing with a tricky tech problem do I have the impulse to update here...</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[My (Not-so-Tough?) First-try with Docker]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h1 id="foreword">Foreword</h1>
<p>Learning Docker has been on my to-do list for a while. When I say &quot;a while&quot;, I mean it. (Circa 2017, or even earlier?)</p>
<p>I won&apos;t list the advantages of containerization here (as Google can tell you more), but I&apos;m a practical learner.</p>]]></description><link>https://blog.ruosilin.com/my-not-so-tough-first-try-with-docker/</link><guid isPermaLink="false">5eaf0a9f04d3b509de1802da</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Mon, 04 May 2020 01:15:04 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="foreword">Foreword</h1>
<p>Learning Docker has been on my to-do list for a while. When I say &quot;a while&quot;, I mean it. (Circa 2017, or even earlier?)</p>
<p>I won&apos;t list the advantages of containerization here (as Google can tell you more), but I&apos;m a practical learner, which means I learn things by actually doing them. Back in 2017, I started this <a href="https://roselin.me/project/telegram-cat-bot/">Telegram catbot</a> project to share the love of my two cats with the world (side goals: improve my Python coding skills/learn Git). It was originally deployed on Heroku; given that I was on the free &quot;development&quot; tier and the popular demand (mostly from my friends), my &quot;dyno hours&quot; (aka available resources) were quickly exhausted every month. I had to shut down the project last year so that my other projects could run normally on Heroku. When I did the portfolio refresh this year, I realized that a VPS is an ideal place to host my projects; it gives me more freedom than Heroku does (plus, I&apos;ve paid for it every month already). So which one should I migrate first? Catbot, my first and foremost (serious) project, became the top choice.</p>
<h1 id="dockerizing">Dockerizing...</h1>
<p>I won&apos;t spend too much time describing the effort it took to reactivate the catbot project itself, as it&apos;s not too relevant. I do want to mention, though, that the bot was written in mid-2017 with Python 2.7. (I don&apos;t know what I was thinking back then, either - I didn&apos;t fully embrace Python 3 until early 2018) Also, the <a href="https://github.com/python-telegram-bot/python-telegram-bot"><code>Python-telegram-bot</code></a> framework recently received a major update on callbacks (now context-aware). Heroku only supports PostgreSQL, but thanks to Ghost my server already has MySQL nicely set up. Moreover, Heroku recycles my applications once per day, but that won&apos;t be the case on my server, so I needed to set up the job correctly for the daily push to users. It took me a day to fix all these nuances, and I had the bot up and running in local mode!</p>
<p>I&apos;m now ready to embrace Docker. However, I have no experience other than going through the official Docker starter doc once. That did not leave much of an impression, as I only vaguely remember terms such as &quot;images&quot;, &quot;hub&quot;, etc. What should I do?</p>
<blockquote>
<p>I&apos;d like to give credit to people on Twitter, though: below is a list of useful resources provided by them:<br>
<a href="https://pythonspeed.com/docker/">Production-ready Docker packaging</a><br>
<a href="https://yeasy.gitbook.io/docker_practice/">Docker - From Beginner to Practice</a> (Note: zh-cn only)<br>
Also Google! :)</p>
</blockquote>
<p>Somehow the two terms &quot;yaml&quot; and &quot;Dockerfile&quot; came to my mind first. What do they mean? Do I need both to run my bot? In a nutshell, the bot depends on a Python script, which communicates with a local database that persists user information. (Since version 12, Python-telegram-bot natively supports persistence, but the only example is about persisting conversation handlers, which is irrelevant to my case. I browsed some other posts online, and they all suggested sticking with databases for persistence, so I did not break the tie with databases.) Besides, the bot talks to <a href="https://cloudinary.com">Cloudinary</a> and needs a Telegram bot token passed from the environment. Thus, I need to take all API secrets into consideration as well.</p>
<p>I chose to start with &quot;yaml&quot;, i.e. the Docker Compose file. :)</p>
<h2 id="challenge1cannotrundockercomposeup">Challenge 1: cannot run <code>docker-compose up</code></h2>
<pre><code>ERROR: Version in &quot;./docker-compose.yml&quot; is unsupported. You might be seeing this error because you&apos;re using the wrong Compose file version. Either specify a version of &quot;2&quot; (or &quot;2.0&quot;) and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/

</code></pre>
<p>I did not have <code>docker-compose</code> installed on my sandbox originally, so I followed the command prompt as it suggested grabbing the app through <code>apt-get</code>. Then I tried <code>docker-compose up</code> again, and it returned the error shown above. I could change the version from 3 to 2 and make it work, but I wondered why. I have Docker version 19.03.8, which is supposed to support version 3. <a href="https://github.com/bigbluebutton/greenlight/issues/228#issuecomment-545919537">This comment</a> helped me out; it seems that I didn&apos;t get <code>docker-compose</code> installed correctly through <code>apt-get</code>.</p>
<h2 id="challenge2mycontainerexitsitself">Challenge 2: ... my container exits itself?</h2>
<p>Yes, it always quit with exit code 0. I was following the example <a href="https://www.techrepublic.com/article/how-to-build-a-docker-compose-file/">here</a>, but my yaml file looked like this:</p>
<pre><code>version: &quot;3.0&quot;

services:
  catbot-container:
    image: python:3.6-slim-buster
    env_file: env.list
</code></pre>
<p>... it just exited. Only then did I realize that I had started an empty container with nothing in it. Thus it gracefully exited. Since my app is a simple one and does not require multi-container interactions, I decided to give a Dockerfile a try.</p>
<h2 id="challenge3ireallydontknowhowdockerworks">Challenge 3: I really don&apos;t know how Docker works.</h2>
<p>I adopted the &quot;somewhat better image&quot; example in <a href="https://pythonspeed.com/articles/dockerizing-python-is-hard/">this post</a> as a starting point and made some modifications. Of course, I did not get it right the first try. After trials and errors, I believe the groundwork is finally done. What&apos;s next?</p>
<p>Well, I then managed to run <code>docker build</code>. Looks like this command somehow turns my configuration into an image that is replicable across different machines.</p>
<p>The build succeeded. What does that mean? After a couple of hours of googling, I found that I should run <code>docker run</code>. There are some flags that I have to be mindful of.</p>
<ul>
<li>I&apos;d like the container to appear to be running on the host itself (from the perspective of the network) so it can seamlessly connect to the localhost database, so I added <code>--net=host</code></li>
<li>There are too many secrets to pass in a single run command, so they&apos;re loaded from a file called <code>env.list</code>: <code>--env-file=env.list</code></li>
<li>My app will not quit by itself easily. I would prefer it to run silently in the background while I&apos;m working on some other stuff. Thus, <code>--detach</code>.</li>
<li>Even after I reboot my server, the container should come back to life ASAP. I updated the restart status afterward, but could have done so in the run command by adding <code>--restart=unless-stopped</code></li>
</ul>
<p>This seems like a long run command to me, but I&apos;m pretty sure there are longer ones out there (maybe that&apos;s why people eventually choose Compose to handle all of this?). After running <code>docker run --detach --env-file=env.list --name NAME --net=host BUILD</code>, I checked the status of the container and boom! Everything is up and running.</p>
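<p>For the record, I believe the same flags could be expressed in a Compose file roughly like this (a sketch I have not tested against my actual setup; <code>docker-compose up -d</code> would then replace the long run command, with detaching handled by the <code>-d</code> flag):</p>

```yaml
version: "3.0"
services:
  catbot:
    build: .                   # build from the Dockerfile instead of a bare image
    env_file: env.list         # --env-file=env.list
    network_mode: host         # --net=host
    restart: unless-stopped    # --restart=unless-stopped
```
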
<h1 id="now">Now</h1>
<p>Sometimes, a picture is worth more than a thousand words:</p>
<p><img src="https://lh3.googleusercontent.com/proxy/WlJRKxefnqM6OmkWj0IujkDSzvLDwO2wssTW_SvSbrhgJaGWhZVlzeJcfh7ygdQ8JW5kE3H2iUYMIXP1WSov3vb3R0T7osEBP8iur6jxfM0XAzo70DX4PDCtKO8V0CTAmMrd6IAvtmsv2yNsO8eWqV51DX0vg3C_UZE" alt="image" loading="lazy"></p>
<p>I tried restarting my server a couple of times, and the cat bot is still up and running :O. You are more than welcome to test it out <a href="https://t.me/daliycatie_bot">here</a> (a valid Telegram account is required). Based on the logic, the daily push is supposed to be functioning. We will see how it goes tomorrow.</p>
<p>The original title was &quot;My (Tough) First-try with Docker&quot;. I was still unclear about Docker Compose and Dockerfiles back then. I feel much better now after actually building an image and starting a container myself. It only took me half a day (less than 4 hours)! I can see it coming in handy for reproducible machine learning projects (passing my beautifully boxed DL code to other data scientists) without the hassle of manual setup. Also, I love the restart option - no more tmux sessions gone after a system restart!!</p>
<p>Hope you enjoy this little post (with grumbles) and stay safe during this uncertain time. It is said that staying at home is an ideal time to pick up new skills; but still, physical health comes first. Working on resurrecting legacy projects does bring joy to me during this remorseful time, though.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Why My Saved Model Is Not Working as Expected?]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Update 2/6: After reporting our findings back to the client, we found out that the dataset we used was biased. It was not the typical &quot;target = 1&quot; imbalance; rather, the training data did not reflect reality well. Therefore the model suffered, too. &quot;Garbage in, garbage out.&quot;</p>]]></description><link>https://blog.ruosilin.com/why-my-saved-model-is-not-working-as-expected/</link><guid isPermaLink="false">5e1d0a9ef870380a0ce90e01</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Tue, 14 Jan 2020 02:06:06 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Update 2/6: After reporting our findings back to the client, we found out that the dataset we used was biased. It was not the typical &quot;target = 1&quot; imbalance; rather, the training data did not reflect reality well. Therefore the model suffered, too. &quot;Garbage in, garbage out.&quot; The team has learned the lesson, and the next dataset will be holistic and representative.</p>
<p>Well - the title pretty much summarizes what I have been doing in the past week.</p>
<p>Long story short:</p>
<ul>
<li>there was a business problem to solve. (Due to intellectual property rules, cannot say more beyond this. Think of it as a binary classification problem.)</li>
<li>We (Data Scientists) came up with a solution (a model for sure).</li>
<li>Before deploying the model into production, we tested it out for a certain period and collected the results.</li>
<li>Due to the time lags in some additional processing, the results were not available until last week.</li>
<li>We were then shocked by the news: for the first three groups that were supposed to have a precision of 40%, we only got 9%.</li>
</ul>
<h2 id="wtf">WTF??</h2>
<p>Seriously, that was my first thought. Moreover, we (data scientists) did not see any metrics other than the summarized statistics (basically a statement on a slide: for the top 3 groups, the precision is 9%. Model precision: 40%). Though the training set we used was not the most recent, we certainly did not expect to see such a dramatic difference. (Around 5% fluctuation is understandable.) But we had told the engineers that they should start implementing this model in two weeks! We had already missed the deadline once due to the developers&apos; limited capacity. If we waited again, it would be another three months. Plus, once our managers knew, they would need a model diagnostics writeup.</p>
<h2 id="modeldiagnostics">Model Diagnostics</h2>
<p>To save our butts, I reached out to the analyst who collected the testing results and calculated the performance figure. She sent me back a spreadsheet with unique identifiers (the objects we are predicting) and some attributes. Without firing up any weapons from my toolkit, I spotted the first problem: <strong>the attributes reported in the validation report do not match the target requirement</strong>.</p>
<p>For instance, let&apos;s say that we would like to build a model to predict whether a given day would be ideal for a day trip. For a day to be qualified as &quot;ideal&quot;, it must meet the following two criteria:</p>
<ul>
<li>The expected $ I spent would be less than $500</li>
<li>The weather between 10 am to 12 pm is not rainy</li>
</ul>
<p>But the report only contains information such as &quot;weather between 8 am to 11 am&quot;, and &quot;the expected $ the team spent&quot;. These two pieces of information are then put together and treated as the target. You can&apos;t say there is no value in recording the two attributes; in fact, there is an overlap. However, this is not what we agreed upon. When gauging the model performance, we need to refer back to the model definition and capture exactly the needed features. Otherwise, it&apos;s not an apples-to-apples comparison.</p>
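<p>In the toy example, the agreed-upon target would be computed directly from the two criteria (the column names below are invented for illustration):</p>

```python
import pandas as pd

# Three candidate days; only the first meets both criteria.
days = pd.DataFrame({
    "expected_spend": [300, 700, 450],
    "rain_10am_to_12pm": [False, False, True],
})
days["ideal_day"] = (days["expected_spend"] < 500) & ~days["rain_10am_to_12pm"]
```
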
<p>Another problem lies in the feature distributions. I compared how the distributions of selected features differ across the training set, the testing set, and (essentially) the holdout testing set. Very few features have similar distributions. A machine learning model learns what was presented to it during training. This is not even a generalization problem - over time, some features may have drifted, and that is a call for more recent data. Consider Twitter spam behaviors: if the model was trained using 2014 data, of course it will fail to find 2020 spammers. There is no guarantee that features in our model have constant distributions over time.</p>
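<p>One quick way to run this kind of comparison is a two-sample Kolmogorov-Smirnov test per feature (a sketch of the idea, not the exact code I used; the names are hypothetical):</p>

```python
from scipy.stats import ks_2samp

def drifted_features(train_df, holdout_df, features, alpha=0.01):
    # Flag features whose train vs. holdout distributions differ significantly.
    drifted = []
    for col in features:
        stat, p = ks_2samp(train_df[col].dropna(), holdout_df[col].dropna())
        if p < alpha:
            drifted.append((col, stat))
    return drifted
```
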
<p>The third problem is the most interesting one, and it deserves a section itself.</p>
<h2 id="becarefulwithpipelines">Be careful with Pipelines...</h2>
<p>When I trained a model on the server and immediately used it to predict on the training dataset (caution: THIS PRACTICE IS NOT ENCOURAGED), each instance got an assigned probability. I then saved the model to my local workstation, loaded the same training dataset, and used the saved model to generate predictions. The probabilities changed for the same instances! Why? I had the seed set, and the model configuration (i.e. the parameters of the model) is what I need. The way I calculated model performance metrics (accuracy, precision, etc.) was correct. The model pickle file was not contaminated with random noise. Could it be because of the different environment setups? I was under the impression that a trained ML model should always spit out the same output given the same input! After spending a whole morning debugging (in a Jupyter Notebook), I was pulling my hair out, crying.</p>
<p>At first glance, this seemed to be a problem with recent <code>scikit-learn</code> updates. My local machine was running 0.22.1 while the server had 0.21.3. It must have been a major update, because my local machine refused to load the saved pickle file at all. For compatibility, I had to downgrade <code>scikit-learn</code> on my end. Now the local environment loaded the saved pickle file correctly, but the predictions were still all over the place!</p>
<p>Side note: instead of the model (e.g. SVM, GBM, logistic regression, etc.), I saved the whole scikit-learn <code>Pipeline</code> object. The pipeline contains some preprocessing steps, such as handling missing values, but these steps are the bare minimum. Before training the model, I had to take an extra step to clean the training data, such as removing dots and dollar signs, binning values into different groups, keeping the top <code>n</code> values, etc. These preprocessing steps were considered &quot;data transformations&quot; and thus not included in the model pipeline.</p>
<p>... You probably have an idea of what went wrong after reading my notes above. And that&apos;s exactly the problem - when I unpickled the saved pipeline object locally, I fed it the raw training data directly. <strong>The data transformation step was omitted completely at inference time</strong>. During the training phase, the model saw a feature split nicely into 10 bins; at inference time, even though the input data was the same, the missing data transformation step meant the model had to deal with the raw data instead of the transformed 10 bins. After adding all required data transformations, the saved model pickle returns the same result for a given instance.</p>
<p>So what could I do?</p>
<ol>
<li>Fix the Pipeline and pickle again, so that the saved pickle contains the required data transformation. In production, the model handles raw data naturally.</li>
<li>Have the engineers transform the data first before sending it to our model.</li>
</ol>
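<p>As a sketch of option 1, the cleaning step can be folded into the saved <code>Pipeline</code> via scikit-learn&apos;s <code>FunctionTransformer</code>, so the pickled object handles raw data end to end. Here <code>clean_raw</code> is a placeholder for my actual transformations, and the model choice is arbitrary:</p>

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def clean_raw(X):
    # Placeholder for the real cleaning: strip symbols, bin values, etc.
    return X

pipe = Pipeline([
    ("clean", FunctionTransformer(clean_raw)),   # the step I had left outside
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression()),
])
# Pickling `pipe` after fitting now preserves the full raw-to-prediction path.
```
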
<p>Option 2 is preferred by our developers, as they have some clever procedures to tweak data transformations in real time. So all we need to do now is apply the transformation on the holdout beta-test data and collect model outputs. Since the field test has already concluded, with the correct target information we can measure how well the model is performing.</p>
<h2 id="insummary">In summary</h2>
<p>Don&apos;t be afraid if your model performs poorly on unseen data. Take steps to debug it (which, I have to say, is different from debugging code. I miss VS debuggers!):</p>
<ul>
<li>If you are not the one who collects test results, ask the relevant party to provide the details: specifically, what attributes were captured and how the performance metrics were calculated.</li>
<li>Compare against your training data to see if attribute distributions change over time (think of Twitter spamming behaviors - spammers change frequently to evade rules).</li>
<li>If data pipeline/preprocessing is involved, check if raw data was exposed to the model directly, with any critical preprocessing steps missing.</li>
<li>Keep asking questions! (Even under tight deadlines)</li>
</ul>
<p>I should have taken a closer look at the data transformation step earlier, but I naively thought it was handled by the sklearn Pipeline already. It was not until I dissected the pickle file that I realized how the data transformation came into play. Also: <em>don&apos;t feed your trained model the training data again</em>. It&apos;s not a good way to judge model performance... overfitting is inevitable unless you know what you&apos;re doing (in this case, I do). Happy coding! Hopefully I can update more frequently in 2020; I do feel that mistakes are my main driving force for writing blog posts nowadays&#x1F602;&#x1F602;&#x1F602;</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Feature Engineering for Machine Learning - Chapter 2 note]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h1 id="featureengineeringformachinelearning">Feature Engineering for Machine Learning</h1>
<h2 id="ch2fancytrickswithsimplenumbers">Ch.2: Fancy Tricks with Simple Numbers</h2>
<ul>
<li>Most ML algorithms can only take numerical inputs. Feature engineering on numeric features: basic &amp; critical.</li>
<li>First sanity check: magnitude. Would a coarse granularity work instead of the actual value?</li>
<li>Tricks to deal with counts, which may grow</li></ul>]]></description><link>https://blog.ruosilin.com/feature-engineering-for-machine-learning-chapter-2-note/</link><guid isPermaLink="false">5d64912aecf5420866118e41</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Tue, 27 Aug 2019 02:11:29 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="featureengineeringformachinelearning">Feature Engineering for Machine Learning</h1>
<h2 id="ch2fancytrickswithsimplenumbers">Ch.2: Fancy Tricks with Simple Numbers</h2>
<ul>
<li>Most ML algorithms can only take numerical inputs. Feature engineering on numeric features: basic &amp; critical.</li>
<li>First sanity check: magnitude. Would a coarse granularity work instead of the actual value?</li>
<li>Tricks to deal with counts, which may grow without limitation:
<ul>
<li>Binarization: turns the problem into 0/1, occurred/not occurred.</li>
<li>Binning: useful when the distribution is heavy-tailed
<ul>
<li>Fixed-width binning: easy to compute, but may end up with multiple empty bins. Widths can come from 1) heuristics (e.g. age groups), 2) taking the log for exponential-width bins, or 3) dividing by a constant for linear-width bins.</li>
<li>Quantile binning: adaptively positions the bins based on the distribution of the data. Quartiles, deciles, etc.</li>
</ul>
</li>
</ul>
</li>
<li>Log transformation: effective for heavy-tailed distribution. &quot;It compresses the long tail in the high end of the distribution into a shorter tail, and expands the low end into a longer head.&quot;
<ul>
<li>It&apos;s not a one-shot, cure-all technique though. Need to inspect the relationship between the transformed feature and the target to see if the model assumption holds.</li>
</ul>
</li>
<li>Power transformation: extension of log transformation
<ul>
<li>AKA &quot;variance-stabilizing transformations&quot;: e.g. for Poisson-distributed data, a power transformation removes the dependency between variance &amp; mean.</li>
<li>Box-Cox: a simple generalization of both the square root transform and the log transform. Only works for positive data.</li>
</ul>
</li>
<li>Feature scaling: <strong>doesn&#x2019;t change the shape of the distribution</strong>; only the scale of the data changes. Useful where a set of input features differ wildly in scale. For certain models, drastically varying scales in input features may lead to numeric instability.
<ul>
<li>Min-Max Scaling: squeezes (or stretches) all feature values to be within the range of [0, 1]</li>
<li>Standardization (Variance Scaling): scaled feature would have mean = 0 and variance = 1</li>
<li>L2 normalization: outputs will have norm 1. Not necessarily in the feature space; could be in the data space as well.</li>
</ul>
</li>
<li>Don&apos;t let scaling turn sparse data into dense vectors (e.g. by subtracting the mean) - a huge computation burden on the classifier!</li>
<li>Complex features, e.g. interactions (products of two features)
<ul>
<li>No need to curate these manually for decision tree-based models; they can help generalized linear models (to capture logical AND relationships)</li>
<li>Easy to compute, but increases computational complexity (the model now has to consider second-order features as well)</li>
<li>To account for the computational expense of higher-order interaction features: feature selection &amp; handcrafted small sets. Both have their advantages &amp; drawbacks.</li>
</ul>
</li>
<li>Feature selection: prunes away non-useful features in order to reduce the complexity of the resulting model.
<ul>
<li>Filtering: preprocess features to remove ones that are unlikely to be useful for the model. Computationally cheaper than the wrapper methods, but lack consideration for the underlying model. (May not select the right features for the model)</li>
<li>Wrapper methods: essentially train on a subset of features and gradually grow it. Expensive, but no upfront pruning. &quot;The wrapper method treats the model as a black box that provides a quality score of a proposed subset for features. There is a separate method that iteratively refines the subset.&quot;</li>
<li>Embedded methods: feature selection is included as part of the model training process. Less powerful than wrapper methods, but not as computationally expensive; better than filtering as it relies on the underlying model. A balance between computational expense &amp; quality of results.</li>
</ul>
</li>
</ul>
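<p>To make a few of these notes concrete, here is a small numpy sketch on synthetic heavy-tailed data (my own toy example, not from the book):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.pareto(2.0, size=1000) * 100  # synthetic heavy-tailed "counts"

# Log transformation: compresses the long right tail
logged = np.log1p(counts)

# Quantile binning: bin edges adapt to the distribution (quartiles here)
edges = np.quantile(counts, [0.25, 0.5, 0.75])
binned = np.digitize(counts, edges)  # bin ids 0..3

# Min-max scaling: squeeze values into [0, 1]
minmax = (logged - logged.min()) / (logged.max() - logged.min())

# Standardization: mean 0, variance 1; the shape of the distribution
# doesn't change, only the scale
standardized = (logged - logged.mean()) / logged.std()
```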
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Make a fair coin from a biased one, and more...]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I read about this question online today and found it interesting [<a href="https://www.geeksforgeeks.org/print-0-and-1-with-50-probability/">source</a>]:</p>
<blockquote>
<p>You are given a function foo() that represents a biased coin. When foo() is called, it returns 0 with 60% probability, and 1 with 40% probability. Write a new function that returns 0 and 1 with 50% probability</p></blockquote>]]></description><link>https://blog.ruosilin.com/make-a-fair-coin-from-a-biased-one-and-more/</link><guid isPermaLink="false">5d586fb6ecf5420866118dbc</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Sat, 17 Aug 2019 22:06:43 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>I read about this question online today and found it interesting [<a href="https://www.geeksforgeeks.org/print-0-and-1-with-50-probability/">source</a>]:</p>
<blockquote>
<p>You are given a function foo() that represents a biased coin. When foo() is called, it returns 0 with 60% probability, and 1 with 40% probability. Write a new function that returns 0 and 1 with 50% probability each. Your function should use only foo(), no other library method.</p>
</blockquote>
<p>You can easily find the solution in the link above, but I approached it the wrong way: my first attempt was to return 0 if two tosses give (1,0) or (0,1), and 1 otherwise. Apparently the math does not add up here: <code>P(1,0) + P(0,1) = 0.48</code>, not 0.5. Later I learned that John von Neumann proposed an elegant <a href="https://en.wikipedia.org/wiki/Fair_coin#Fair_results_from_a_biased_coin">algorithm</a> to solve it (though not 100% efficient). Naturally, I asked myself: can I use the same unfair coin to generate 1, 2 and 3 with equal probability (i.e. each with chance 1/3)?</p>
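<p>A minimal sketch of the von Neumann procedure; <code>foo()</code> below is just a simulated stand-in for the biased coin:</p>

```python
import random

def foo():
    """Simulated stand-in for the biased coin: 0 with 60% chance, 1 with 40%."""
    return 0 if random.random() < 0.6 else 1

def fair():
    """von Neumann's trick: toss twice and keep only unequal pairs.
    P(0,1) = P(1,0) = 0.6 * 0.4, so returning the first bit is fair;
    equal pairs (0,0)/(1,1) are discarded and we toss again."""
    while True:
        a, b = foo(), foo()
        if a != b:
            return a

ones = sum(fair() for _ in range(100_000))
```

<p>On average it needs about 1 / (2 &#xB7; 0.24) &#x2248; 2.1 pairs of tosses per output bit - hence &quot;not 100% efficient&quot;.</p>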
<p>... I took the wrong path again, trying to group all combinations from tossing the coin 3 times into 3 buckets: (000, 111) -&gt; discard &amp; toss again; {1 one, 2 zeros}; and {2 ones, 1 zero}. Unfortunately <code>P(1 one, 2 zeros) != P(2 ones, 1 zero)</code>, plus I can&apos;t get 3 numbers out of this grouping mechanism. Someone on Twitter proposed the following:</p>
<blockquote>
<p>Generate 3 bits. Return 0 if 100/011, 1 if 010/101, 2 if 001/110, start over if 000/111.</p>
</blockquote>
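<p>The Twitter suggestion translates almost directly into code (again with a simulated biased coin):</p>

```python
import random

def foo():
    """Simulated biased coin: 0 with 60% probability, 1 with 40%."""
    return 0 if random.random() < 0.6 else 1

# Each label pairs one pattern with its bitwise complement, so every label
# has the same probability p(1-p)^2 + p^2(1-p); 000 and 111 mean re-toss.
GROUPS = {(1, 0, 0): 0, (0, 1, 1): 0,
          (0, 1, 0): 1, (1, 0, 1): 1,
          (0, 0, 1): 2, (1, 1, 0): 2}

def fair_three():
    while True:
        bits = (foo(), foo(), foo())
        if bits in GROUPS:
            return GROUPS[bits]

counts = [0, 0, 0]
for _ in range(60_000):
    counts[fair_three()] += 1
```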
<p>Inspired by the 3-outcome case, I think the following should work if I were to use the same coin to generate 1-5 with equal probabilities:</p>
<blockquote>
<p>Generate 5 bits. There will be 2^5 = 32 combinations in total. Occurrences of {1 one, 4 zeros} = {4 ones, 1 zero} = 5, whereas {2 ones, 3 zeros} = {3 ones, 2 zeros} = 10. Return each number by exhausting the following sequence: (1 combination from {1 one, 4 zeros} , 1 combination from {4 ones, 1 zero}, 2 combinations from {2 ones, 3 zeros}, 2 combinations from {3 ones, 2 zeros}). Repeat the process if either {00000, 11111} is obtained.</p>
</blockquote>
<p>Note: the above process generalizes to other odd primes. Non-prime odd numbers fail: for such <code>d</code> there exists <code>2k &lt; d</code> such that <code>d</code> cannot divide <code>C(d, 2k)</code> without a remainder, e.g. <code>d=9, k=3</code>.</p>
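<p>A sketch of that generalization for an odd prime <code>d</code> (the helper name is mine): within each Hamming-weight class all patterns are equally likely, so chunking each class into <code>d</code> equal parts gives <code>d</code> equally likely labels.</p>

```python
from itertools import product

def make_groups(d):
    """Map d-bit patterns (minus all-0s/all-1s) to labels 0..d-1.
    Assumes d is an odd prime, so d divides C(d, w) for every 0 < w < d
    and each weight class splits into d chunks of equal size."""
    by_weight = {}
    for bits in product((0, 1), repeat=d):
        w = sum(bits)
        if 0 < w < d:
            by_weight.setdefault(w, []).append(bits)
    mapping = {}
    for pats in by_weight.values():
        chunk = len(pats) // d
        for i, p in enumerate(sorted(pats)):
            mapping[p] = i // chunk
    return mapping
```

<p>Drawing a number is then the same loop as before: toss <code>d</code> bits, look the tuple up, and re-toss whenever the pattern is not in the mapping.</p>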
<p>It&apos;s entertaining to think through this kind of probability question sometimes. Happy weekend!</p>
<!--kg-card-end: markdown--><p></p>]]></content:encoded></item><item><title><![CDATA[Solutions for Python Challenge]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>You may prefer tackling the challenge by yourself first before referring to my solutions &#x1F600;. Off you go: <a href="http://www.pythonchallenge.com/">Python Challenge</a></p>
<h1 id="challenge0">Challenge 0</h1>
<p>Just get <code>2^38</code> and substitute the result into the URL.</p>
<h1 id="challenge1">Challenge 1</h1>
<p>Code:</p>
<pre><code>s = &quot;g fmnc wms bgblr rpylqjyrc gr zw fylb. rfyrq ufyr amknsrcpq ypc</code></pre>]]></description><link>https://blog.ruosilin.com/solutions-for-python-challenge/</link><guid isPermaLink="false">5d0c30ceecf5420866118d70</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Fri, 21 Jun 2019 03:54:28 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>You may prefer tackling the challenge by yourself first before referring to my solutions &#x1F600;. Off you go: <a href="http://www.pythonchallenge.com/">Python Challenge</a></p>
<h1 id="challenge0">Challenge 0</h1>
<p>Just get <code>2^38</code> and substitute the result into the URL.</p>
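<p>Or, in the spirit of the challenge:</p>

```python
print(2 ** 38)  # 274877906944
```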
<h1 id="challenge1">Challenge 1</h1>
<p>Code:</p>
<pre><code>s = &quot;g fmnc wms bgblr rpylqjyrc gr zw fylb. rfyrq ufyr amknsrcpq ypc dmp. bmgle gr gl zw fylb gq glcddgagclr ylb rfyr&apos;q ufw rfgq rcvr gq qm jmle. sqgle qrpgle.kyicrpylq() gq pcamkkclbcb. lmu ynnjw ml rfc spj.&quot;
#s = &quot;map&quot;
translated = &quot;&quot;
for c in s:
    if c.isalpha() and c &lt; &apos;y&apos;:  # a-x: shift forward by 2
        translated += chr(ord(c) + 2)
    elif c.isalpha():  # y and z wrap around to a and b
        translated += chr(ord(c) - 24)
    else:
        translated += c
print(translated)
<p>It&apos;s a simple mapping - every letter in the encrypted text should be shifted by 2 (hence we get &apos;m&apos; for &apos;k&apos;, &apos;o&apos; for &apos;m&apos;, etc.). Notice the special case: for the letters <code>y</code> and <code>z</code> you can&apos;t just follow the ASCII table, as it would return non-letters. Basically &apos;y&apos; maps to &apos;a&apos; and &apos;z&apos; to &apos;b&apos;, so special handling is required. Once you decode the helper text, the next part is just applying it to the URL.</p>
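<p>The decoded hint itself recommends <code>string.maketrans()</code>; with Python 3&apos;s <code>str.maketrans</code> the whole shift (wrap-around included) becomes one translation table:</p>

```python
import string

# Shift every lowercase letter forward by 2, wrapping y->a and z->b.
shifted = string.ascii_lowercase[2:] + string.ascii_lowercase[:2]
table = str.maketrans(string.ascii_lowercase, shifted)
answer = "map".translate(table)  # the word to substitute into the URL
```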
<h1 id="challenge2">Challenge 2</h1>
<p>Code:</p>
<pre><code>import string
s = &quot;&quot;&quot; 
(the text you get from the source)
&quot;&quot;&quot;
translated = &quot;&quot;

for c in s:
  if c not in string.punctuation and c != &apos;\n&apos;:
    translated += c
print(translated)
</code></pre>
<p>I have to admit that this is not the MOST elegant solution (maybe regex could handle it better, but I&apos;m not a fan of regex, after all). When inspecting the page source you should be able to find a big chunk of text that&apos;s mostly composed of punctuation marks. You can then filter it using <code>string.punctuation</code>.</p>
<h1 id="challenge3">Challenge 3</h1>
<p>Code: (not using regex here)</p>
<pre><code>s = &quot;&quot;&quot;(the big string in the source code)&quot;&quot;&quot;
s = s.replace(&apos;\n&apos;, &apos;&apos;).replace(&apos;\r&apos;, &apos;&apos;)

res = &quot;&quot;
for i in range(len(s)-8):
  current = s[i:i+9]
  if current[0].islower() and current[1:4].isupper() and current[5:8].isupper() and current[-1].islower() and current[4].islower():
    res += current[4]
print(res)
</code></pre>
<p>I know, I know very well that the solution here is ugly. BUT IT WORKS! Basically you should be looking for patterns like <code>xXXXaXXXx</code>, where <code>a</code> is the desired letter. The hint is a bit elusive this time, but thanks to <a href="https://groups.google.com/forum/#!topic/python-challenge/QK82750x3GU">this post</a> I finally got it to work (not in a nice way).</p>
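<p>Since the solution above deliberately avoids regex, here is a sketch of the regex route: fixed-width lookarounds keep adjacent matches from consuming each other. The sample string here is a toy stand-in for the real page source:</p>

```python
import re

# One lowercase letter guarded by exactly three uppercase letters on each
# side, which are in turn bounded by lowercase letters (xXXXaXXXx).
pattern = r"(?<=[a-z][A-Z]{3})[a-z](?=[A-Z]{3}[a-z])"

sample = "aBCDeFGHi jKLMnOPQr"  # toy stand-in
found = "".join(re.findall(pattern, sample))
```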
<h1 id="challenge4">Challenge 4</h1>
<pre><code>import urllib3

http = urllib3.PoolManager()

path = &apos;http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=&apos;
num = &quot;12345&quot;
res = [num]

for i in range(400):
    r = http.request(&apos;GET&apos;, path+num)
    if r.status == 200:
        num = r.data.decode().split()[-1]  # r.data is bytes in Python 3
        res.append(num)

print(res)
</code></pre>
<p>The hint is buried in the back again: you don&apos;t need to loop more than 400 times. The answer manifests itself at index 357 (am I disclosing too much?)</p>
<h1 id="challenge5">Challenge 5</h1>
<pre><code>import pickle
from urllib.request import urlopen  # urllib2 in Python 2

src = &quot;http://www.pythonchallenge.com/pc/def/banner.p&quot;

test = pickle.load(urlopen(src))
for line in test:
    print(&quot;&quot;.join([k * v for k, v in line]))
</code></pre>
<p>This is clever hahahaha. I didn&apos;t think of the solution myself (thanks for the <a href="https://www.hackingnote.com/en/python-challenge-solutions/level-5">reference</a>!), but with an extension like <code>.p</code> I should have realized that it&apos;s related to pickle. Once you unpickle it, the way to put it back together also amuses me (a list of tuples, not fun to deal with, huh).</p>
<h1 id="challenge6">Challenge 6</h1>
<!--kg-card-end: markdown--><p></p>]]></content:encoded></item><item><title><![CDATA[Cloudflare integration, and more...]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This is my very first post in China (and hopefully not the last one lol). So far so good, except that my server IP was recently blocked due to service misuse (<a href="https://shadowsocks.org/">more info</a>). Have learned several things along the way, but first I need to rescue my blog &amp; portfolio</p>]]></description><link>https://blog.ruosilin.com/cloudflare-integration-and-more/</link><guid isPermaLink="false">5cfa6e2cecf5420866118c6a</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Fri, 07 Jun 2019 14:52:05 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This is my very first post in China (and hopefully not the last one lol). So far so good, except that my server IP was recently blocked due to service misuse (<a href="https://shadowsocks.org/">more info</a>). Have learned several things along the way, but first I need to rescue my blog &amp; portfolio site so that they are accessible in China. As a side note, the domain was purchased from <a href="https://www.namecheap.com/">Namecheap</a> and DNS was configured under <a href="https://www.digitalocean.com/">DigitalOcean</a>.</p>
<h2 id="tldr">TL;DR</h2>
<ul>
<li>Use Cloudflare to hide your server IP, protect your sites from potential attacks, and speed up loading in different parts of the world.</li>
<li>Don&apos;t force a single VPS to wear multiple hats (like web hosting, mail server, and SS).</li>
<li>If your VPS image handles both web hosting and mail service, configure the DNS records on Cloudflare carefully to make sure the mail server keeps working.</li>
</ul>
<h2 id="whycloudflare">Why Cloudflare?</h2>
<p>This is <a href="https://www.cloudflare.com/learning/what-is-cloudflare/">what they said</a>...</p>
<blockquote>
<p>Cloudflare also provides security by protecting Internet properties from malicious activity like DDoS attacks, malicious bots, and other nefarious intrusions.</p>
</blockquote>
<p>Basically it is a free <strong>content delivery network (CDN)</strong> that speeds up page delivery based on geolocation. In my previous DNS setup, the blog subdomain pointed directly to the server IP - a simple <code>ping</code> would uncover the IP immediately. It was not until the IP was blocked in China that I realized how dangerous this was. With Cloudflare, the domain name no longer exposes the server IP. Since the domain is now handled by Cloudflare, my websites should be viewable in China (and even faster!) as long as their associated IPs are not blocked. <em>It is my understanding that the associated IP changes periodically (or is location-based?), though I haven&apos;t confirmed this.</em></p>
<h2 id="howtoconfigurecloudflare">How to configure Cloudflare?</h2>
<p>Simple -</p>
<ol>
<li>Register an account on Cloudflare</li>
<li>Enter domain names. Cloudflare will automatically scan existing DNS records for you.</li>
<li>Confirm that the captured DNS records match with your current setting.</li>
<li>Update your nameserver information at your domain provider.</li>
<li>Wait a bit (for DNS propagation), and you&apos;re good to go!</li>
</ol>
<p>It really takes less than 5 minutes. <a href="https://support.cloudflare.com/hc/en-us/articles/201720164-Creating-a-Cloudflare-account-and-adding-a-website">This guide</a> provides more details.</p>
<h2 id="ughhardlessons">Ugh, hard lessons</h2>
<p>This is my first time setting up a VPS, so I was greedy: I used it for web hosting, mail forwarding, and a shadowsocks (SS) server. <strong>THIS IS NOT A GOOD PRACTICE</strong>. Specifically for Digital Ocean, it is encouraged to separate these tasks into different droplets (so that if one goes down, the others stay intact and function normally). But for me, since all these features lived in the same droplet, their performance depended on one another. When the IP was blocked in China (due to SS), requests sent to my hosted websites timed out indefinitely. (It is said that the IP will be removed from the blacklists sometime in the future, but as of today it hasn&apos;t been.) My workaround was to use a CDN (i.e. Cloudflare), but then all DNS records are maintained by Cloudflare instead. The provider did a great job auto-scanning and importing existing DNS rows from Digital Ocean - <strong>except for the mail server.</strong></p>
<p>The moment I turned Cloudflare on, I stopped receiving any emails from my domain mailbox. I didn&apos;t realize this until two days later, when Gmail reported that my test email was undeliverable. The error message is displayed below:</p>
<p><img src="https://blog.ruosilin.com/content/images/2019/06/mailfailed.JPG" alt="mailfailed" loading="lazy"></p>
<p>I then found that the 4 IPs belong to Cloudflare. Below is a screenshot of my previous DNS setting:</p>
<p><img src="https://user-images.githubusercontent.com/16634756/59111962-b1aa1180-8907-11e9-8835-311b8ac25746.png" alt="previous-DNS" loading="lazy"></p>
<p>Notice the two rows highlighted in yellow. These records determine how mail DNS works and I got both wrong.</p>
<p>When the <code>A</code> record for mail has a grey cloud, Cloudflare warns that the record exposes my IP. This is the right way to go, though: Cloudflare does not host my mail server, so no HTTP proxy is required (only DNS). Moreover, the <code>name</code> of the <code>MX</code> record should be the domain name. I spent two hours trying to figure out why Postfix suddenly stopped forwarding emails to my personal mailbox, only to find that it wasn&apos;t Postfix&apos;s fault but a DNS issue :P. Now I have it set up this way and the mail server is working as usual:</p>
<p><img src="https://user-images.githubusercontent.com/16634756/59112693-23368f80-8909-11e9-986f-9be52a7165b8.png" alt="now-DNS" loading="lazy"></p>
<h2 id="inshort">In short...</h2>
<p>Don&apos;t be aggressive and put all your eggs in one basket. If you do and it fails, take the failure as a chance to learn and remedy it right away lol.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[*Graduated!*]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The day has FINALLY come. Bye school! It&apos;s like a dream. Haven&apos;t seen my parents for 20 months and they flew over just to show up &amp; support. Super grateful &#x1F497;</p>
<p>Missions accomplished during the past 2 years:</p>
<ul>
<li>Graduated with 4.0 GPA! (10 classes in</li></ul>]]></description><link>https://blog.ruosilin.com/graduated/</link><guid isPermaLink="false">5cd8e0b203b8990840fc5446</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Mon, 13 May 2019 03:31:26 GMT</pubDate><media:content url="https://blog.ruosilin.com/content/images/2019/05/IMG_20190319_190511.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.ruosilin.com/content/images/2019/05/IMG_20190319_190511.jpg" alt="*Graduated!*"><p>The day has FINALLY come. Bye school! It&apos;s like a dream. Haven&apos;t seen my parents for 20 months and they flew over just to show up &amp; support. Super grateful &#x1F497;</p>
<p>Missions accomplished during the past 2 years:</p>
<ul>
<li>Graduated with 4.0 GPA! (10 classes in total)</li>
<li>Completed 2 internships and received return offers from both (and felt so sorry to turn one down. They were both AMAZING!)
<ul>
<li>One of the internships was related to <em>data science</em> - the primary reason why I went back to school! I had almost given up hope on that (thought I&apos;d have to obtain a PhD first before even trying to submit an application LOL); maybe I could write a post sharing my experience in the near future ...?</li>
</ul>
</li>
<li>Got to work with talented people. My professors are not only knowledgeable but also kind. I&apos;ve learned a ton from them outside of the academic subjects~</li>
<li>Made new friends. Hopefully our relationships will be long lasting!</li>
<li>It&apos;s just great to be an Aggie.</li>
</ul>
<p>Bye, College Station. I will carry the memory along and step forward. Next step: Dallas~</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.ruosilin.com/content/images/2019/05/IMG_20170825_151156.jpg" class="kg-image" alt="*Graduated!*" loading="lazy"><figcaption>My very first time on Military Walk, Aug 2017</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[How to: migrate from Jekyll]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Today marks my 2nd day with Ghost &#x1F618; I spent nearly a whole night yesterday trying to figure out how to migrate from my current Jekyll site (hosted under Github pages). All my technical notes are stored there; I did think about moving my wordpress.com content over, but that&</p>]]></description><link>https://blog.ruosilin.com/how-to-migrate-from-jekyll/</link><guid isPermaLink="false">5cce33bb504b022a7c669760</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Sun, 05 May 2019 01:25:37 GMT</pubDate><media:content url="https://blog.ruosilin.com/content/images/2019/05/jtog.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.ruosilin.com/content/images/2019/05/jtog.png" alt="How to: migrate from Jekyll"><p>Today marks my 2nd day with Ghost &#x1F618; I spent nearly a whole night yesterday trying to figure out how to migrate from my current Jekyll site (hosted under Github pages). All my technical notes are stored there; I did think about moving my wordpress.com content over, but that&apos;s another story (I once wrote a post about <a href="https://mekomlusa.wordpress.com/2016/05/15/deal-with-wordpress-extended-rss-wxr/">how to effectively extract information from WXR</a> - in Chinese, though). Nonetheless, it turned out to be a much more challenging task.</p>
<p><img src="https://blog.ruosilin.com/content/images/2019/05/Capture05042019.JPG" alt="How to: migrate from Jekyll" loading="lazy"><br>
<em>My current Jekyll site. Simple yet powerful.</em></p>
<p>I have a good starting point: the <a href="https://github.com/mattvh/Jekyll-to-Ghost">unofficial plugin</a> recommended on the <a href="https://docs.ghost.org/api/migration/#non-official">migration guide</a>. However, the script is seriously outdated (last updated 5 years ago); it&apos;s not surprising to find that it failed in multiple places.</p>
<h2 id="challenge1runjekyllbuild">Challenge 1: Run <code>jekyll build</code></h2>
<p>This is for Windows 10 users only. As a ThinkPad user I really don&apos;t want to compromise and install Jekyll using some tricks. The <a href="https://jekyllrb.com/docs/installation/windows/#installation-via-bash-on-windows-10">official Jekyll docs</a> provide another way out: using Bash.</p>
<h2 id="challenge2undefinedmethodgetconverterimpl">Challenge 2: <code>undefined method &apos;getConverterImpl&apos;</code></h2>
<p>This method seems to have been deprecated after Jekyll 1.4.1 (<a href="https://www.rubydoc.info/gems/jekyll/1.4.1/Jekyll/Site:getConverterImpl">this</a> is the last reference I&apos;ve found); the fix is to manually update this line to <code>find_converter_instance</code> (<a href="https://github.com/mattvh/Jekyll-to-Ghost/issues/7">source</a>).</p>
<h2 id="challenge3importerror">Challenge 3: Import error</h2>
<p>See the screenshot below. I also noticed people reporting the same issue on the <a href="https://forum.ghost.org/t/import-from-wordpress-to-ghost-2-0/3097">forum</a> (not Jekyll, but we have the same symptom).</p>
<p><img src="https://user-images.githubusercontent.com/16634756/57186652-e100d500-6ea8-11e9-9825-c0d51691b731.png" alt="How to: migrate from Jekyll" loading="lazy"></p>
<p>Looks like the workaround is to grab a Ghost 1.0 site on a Docker image. I don&apos;t know how to use Docker yet (willing to learn) but really didn&apos;t want to go through the setup again (partially because my local machine is not Linux/Mac &#x1F602;). So... I spent some time diving into the <a href="https://docs.ghost.org/api/migration/#example">guide</a> again and found which elements were missing.</p>
<ul>
<li>The fields required under <code>post</code> were changed. Especially the <code>&quot;mobiledoc&quot;</code> part.</li>
<li>Existing script does not include any author information, which <strong>might</strong> exist if the user has specified <code>authors.xml</code> under the <code>_data</code> folder (<a href="https://dev.to/m0nica/how-to-add-author-bio-to-posts-in-jekyll-3g1">source</a>).</li>
</ul>
<p>Initially I wanted to write another script in my favorite language (Python, yes) to further process the generated JSON. But then I realized: why not just modify the Ruby file directly? FYI, the last time I wrote any Ruby was back in Fall 2017; I can still (somehow) read the code, but had no confidence writing it. Yet I managed to enhance the plugin so that it works for Ghost 2.0 import, and you can find the code and usage instructions in my <a href="https://github.com/mekomlusa/Jekyll-to-Ghost">repo</a>.</p>
<h2 id="andfinally">And finally...</h2>
<p>All my 8 posts were migrated successfully! Just a side note: <strong>when manually adding the author information, make sure that the <code>slug</code> field has the same slug as you set in your current dashboard.</strong> That way, Ghost will automatically link the current users with all imported posts.</p>
<p><img src="https://user-images.githubusercontent.com/16634756/57186748-89fbff80-6eaa-11e9-9665-cc2378021f05.png" alt="How to: migrate from Jekyll" loading="lazy"><br>
<em>My site now with the two sample imported posts.</em></p>
<p>I still need to update the script so that it knows how to extract author info from Jekyll. If you have an idea how to do so, please let me know (I have been trying different combinations with no luck)! Happy blogging &amp; coding &#x1F600;&#x1F600;&#x1F600;</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Hello World!]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Testing to see if <a href="https://ghost.org/">Ghost</a> is up and running.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.ruosilin.com/content/images/2019/05/cat20170808--34-.jpg" class="kg-image" alt loading="lazy"><figcaption>Little princess feeling dizzy when we were eating.</figcaption></figure>]]></description><link>https://blog.ruosilin.com/hello-world/</link><guid isPermaLink="false">5ccc6a692fe0bf39e9a3aa22</guid><dc:creator><![CDATA[Rose Lin]]></dc:creator><pubDate>Fri, 03 May 2019 16:21:19 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Testing to see if <a href="https://ghost.org/">Ghost</a> is up and running.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.ruosilin.com/content/images/2019/05/cat20170808--34-.jpg" class="kg-image" alt loading="lazy"><figcaption>Little princess feeling dizzy when we were eating.</figcaption></figure>]]></content:encoded></item></channel></rss>