Gradients

How to Write an Impactful Statement of Purpose

2023-12-01T00:00:00-08:00

As the grad school application season unfolds, students worldwide, including yourself, are diligently preparing their applications. The statement of purpose (SOP) stands out as the most crucial and anxiety-inducing component of this process.

Typically, SOPs follow a formulaic pattern, delving into past and future research endeavors and potential advisors. Yet, considering the volume of applications discussing similar topics, your meticulously crafted statement may end up being quickly perused and grouped with others.

Let’s cut to the chase on what you should include in your SOP to make it stand out and keep the admissions committee hooked.

The overall structure

Every effective pitch contains a set of common elements, and the ideal SOP is no different than an effective pitch. Your SOP will hit the following points:

Something about you that makes you more than another piece of paper.
Your goals in grad school.
Why you’re the right fit for it.
Why the institution is your ideal playground.
What value you bring that justifies their investment.

Who are you?

Check out the CS PhD Statement of Purpose website, an excellent resource that curates many statements of purpose so you can see what the SOPs of students accepted to various institutions look like. Note that many lack personal insights, focusing solely on research. Differentiate yourself by incorporating anecdotes about your research interests’ origins or extracurricular activities, ensuring relevance to your overarching narrative.

If you can get the application reader to know a bit about who you are, it will go a long way towards helping you stand out from the crowd of other supremely motivated individuals. Include something about how you found your current research interest, or what you find yourself doing outside of work. As long as it ties into your overall story, these personalizing touches will help the reader stay engaged.

However, avoid cliches such as “I was interested in solar exploration since I was a child.” We all know you were interested in LEGOs and tea time with your stuffed animals.

What are you going to do in grad school?

A statement of purpose must include a purpose. Your Ph.D. program will be a time of deep discovery. You will learn a lot about yourself and have an immense amount of time to come up with new methods to push your chosen field forward. This is not a commitment you will make lightly. Therefore, it is critical to have a well-thought-out section in your SOP that explains what it is, precisely, that you aim to pursue in grad school.

It’s not sufficient to say “I want to build new satellites that help harness the Sun’s energy.” You have to elucidate a game plan: what is the limitation with current methods, why is that bad, and what are you going to do about it? This involves performing a literature review, analyzing failures, and forming hypotheses. Essentially, you will shine in your SOP by demonstrating a thoughtful research process in the very act of writing it.

However, maintain realistic goals. If you make your research goal unbelievable, like “I am going to land a person on the Sun,” then the lofty goal may overshadow the rest of your SOP, risking it being skimmed.

Why you?

Every SOP requires a persuasive self-portrait. You are competing against other highly qualified people. You just wrote down a lofty, yet achievable, research vision. What makes you qualified to execute on it? You want to dedicate a paragraph or two explaining what makes you special. Whether it’s your social acumen or project management skills, emphasize what sets you apart and makes you an asset.

It’s critical to explain how your perspectives, ideas, and workflows put you in an ideal position to solve the hard problems detailed in your research vision. If you can write about what differentiates you from the crowd, you can convince the application reader that you are one of the few people in the world who is uniquely qualified to tackle this problem. This is yet another place where you can personalize your SOP and really pop off the page.

Why them?

It’s good practice to have your SOP directly address a few, select faculty members and explain how the written research vision aligns with their labs’ work. However, most SOPs fail to explain how the institution at large will contribute to the success of your research and your growth as a researcher.

More likely than not the institution you are applying to is in the middle of some re-alignment that will make them an increasingly multi-disciplinary place of study (who isn’t, these days). Contextualize your presence on campus, emphasizing your ability to leverage interdisciplinary resources. Be sure to explain how you will be able to utilize the vast amount of resources available to you as a graduate student across the entire campus, not just in your lab or department. Faculty work really hard to get funding for these initiatives–detailing how you will benefit from this work can help the application reader contextualize your presence and impact on campus.

What are they getting out of you?

This section is perhaps the most crucial yet often overlooked aspect of the SOP. Realistically, the institution you’re applying to will invest a substantial sum—potentially over half a million dollars—over the next five years to educate and support you. This sizable investment demands careful consideration, akin to any strategic financial decision, focusing on the return on investment (ROI).

The concept of ROI in this context is broad, some of which should be addressed explicitly or implicitly in your SOP. Primary among these is your potential for academic output. With your writing in previous sections, you have already made this extremely clear. A good research vision and the way that you are qualified to make it reality is largely sufficient to meet the mark.

However, academic prowess alone is not sufficient. Institutions also want to ensure that the admitted cohort is diverse, dynamic, and capable of enriching the broader student body. Your unique experiences, perspectives, and potential contributions to campus life beyond the classroom should be underscored. Discuss how your involvement in extracurricular activities, leadership roles, or community engagement can positively impact the overall student experience.

Furthermore, consider how your interdisciplinary interests and collaborative nature align with the institution’s values. Schools are increasingly aiming for a multidisciplinary approach to research and education. Addressing how you plan to leverage the vast resources available on campus, collaborating across departments and disciplines, can set you apart. Highlighting your adaptability and willingness to engage in cross-disciplinary initiatives demonstrates an understanding of the institution’s evolving academic landscape.

Your commitment to diversity, equity, and inclusion is another valuable aspect to emphasize. Convey how your unique background and perspective contribute to fostering a more inclusive academic environment. Institutions actively seek individuals who can bring different voices and ideas to the table, creating a vibrant and intellectually stimulating community.

In essence, this section of your SOP should paint a comprehensive picture of the multifaceted contributions you bring to the institution. Set yourself apart from the purely academically brilliant candidate by showcasing a holistic approach to your role as a graduate student. Address how your presence aligns with the institution’s overarching mission, fostering not only academic excellence but also a diverse and enriched campus environment.

By addressing these nuanced aspects, you not only demonstrate a clear understanding of the mutual benefits of your enrollment but also position yourself as an investment worth making! Your SOP becomes a compelling case for why the institution should not only educate you but also eagerly welcome you as a valuable and dynamic addition to their academic community.

Good Luck, and Take a Breath!

Alright, I know this whole grad school application thing can feel like you’re navigating a maze blindfolded while juggling flaming swords. Revisions, Google Doc comments coming at you from all angles, and don’t even get me started on finals – it’s a whirlwind.

But here’s the deal: amidst the chaos, take a moment. Inhale, exhale, and remind yourself of your story. Your SOP is not just a piece of paper; it’s your narrative, your journey, your ticket to the next chapter. So, when the folks editing your SOP read it, make sure they see you in a new light. If they get you a bit better after reading it, you’ve nailed it.

Remember, stressing after hitting submit won’t change a thing. The decision’s out of your hands. So, find something to do with your newfound free time. Binge-watch a series, pick up that neglected hobby, or just chill – whatever floats your boat.

Now, cross your fingers, hope for those Feb/March interview emails, and embrace the uncertainty. You’ve put in the work; now let the universe (and the admissions committee) do its thing.

Good luck, you got this! 🚀✨

Post-Script

My statement of purpose for grad school is available here as reference.

The Department of Defense is Prioritizing Open Source Software. Here’s How Open Source Projects Can Benefit.

2022-02-07T00:00:00-08:00

On January 26, 2022, the new Chief Information Officer (CIO) of the U.S. Department of Defense (DoD), John B. Sherman, released a memo to the entire Department titled “Software Development and Open Source Software”. In this memo, the CIO addresses two primary concerns: 1) using open source software (OSS) introduces supply chain risks for DoD software programs, and 2) sharing DoD code via open source channels without proper checks enables potential leaks of proprietary DoD information to adversaries. In laying out how these two concerns should be addressed properly, the CIO categorizes OSS into a unique position, one which can be utilized by OSS foundations and project maintainers to gain funding for their essential contributions.

Open Source Software in the DoD

The U.S. DoD, much like other large organizations, is a user of OSS in almost every part of its software toolchain. From Firefox to access the internet, Kubernetes to orchestrate containers on internal data centers and the public cloud, curl to make HTTP requests, PyTorch to create and run machine learning models, the examples of OSS in use across the DoD are endless. In addition to being a first-party user of these software and toolchains, the DoD also uses OSS daily as components of third-party software acquired from contractors.

Beyond just being a consumer of OSS, the DoD is also a producer of OSS, such as the utilities used by the DARPA Cyber Grand Challenge, machine learning models created for the Defense Innovation Unit’s xView series of prize challenges, and even the DoD’s OSS code.mil website itself. Taking into account repositories open sourced on DoD contracts, the organization contributes a large amount of software into the public domain.

Lack of OSS Funding

Even though the DoD uses and creates a lot of open source software, it rarely funds OSS initiatives itself. Besides rare examples such as DARPA funding the Open Source Robotics Foundation for support and development on their programs, money from the DoD does not flow to support the further development of critical open source technologies. In my opinion, there are two reasons for this:

OSS projects do not fit within the traditional acquisitions framework, therefore making it difficult for the DoD to give contracts to projects with no responsible entity.
More opinionated, but with no money on the line, the DoD has no incentive to pay for services it is otherwise receiving for free. In my experience, government programs take free things for granted, whereas commercial products, even if terrible, get much more care and interest.

An Opportunity for Funding

In the CIO’s memo, OSS is elevated to the forefront of software acquisitions.

The Department must follow an “Adopt, Buy, Create” approach to software, preferentially adopting existing government or OSS solutions before buying proprietary offerings, and only creating new non-commercial software when no off-the-shelf solutions are adequate.

This is incredible news in itself. DoD program managers (PMs) must now evaluate the landscape and verify that no existing OSS solution solves their problem before launching a new program that could potentially waste taxpayer dollars. What about instances where the OSS solution does not 100% fit the needs of the PM’s program? In this case, a contractual mechanism must be put into place to acquire, develop, and maintain this new development. To this end, the CIO states that

OSS meets the definition of “commercial computer software” and therefore, shall be given equal consideration with proprietary commercial offerings, in accordance with Section 2377 of Title 10, U.S.C.…

By ensuring that OSS solutions are considered commercial software, it follows that DoD PMs should include OSS as legitimate competitors against commercial offerings. With acquisitions mechanisms such as the Other Transaction Authority (OTA) that impose softer requirements than traditional FAR-based methods being leveraged by DoD innovation organizations such as the Defense Innovation Unit and various Combatant Commands, OSS foundations can be in competition for much-needed funding from the DoD directly.

By cutting out the middlemen who re-package OSS solutions and sell them to the DoD, OSS foundations can fully realize the monetary investment currently being made indirectly in their technologies.

How to Get DoD Funding

Of course, even with this memo, there will be a lot of reticence within the DoD to award contracts to faceless OSS projects. Here is my guidance on how to make sure that your project has a chance of being considered for DoD funding.

If you have a large project that is getting significant usage, incorporate a 501(c)(3) organization which will officially manage the project. Funding for the effort will live in this 501(c)(3) and will require proper governance structures. As a large project, you probably have some governance already in place. While the DoD strongly prefers companies to be U.S.-based for contracts, this is not an entirely excluding factor.
Create a minimal slide deck with your project’s core offering. With an overhead as low as 5-10 slides, this will capture what your project does, and how it can be applicable in broader settings. When reviewing capabilities, DoD PMs will better be able to conceptualize how your project fits into their program. If a slide deck is too much effort, at the least include a FUNDING.md in your project’s code base so more technically savvy PMs can build a justification using your own words.
Leverage agencies which wield the OTA and apply judiciously to their calls for proposals. Organizations like the Defense Innovation Unit regularly post solicitations for programs which have a strong software component, and a slide deck is all you need to be considered.

The Ethical Considerations of Working with the DoD

There is debate about our responsibility as software engineers when it comes to the ethical use of the software we create. Receiving DoD funding directly may raise flags when it comes to the development of your OSS project.

Pragmatically, however, the DoD is already using your software. In its current state, third parties with no consideration for the proper use of your software are packaging it up and selling it with no say from the project maintainers. By directly accepting DoD funding and being a first-party provider of technology, you can ensure that the DoD is using your software in a fair and responsible manner.

With the DoD adopting principles and guidelines for the ethical use of complicated AI software toolchains, there is support internally to ensure that software is used responsibly. By participating in this process first-hand, you can help the DoD set a framework for similar technology uses in the future.

Closing

This memo is extremely young, and its actual implementation remains to be seen. However, the future of OSS within the DoD is bright, and riding this wave to benefit the community at large is in the interest of everyone who depends on OSS around the world.

Edit: There is discussion about this post on Hacker News. I encourage people to engage; there are many things I have not thought about that could go a long way into creating a framework for DoD funding of OSS projects.

Thoughts About My High School’s Math Curriculum

2019-02-04T00:00:00-08:00

I went to high school at Thomas Jefferson High School in Pittsburgh, PA (not the famous one in Alexandria, VA). I wrote an email to my math teachers there with my thoughts on our high school math curriculum, now that I’ve been through undergrad, am currently taking grad courses, and have a full-time job doing math for a living. I’m pasting the emails as-is here, with some additional formatting. Please let me know if you have any comments or criticisms.

Hey,

In my job as an artificial intelligence researcher, I am exposed to and use math every day. Usually this takes the form of probability and mathematical statistics, calculus 2 stuff and calc 3 partial derivatives and double integrals, linear algebra, and then some conglomerate stuff like vector calculus, calculus of variations (rarely), and differential equations (rarely).

I took a whole bunch of math in my undergrad, and now I’m taking some grad level math courses at CMU. Even with those courses, I had to self-teach myself the majority of the math I now use daily, which meant I had to fall back on my existent knowledge. I realized today that there were things I wished I learned in high school, or at least got an intuition or appreciation for, that would help me today.

1) Proofs. In high school, I was only ever introduced to two-column proofs, which no one ever uses in real life (unless they’re doing computational proofs). The most useful proofs I have seen, and struggle with the intuition at times, are induction and contradiction. If the principles of these proofs could be introduced in an intuitive way in high school, I bet kids would be much more capable of taking small concepts and generalizing them to large lemmas.

2) Math notation. As you know, everyone and their mother just makes notation up and uses it however they want, as long as they’re consistent within their textbook, research papers, etc. It took me forever, and this is something I still struggle with, to learn to read math. It’s like its own language. I think it’s something a lot of people struggle with. This handout (http://pi.math.cornell.edu/~hubbard/readingmath.pdf) helped me create a path towards understanding how to parse these arbitrary symbols in a reasonable manner. I would have loved to see this during high school.

3) “Tricks”. Every time we’re proving something or deriving something, little math hacks always show up. Re-arranging equations to help with elimination, multiplying by inverses which have the effect of multiplying by 1, but again, help with elimination of variables, and more. These tricks, unless I’ve seen them before, completely stump me. I have no idea when I’m staring at a piece of paper for the first time where and how I should use a trick. Learning to identify these patterns earlier in my mathematical life would’ve been a game changer. I know this is a difficult one, because it’s not something you can just teach. But just exposing students to “this is a thing that you’ll see every day in math, and the textbooks will just gloss over it,” would be amazing.

4) Converting word problems to math. This is a fundamental skill we work on since elementary school, but for some reason I found certain things in high school suddenly stopped doing this. Sometimes when I’m given a real-world problem, like “Estimate the average number of people leaving an area given that a storm of a certain intensity is approaching,” I find myself struggling to turn that into actual math. After a while I’ll eventually realize that this is just conditional expectation, but then choosing the correct random variable to define certain parts of the problem, etc., is difficult. Obviously kids aren’t doing all this stuff at TJ (yet, I know probability is a course to be offered soon), but the general point is that this is an important skill. You guys already do a good job of it, but I would have loved to see more of it versus rote usage of formulas (like on the AP test).

5) Inequalities. I know in pre-calc we have a whole section on inequalities, but I guess I mean more advanced usages of them. Let me start with an example. When we’re working with dependent joint distributions, calculating the probability of some \((x,y)\) pair results in a double integral where the bounds are in terms of \(x\) and \(y\) themselves. Therefore, the support of the joint distribution is also defined in terms of \(x\) and \(y\), usually in the form of an inequality. Converting the integral bounds to what the support should be, and vice versa, is non-obvious to my mind, since the only time I exhaustively did inequalities was in high school. This may be a failing of my undergrad education, but it may be an opportunity for high school to succeed. Obviously my example is highly contrived and way too advanced for high school, but I can imagine simpler examples where inequalities like \(x \leq y \leq 1\) would come up, and a high schooler should be able to reason about those.

6) Sums, their notations, and their uses. I’m pretty sure we learned about arithmetic and geometric sums in pre-calc, but they’re used so often in math that they should be a bigger deal. People are still so scared of the “big E” (\(\sum\)) and the “big pi” (\(\prod\)) after leaving high school that I think we’re doing a disservice by not introducing the intuition behind sums and their notation a little bit better. As far as I remember, we talked about arithmetic sums, did a toy example, then jumped into the sum identities (like factoring out constants, etc.). Most people will probably use sums in their careers, and I think some extra intuition and exposure will go a long way with making people comfortable with sums.

Not all of these thoughts are valid, or actionable. They’re just things I wished I knew more about in high school, because I think having these tools would’ve made the rest of my math education that much easier. You both are the sole reason why I love math, and you both do a fantastic job already. I hope that any of this feedback is helpful. Let me know if I can clarify, I’m just rambling at 2:30 in the morning.

Also, what the heck is a lemma, claim, theorem, proof, axiom, etc., anyways? I didn’t know that in high school. All of these terms showed up in textbooks when Matt Brock and I tried to teach ourselves Calc 2 for the AP Calc BC exam.

Setting up people to succeed when self-studying, in my opinion, is the only way to make successful TJ mathematicians. If you can’t read a textbook, you’re stuck being dependent on a teacher, which means when college hits, you’re screwed. How can we set up students to self-study effectively?

Thanks for everything and setting me up for success.

Ritwik

Exploring Deep Learning: Theory and Practice

2017-10-05T09:03:00-07:00

This is a page for my talk given at the CMU Data Science Club at Doherty A302 on October 5th, 2017.

Data

Grab all the data from the Kaggle Diabetic Retinopathy challenge here. Note that this is a large dataset, and if you’re downloading this on CMU-SECURE, you will go over your daily bandwidth limit.

Prerequisites

Python 3.5.X

Setup

Open your terminal/command prompt.
Verify Python 3.5.X is installed by running python -V (or python3 -V for systems where Python is alt-installed).
Create a folder to run the project in.
Install virtualenv using pip install virtualenv.
Install and activate your virtual environment.
1. Install: virtualenv venv
2. Activate:
  - Linux/Mac: . venv/bin/activate
  - Windows: venv\Scripts\activate
  - (you’ll now have a running virtual environment. Execute deactivate at any time to leave the venv.)
Install the following packages via pip (By running pip install package-name):
- ipython[all]
- numpy
- tensorflow (or tensorflow-gpu, if you have a supported GPU)
- keras
Launch your iPython Notebook using jupyter notebook --no-browser (on Python 2.7.X, this is ipython notebook --no-browser).
Go to the URL printed on the terminal and paste it in your browser, if your browser wasn’t already automatically launched.
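Before opening the notebook, it can help to confirm that the virtual environment actually has everything from the setup steps. The snippet below is only a sketch and is not part of the talk materials: the helper name check_environment is made up for illustration, and it only checks that each package can be found, not that it is a compatible version.

```python
import importlib.util
import sys

# Import names for the packages installed above.
# ("IPython" is the import name of the ipython package.)
REQUIRED = ["IPython", "numpy", "tensorflow", "keras"]


def check_environment(required=REQUIRED):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in required}


if __name__ == "__main__":
    print("Python", ".".join(str(part) for part in sys.version_info[:3]))
    for name, found in check_environment().items():
        print(name, "OK" if found else "MISSING")
```

Run it with the virtual environment activated; anything reported as MISSING can be installed with pip install package-name.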

Notebook

If you’re on mobile, view the notebook here.

Slides

These slides are for reference. A recording of the talk should exist with the CMU Data Science Club. If you’re on mobile, view the slides here.


Making a Self Driving Car - Part 1

2017-05-05T02:05:14-07:00

I just graduated from the University of Pittsburgh and have some time to relax before starting my job, so my dad and I decided to build a self-driving car. We think we can do it, and even if we can’t, at least we’ll learn some cool stuff while doing it. The goal is to start simple and build as we go.

Step 1 - Car Data

Before doing anything in the world of machine learning, we need to:

Understand the problem.
Understand the required data.
Get the data.

Since we’re starting very simply, we aimed to create a self-driving car that would simply look at the road and turn the steering wheel by itself. We would manually provide braking and acceleration as needed. But first, we need a lot of training data. There is a terrific paper by NVIDIA titled “End to End Learning for Self-Driving Cars” that details a very simple self-driving car. It relies on camera input and the angle of the steering wheel as inputs to a CNN, which produces steering angle as an output. Let’s work on getting the steering angle!

CAN Bus

Modern cars are computers with wheels. They are made up of many microcontrollers and ECUs (electronic control units) that control various systems in the car, from the engine and the drivetrain to the infotainment systems and the air conditioning. These controllers read and write data to their connected systems in order to control the car.

Having so many systems is a networking nightmare. Data has to be sent around the car since it is used in multiple places, which means miles of wiring underneath the car. This is not only inefficient, but also cost ineffective. To solve this issue, the CAN Bus was created.

Image obtained from Volkswagen Audi.

(Note: the following is not a comprehensive CAN guide, it’s just enough to give you the gist of what’s going on! I learned the majority of this from Wikipedia, car repair manuals, and this wonderful presentation by Volkswagen Audi.)

CAN, standing for controller area network, is a protocol in which all control units in the car send digital signals over a copper wire bus (in the future, this will be fiber optics) at a max rate of 1,000 kbps. There are generally three separate CAN systems in a modern vehicle:

High-speed drive train CAN Bus generally operating at 500 kbps (500,000 baud).
- Brakes, steering wheel angle, ECUs, etc.
Low-speed infotainment CAN Bus generally operating at 100 kbps (100,000 baud).
- Radio, navigation, etc.
Low-speed convenience CAN Bus generally operating at 100 kbps (100,000 baud).
- A/C, seat memory, etc.

The CAN bus works like an argument at a bar: everyone has stuff to say, so everyone starts yelling at once. Whoever starts yelling first wins after a bit and everyone stops talking and listens to them till they’re done, and then the yelling starts again.

Image obtained from Volkswagen Audi.

There are two states the CAN bus can be in: low (0V, dominant) and high (5V, recessive). When no one is yelling, the CAN bus will read 5V, otherwise there will be constant digital chatter. Each message carries a specific format which I will not go into, but all we need to know is that:

All messages are sent to all control units, aka, CAN is a broadcast system.
Messages are filtered by their message headers, indicating the type of information.

In short, this means that control units are aware that someone else is talking, but only actually listen to the conversations pertaining to them.
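To make the broadcast-and-filter idea concrete, here is a toy simulation in Python. It is a sketch under loud assumptions: the Frame, ControlUnit, and Bus names are invented for this example, the header values are hypothetical, and nothing here models real CAN framing, timing, or voltage levels. It only shows that every unit sees every message and keeps the ones whose header it cares about.

```python
from collections import namedtuple

# A heavily simplified message: a numeric header plus a payload.
Frame = namedtuple("Frame", ["header", "data"])


class ControlUnit:
    """A toy control unit that keeps only frames whose header it accepts."""

    def __init__(self, name, accepted_headers):
        self.name = name
        self.accepted_headers = set(accepted_headers)
        self.received = []

    def on_frame(self, frame):
        # Every unit sees every frame; filtering happens at the receiver.
        if frame.header in self.accepted_headers:
            self.received.append(frame)


class Bus:
    """A toy broadcast bus: each frame is delivered to every attached unit."""

    def __init__(self):
        self.units = []

    def attach(self, unit):
        self.units.append(unit)

    def broadcast(self, frame):
        for unit in self.units:
            unit.on_frame(frame)


# Hypothetical headers, chosen only for this example.
STEERING_ANGLE = 0x025
RADIO_VOLUME = 0x310

bus = Bus()
logger = ControlUnit("steering logger", [STEERING_ANGLE])
radio = ControlUnit("radio", [RADIO_VOLUME])
bus.attach(logger)
bus.attach(radio)

bus.broadcast(Frame(STEERING_ANGLE, b"\x00\x10"))
bus.broadcast(Frame(RADIO_VOLUME, b"\x07"))
bus.broadcast(Frame(STEERING_ANGLE, b"\x00\x12"))

print(len(logger.received), "steering frames;", len(radio.received), "radio frame")
```

The logger here plays the role we want for ourselves: a listener that ignores all chatter except the one header it is interested in.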

Snooping on the CAN

Now that we know that our desired steering wheel angle data is to be obtained from the CAN Bus at 500,000 baud, we have to snoop on the CAN Bus to listen to all device chatter and intercept the message header pertaining to the steering wheel angle.

As of 1996, every car in the US is required to be shipped with an OBDII (on-board diagnostics v2) port. This connector has 16 pins, but most importantly, it has two pins for CAN High and CAN Low.

OBDII port. Image obtained from obd2allinone.com.

My thought was that if I could hook up a serial DB9 cable to those pins in my 2000 Toyota Camry’s OBDII port, I could read the CAN signals on my laptop and record all the data. Using the following pinout, I built the franken-cable (note to kids: this is super hacky and a piece of paper is not proper shielding).

Pin description	OBDII	DB9
Chassis Ground	4	2
Signal Ground	5	2
CAN High	6	3
CAN Low	14	5
Power	16	9

Horribly compressed image of the franken-cable. Use at your own risk.

Check Your Assumptions

This method would have worked had it not been for the fact that my 2000 Toyota Camry does not have a CAN Bus. Since the CAN Bus was widely adopted into cars around 2006, my car is too old to have this technology. Hell, it has relatively few electronics compared to cars of today. It’s smart to check your assumptions before diving all in. However, all is not lost. We plan on attaching a tilt sensor to the steering wheel and getting steering wheel angle data through that. To do this, we plan on introducing a Raspberry Pi to the car that will not only take in the tilt sensor data, but also interface with cameras and process their data at hopefully near 30 FPS to provide a near real-time data pipeline.

More to come as we tinker around!

Understanding HL7 Structure

2017-03-29T11:00:00-07:00

What is HL7

HL7, more formally known as Health Level-7, is a standard used in the healthcare industry to transfer clinical and administrative data. Much like many SaaS services “speak JSON”, healthcare applications “speak HL7”.

Versions of HL7

There are two main versions of HL7: HL7 v2.x and HL7 v3.

HL7 v2.x

HL7 v2 was originally created in 1989, dubbed “Pipehat”. Since then, the standard has gone through many revisions, giving us:

HL7 v2.1
HL7 v2.2
HL7 v2.3
HL7 v2.3.1
HL7 v2.4
HL7 v2.5
HL7 v2.5.1
HL7 v2.6
HL7 v2.7
HL7 v2.7.1
HL7 v2.8
HL7 v2.8.1
HL7 v2.8.2

All HL7 2.x versions are backwards compatible! This is good news for new HL7 applications, as supporting HL7 v2.7 means that you would be supporting everything from v2.1-v2.7, but not anything above that.

HL7 v3

HL7 v3 was published in 2005. It is based around a formal methodology called HDF (ISO/HL7 27931 for those of you who love standards). It does some really fancy stuff that hasn’t really been adopted yet. For this reason, this post will focus mainly on HL7 v2.x.

HL7 Structure

For this section, we will use the sample HL7 message below (obtained from here) to analyze. Please note that this is dummy data and not actual PII (aka don’t sue me, thanks).

MSH|^~\&||.|||199908180016||ADT^A04|ADT.1.1698593|P|2.7
PID|1||000395122||LEVERKUHN^ADRIAN^C||19880517180606|M|||260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35 209^^M~NICKELL’S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O||(157)983-3296|||S||12354768|87654321
NK1|1|TALLIS^THOMAS^C|GRANDFATHER|12914 SPEM ST^^ALIUM^IN^98052|(157)883-6176
NK1|2|WEBERN^ANTON|SON|12 STRASSE MUSIK^^VIENNA^AUS^11212|(123)456-7890
IN1|1|PRE2||LIFE PRUDENT BUYER|PO BOX 23523^WELLINGTON^ON^98111|||19601||||||||THOMAS^JAMES^M|F|||||||||||||||||||ZKA535529776

To start off, let’s look at the image below:

The red boxes

Every section inside a red box is called a segment. Segments are groups of data that each contain related pieces of information. Each segment has a carriage return (\r) at the end of it to demarcate the end of a segment.
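
As a minimal sketch of the segment rule in Python: the helper name `split_segments` is made up for this post, and accepting `\n` and `\r\n` as well is my own assumption, since messages that have been pasted through a text editor often lose their bare carriage returns.

```python
def split_segments(message: str) -> list[str]:
    """Split a raw HL7 v2 message into its segments."""
    # HL7 v2 ends each segment with a carriage return (\r).
    # Assumption: also tolerate \n and \r\n line endings.
    normalized = message.replace("\r\n", "\r").replace("\n", "\r")
    # Drop the empty string left behind by the trailing terminator.
    return [segment for segment in normalized.split("\r") if segment]


raw = (
    "MSH|^~\\&||.|||199908180016||ADT^A04|ADT.1.1698593|P|2.7\r"
    "NK1|2|WEBERN^ANTON|SON|12 STRASSE MUSIK^^VIENNA^AUS^11212|(123)456-7890\r"
)
segments = split_segments(raw)
print(len(segments))     # 2
print(segments[1][:3])   # NK1
```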

The green boxes

Each segment begins with a three-letter string called a header. In this picture, everything in a green box is a header (MSH, PID, NK1, and IN1). Note that the MSH header is special. A segment with an MSH header will always begin an HL7 message, and that segment is thus called the message header.

Fields

Let us look at one segment for the time being, the PID segment. Within a segment, each field is separated with the pipe character ( | ), called a bar in HL7 terminology. Two or more bars in a row indicate an empty field. In the sample message, we can see that the first field (indexed by PID.1) holds a value of 1, PID.2 is empty, and PID.3 holds a value of 000395122. Since this is a version 2.7 HL7 message (indicated by MSH.12), we can view exactly what each PID field means here. However, fields get more and more complicated as we go along.
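
Field indexing can be sketched in a few lines of Python. The `get_field` helper is a hypothetical name, not part of any HL7 library. The one wrinkle is the MSH segment: there the field separator itself counts as MSH.1, which is why the version lands at MSH.12 even though it is only the twelfth piece after splitting on the bar.

```python
def get_field(segment: str, index: int) -> str:
    """Return field `index` (1-based, as in PID.3) of a single segment."""
    parts = segment.split("|")
    if parts[0] == "MSH":
        # In MSH the bar itself is MSH.1, so MSH.2 is parts[1],
        # MSH.3 is parts[2], and so on.
        if index == 1:
            return "|"
        position = index - 1
    else:
        # Everywhere else parts[0] is the header and field N is parts[N].
        position = index
    # Fields past the end of the segment are treated as empty.
    return parts[position] if position < len(parts) else ""


msh = "MSH|^~\\&||.|||199908180016||ADT^A04|ADT.1.1698593|P|2.7"
pid = "PID|1||000395122||LEVERKUHN^ADRIAN^C||19880517180606|M"

print(get_field(pid, 1))    # 1
print(get_field(pid, 3))    # 000395122
print(get_field(msh, 12))   # 2.7
```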

Components

If we look at PID.5 (LEVERKUHN^ADRIAN^C), we can see that there are carets (^) in the field value. Carets delimit components, which are the smaller units that make up a field. PID.5 describes the Patient Name and is split into three components here: a surname, a given name, and a middle initial. We can index these as PID.5.1, PID.5.2, and PID.5.3, respectively. Components can also have subcomponents, delimited by an ampersand (&), but these are rarely used.
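
Here is a small Python sketch of component and subcomponent access. The helper names are hypothetical, and the delimiters are hard-coded to ^ and &, which matches the sample message; a real message declares its delimiters in MSH.2, so a robust parser would read them from there.

```python
def get_component(field: str, index: int) -> str:
    """Return component `index` (1-based, as in PID.5.2) of a field value."""
    components = field.split("^")
    return components[index - 1] if 1 <= index <= len(components) else ""


def get_subcomponent(component: str, index: int) -> str:
    """Return subcomponent `index` (1-based) of a component value."""
    subcomponents = component.split("&")
    return subcomponents[index - 1] if 1 <= index <= len(subcomponents) else ""


patient_name = "LEVERKUHN^ADRIAN^C"      # PID.5 from the sample message
print(get_component(patient_name, 1))    # LEVERKUHN
print(get_component(patient_name, 2))    # ADRIAN
print(get_component(patient_name, 3))    # C
```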

Repeated Fields

Sometimes there are fields that can hold multiple values, and thus we must repeat them. An example of this is an address: one for home and one for work. Repeated fields are delimited with a tilde (~). For an example, let us look at PID.11 (260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35 209^^M~NICKELL’S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O). Notice that there is a tilde present here, indicating that this field contains two addresses. These two addresses are indexed as PID.11(1) and PID.11(2), respectively. Repeated fields are a powerful and flexible way to represent pieces of information that carry the same semantics.
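
Repetitions slot into the same splitting scheme: split on the tilde first, then split each repetition into components. A rough Python sketch (the function name is hypothetical, and the delimiters are again hard-coded rather than read from MSH.2):

```python
def split_repetitions(field: str) -> list[list[str]]:
    """Split a field into repetitions, each one a list of components."""
    return [repetition.split("^") for repetition in field.split("~")]


# PID.11 from the sample message.
addresses = (
    "260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35 209^^M"
    "~NICKELL’S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O"
)
repetitions = split_repetitions(addresses)

print(len(repetitions))     # 2
print(repetitions[0][0])    # 260 GOODWIN CREST DRIVE  -> PID.11(1), component 1
print(repetitions[1][4])    # 35200                    -> PID.11(2), component 5
```

A field without a tilde simply comes back as a list with one repetition, so callers do not need a special case for non-repeating fields.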

Conclusion

This is a simple walkthrough of the structure of HL7 messages. HL7 gets weirder and more complex as you read more into it (custom segments and groups, anyone?), but this level of understanding will allow you to perform the majority of HL7 tasks without a hitch. As a fun exercise, try writing a simple HL7 parser that implements this level of HL7 structure understanding (this is a good beginner programming project, for any beginners reading this post).

Idea for Efficient Cell Image Compression

2017-03-25T14:36:00-07:00

Premise

Pathology and histology slide images are taken with extremely high resolution cameras, resulting in:

High cost of storage
Bandwidth issues while transporting images

It is important to note that these images often rely on lossless compression, because any artifacting will result in a lowered ability for the doctor to give accurate results.

Example of a cell image.

Proposed Solution

In most regions of the body, neighboring cells look alike. I propose the following solution to compress the images:

Segment the image into distinct cells.
Take a dot product of the matrix of cells with itself (A^TA) to figure out which cells are most similar to other cells. I call these “reference cells”.
Store a set of deltas for all other cells in terms of rotations, translations, and other transformations.
Apply further compression using any standard algorithm to the deltas themselves.
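
To make the similarity step concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: it presumes segmentation has already produced equally sized, flattened cell patches, it normalizes them so the dot products behave like cosine similarity, it picks a single reference cell instead of several, and it uses a plain pixel difference as the delta, not the rotations and translations described above.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def gram_matrix(cells):
    """Pairwise dot products between cells.

    If the unit-length cells are the columns of A, entry (i, j) here
    equals entry (i, j) of A^T A.
    """
    unit = [normalize(cell) for cell in cells]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def pick_reference(cells):
    """Index of the cell with the highest total similarity to all cells."""
    totals = [sum(row) for row in gram_matrix(cells)]
    return max(range(len(cells)), key=lambda i: totals[i])


def pixel_delta(cell, reference):
    """Simplified delta: per-pixel difference from the reference cell."""
    return [c - r for c, r in zip(cell, reference)]


# Four made-up "cells", each flattened to four pixel values.
cells = [
    [10, 10, 0, 0],
    [10, 9, 0, 0],
    [9, 10, 0, 0],
    [0, 0, 10, 10],
]
reference_index = pick_reference(cells)
print(reference_index)                                     # 0
print(pixel_delta(cells[1], cells[reference_index]))       # [0, -1, 0, 0]
```

The delta for a cell that resembles the reference is mostly zeros, which is exactly the kind of data a standard compressor handles well.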

Example of “reference” cells.

A set of deltas in a binary format would be far smaller than the raw pixels for every cell, and should compress better as well.

I attempted to do something along these lines on GitHub here, but had to stop development due to time constraints. Hopefully I can continue this down the line!

Making an API for Pitt

2015-07-25T07:00:00-07:00

A journey into production code.
Note: This is a post from 2015 copied from my Medium. Posted here on March 25, 2017.

The University of Pittsburgh is one of the largest universities in the United States. We have a brand new CS/Psychology/Business building (heavy sarcasm), state-of-the-art facilities (as in they’re literally art buildings), and no API to access simple data that students could use to build better tools.

On Thursday night, I set out to change that.

Starting Off — Courses

The first thing I wanted to do was have a way to programmatically access course information. This should be simple. Pitt has courses.as.pitt.edu, which is a website for students to access course descriptions.

Pitt’s websites are built with ASP.NET, which is weird. I really don’t like it.

Pressing on a button takes you to http://www.courses.as.pitt.edu/results-subj.asp. The search parameters have to be passed somewhere, so it’s time to check the Developer Console.

Bingo. The query parameters are SUBJ and TERM. Easy enough. Using urllib2 and BeautifulSoup (❤), the listings page can be scraped.
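
For anyone following along today, here is a rough Python 3 sketch of that request (urllib2 only exists in Python 2; its Python 3 equivalent is urllib.request). The function names and the example values "CS" and "2161" are placeholders I made up, not confirmed subject or term codes, and the page may well have changed since 2015.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

RESULTS_URL = "http://www.courses.as.pitt.edu/results-subj.asp"


def build_course_url(subject: str, term: str) -> str:
    """Build the listings URL from the two query parameters."""
    return f"{RESULTS_URL}?{urlencode({'TERM': term, 'SUBJ': subject})}"


def fetch_course_listing(subject: str, term: str) -> str:
    """Download the raw HTML of the listings page (needs network access)."""
    with urlopen(build_course_url(subject, term)) as response:
        return response.read().decode("utf-8", errors="replace")


# Placeholder values, only to show the shape of the URL.
print(build_course_url("CS", "2161"))
# http://www.courses.as.pitt.edu/results-subj.asp?TERM=2161&SUBJ=CS
```

The HTML that comes back still has to be parsed, which is where BeautifulSoup earns its heart.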

Computer Labs

Things get a lot more interesting here. Computer labs are something controlled entirely by CSSD, Pitt’s IT/Technical division. CSSD does not make any APIs available, but they have signs in their computer labs showing the availability of each one. Obviously, they have to get their data from somewhere.

You can check lab availability currently at http://technology.pitt.edu/service/lab-line-check-lab-availability.

Obviously this is dynamic data. There has to be a data source, so it’s time to find it. Diving into the Developer Console, there isn’t actually much to see. CSS loading, JS loading, images, etc. However, an interesting call to simple_proxy.php is also made.

The url parameter is just a URL-encoded string that points to

http://www.ewi-ssl.pitt.edu/labstats_txtmsg/mdefault.aspx?labid=1&0=f&1=u&2=l&3=l&4=_&5=h&6=e&7=a&8=d&9=e&10=r&11=s&12==&13=1&14=&&15=f&16=u&17=l&18=l&19=_&20=s&21=t&22=a&23=t&24=u&25=s&26==&27=1.

Turns out anything after labstats_txtmsg isn’t necessary. Accessing http://www.ewi-ssl.pitt.edu/labstats_txtmsg/ gives you this:

Perfect. Some BeautifulSoup action later, I can now reliably provide computer lab status through the API.

Laundry Status

This was arguably the most fun and frustrating part to do.

Our laundry machines are controlled by Blackboard, and the ill-advertised UI to view the laundry machine status is run by a company called LaundryView. I had no idea about the UI until I went to https://m.pitt.edu/ and saw a place to view laundry status.

The UI itself is ugly and not mobile-responsive. That’s a shame, because most students use their phones for everything.

The URL for the above view is http://classic.laundryview.com/laundry_room.php?view=c&lr=2430136. If you click that link, nothing will pop up, but I’ll explain that later.

Looking at the Developer Console showed nothing interesting at first. I was not looking forward to scraping this monstrosity to get machine data. However, I noticed that this page refreshes every 60 seconds. Keeping the network tab open, I noticed two new calls being made every 60 seconds.

Gold. classic_laundry_room_ajax.php is just the same UI without styling. However, appliance_status_ajax.php gives this:

Awesome! Obviously not all the info I was looking for, but this is a URL that can be accessed without any authentication, so this was immediately tossed into a get_status_simple(self, loc) method with urllib2 and BeautifulSoup.

One thing to notice is that this is a classic view. Removing the classic part from the URL gives us a nicer UI! Still not mobile-responsive, and still not helpful.

This page makes two interesting calls immediately on load.

Dynamic room data! Gold! At this point I was very excited. This was the finish line. Turns out I was only halfway there. This is dynamicRoomData.php

At this point I thought of getting a degree in cryptography or something. However, keeping this open in one tab and the pretty UI in another, I started seeing some patterns (over the course of an hour or two). 1:0:0:1: seemed to indicate a machine was free, and 1::0:0: seemed to indicate a machine was in use or out-of-service. Turns out this was a great guess, but a very naive one. Time to code it up!

Road block. Turns out urllib2 was not able to read the page. cURLing the page returned nothing. At this point I realized some authentication was being passed along to this page. After a lot of digging, I realized that I just needed to pass a cookie to the page to allow for access. I got my PHPSESSID, hardcoded it into get_status_detailed(), did the parsing and filtering, and went to sleep, fully realizing that tomorrow morning, that cookie would no longer be valid.

Tomorrow morning, the method didn’t work. Great, time to find a way to reliably get a valid PHPSESSID. After a lot of digging and looking at the Developer Console, I realized I could make an initial call to http://www.laundryview.com/laundry_room.php?view=c&lr=2430151, get the Set-Cookie header, and use that to make my API call. Bingo. Now we had a reliable way of getting this data.
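
The cookie dance looks roughly like this in Python 3. This is a sketch using only the standard library, not the actual PittAPI code: the function names are hypothetical, the data URL is left as a parameter, and none of it has been tested against the live site.

```python
from http.cookies import SimpleCookie
from typing import Optional
from urllib.request import Request, urlopen

ROOM_URL = "http://www.laundryview.com/laundry_room.php?view=c&lr=2430151"


def extract_session_id(set_cookie_header: str) -> Optional[str]:
    """Pull the PHPSESSID value out of a Set-Cookie header, if present."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get("PHPSESSID")
    return morsel.value if morsel is not None else None


def fetch_with_fresh_session(data_url: str) -> str:
    """Get a fresh session cookie from the room page, then fetch data_url."""
    # Step 1: the room page hands out a new PHPSESSID in Set-Cookie.
    with urlopen(ROOM_URL) as response:
        session_id = extract_session_id(response.headers.get("Set-Cookie", ""))
    if session_id is None:
        raise RuntimeError("no PHPSESSID in the Set-Cookie header")
    # Step 2: replay that cookie on the request for the actual data.
    request = Request(data_url, headers={"Cookie": f"PHPSESSID={session_id}"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


print(extract_session_id("PHPSESSID=abc123; path=/"))   # abc123
```

Because a new session is requested on every call, nothing is hardcoded and nothing goes stale overnight.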

Later on, as in yesterday, I realized that the whole string (1:60:1:37211:60:1:0:0:1:) had meaning. The first cluster is 1 or 0, 1 being free, 0 being in use. If the machine is in use, the second cluster stands for the amount of time left. However, if the seventh cluster doesn’t exist (::), that means that the machine is out-of-service. This was all coded into the current implementation of get_status_detailed() of the LaundryAPI.
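
Those rules translate into a short parser. This is a sketch of my reading of the format, not the actual get_status_detailed() code: it assumes one colon-separated string per machine, counts clusters from 1, checks the out-of-service rule first, and makes no claim about the unit of the time value. The function name and the second and third example strings are made up.

```python
def parse_machine_status(line: str) -> dict:
    """Interpret one colon-separated status string for a single machine."""
    clusters = line.split(":")
    # A missing or empty seventh cluster ("::") means out-of-service.
    if len(clusters) < 7 or clusters[6] == "":
        return {"state": "out_of_service", "time_left": None}
    # First cluster: 1 is free, 0 is in use.
    if clusters[0] == "1":
        return {"state": "free", "time_left": None}
    # For a machine in use, the second cluster is the time left.
    time_left = int(clusters[1]) if clusters[1].isdigit() else None
    return {"state": "in_use", "time_left": time_left}


print(parse_machine_status("1:60:1:37211:60:1:0:0:1:"))
# {'state': 'free', 'time_left': None}
print(parse_machine_status("0:25:1:37211:60:1:0:0:1:"))
# {'state': 'in_use', 'time_left': 25}
print(parse_machine_status("0:25:1:37211:60:1::0:1:"))
# {'state': 'out_of_service', 'time_left': None}
```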

Future Goals

I have some things I’m working on. My goals are to get a way to access the People directory at Pitt, as well as a way to get housing details (with MyPitt user authentication).

I also want to host the Python API on a web server so people can simply call an endpoint to get this data.

The Pitt API is available here: https://github.com/RitwikGupta/PittAPI

Disclaimer: This write-up has skipped a lot of the dead ends I faced for brevity.