When friends ask me about living in New York, I usually answer: “there are pros and cons to everything, and if you’re willing to take the cons to get the pros, it’s fantastic.” It may seem strange to say that maxim could relate to the exciting field of SaaS data management… but there it is.
If you’re a reader of this blog, by now you know that we here at CG love the cloud and love all that SaaS vendors can bring to the table. But all this “not-reinventing-the-wheel” stuff leaves many of us with the curious question of how to access data that’s been diligently sent off elsewhere. Sure, most SaaS vendors provide excellent reporting for the data they know. But sometimes we want to do reporting across vendors, across data sets, or even just different reporting than the vendor easily allows.
Enter our “CG Platforms” internal task force, powered by our Enterprise Architecture group. Our team’s mission is to see what we can do to free up our data. For the long haul, the answer is a full Master Data Management (MDM) architecture with a fully functioning middle tier. In some cases, though, you just need something small that doesn’t require diving into the deep end of learning a vendor’s API. So, how do you balance building something small while keeping an eye on full MDM as the end goal?
Not shockingly, success in that balance can look different depending on the vendor, the needs, and even the time we want to spend. In a series of “Data Freedom” blog posts, we’ll take a look at a few cases of how that’s looked for us and which technologies we’ve used along the way.
First up, the appropriately named Vendor 1, on the “quick and useful” end of the spectrum…
Vendor 1: For this vendor, used by our Accounting department, the data is easy to access using its internal reporting capabilities. It also provides a feature to output nicely-formatted PDF reports. Those are helpful features for day-to-day use, but they don’t help us with any historical backup or reporting. What will we do the day Alexis de Tocqueville sets off to write the great history of Control Group? (We can dream.)
So how did we back up these dynamically created static files in a way that’s easily accessible, lightweight, and updateable? The vendor’s own reporting was a great start: we could pull a listing with information on these reports. But for Alexis and his ilk, we wanted the prettied-up assets themselves… and we were certainly not going to click through every possible one manually.
Faced with this issue, we turned to an old favorite, Selenium, which you may know and love as an automated regression-testing framework. But we’ve actually been able to use Selenium for lightweight browser automation beyond just testing. For these purposes, it gave us out-of-the-box tools to do a lot of the heavy lifting. Once we imported and filtered the vendor’s regular report of the data we needed, we took a look at what Selenium could get us, with a major eye towards being able to re-run the script to update those documents over time.
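That import-and-filter step is simple enough to sketch. Here’s a minimal Java version, assuming the vendor’s report exports as a CSV; the column layout, the “FINAL” status filter, and the ReportListing and ReportRow names are all hypothetical stand-ins, not the vendor’s actual format:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: load the vendor's CSV report export and keep only the
// rows we want to back up. All column positions are hypothetical.
public class ReportListing {

    public record ReportRow(String id, String customer, String date) {}

    public static List<ReportRow> load(Path csv) throws IOException {
        List<ReportRow> rows = new ArrayList<>();
        List<String> lines = Files.readAllLines(csv);
        for (String line : lines.subList(1, lines.size())) { // skip header row
            // Naive split; a real CSV with quoted fields needs a proper parser.
            String[] cols = line.split(",");
            // Hypothetical filter: only finalized reports get backed up.
            if (cols.length >= 4 && "FINAL".equals(cols[3].trim())) {
                rows.add(new ReportRow(cols[0].trim(), cols[1].trim(), cols[2].trim()));
            }
        }
        return rows;
    }
}
```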
We did it in three easy steps (a sketch of the whole flow follows the list):
- First, we set up cookie management (via the CookieStore class) to handle the security side of things. That allowed us to programmatically log into our account within an automated browser session.
- From there, we wanted the program to ask the system to pull up a long list of PDFs. For that, we used the Java HTTP client libraries with the information we pulled from the vendor’s own report, manipulating a variable URL to send into our “browser session.” The effect was just like looping through PDF reports as if they were in regular browser tabs.
- Finally, we had to put those files somewhere. Organizing the files was just a matter of saving them out to a folder structure that made sense for future “manual retrieval” (by humans, not robots). Just as we easily pulled report information to decide which URLs to call, we pulled ID numbers and dates for each file. Then the program could save the files out to folders with names that made sense, e.g. customer names, report IDs, and dates. On an updated run in the future, the same system will file the new reports alongside the old.
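Here’s a minimal end-to-end sketch of those three steps in Java. One wrinkle worth naming: Selenium exposes session cookies through driver.manage().getCookies(), while the CookieStore class itself comes from the Apache HttpClient library, so the sketch copies cookies from the browser session into an HttpClient that does the actual downloading. Every URL, element locator, and credential below is a placeholder, and it reuses the hypothetical ReportListing loader sketched above; treat it as the shape of the approach, not our production script.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class ReportBackup {

    public static void main(String[] args) throws Exception {
        // Step 1: log in through a real automated browser session so the
        // vendor's auth flow (forms, redirects) is handled for us.
        // Login URL and element names are placeholders.
        WebDriver driver = new FirefoxDriver();
        driver.get("https://vendor.example.com/login");
        driver.findElement(By.name("username")).sendKeys("backup-user");
        driver.findElement(By.name("password")).sendKeys(System.getenv("VENDOR_PW"));
        driver.findElement(By.name("submit")).click();

        // Copy the authenticated session's cookies into an HttpClient
        // CookieStore so the downloads happen outside the browser.
        BasicCookieStore cookieStore = new BasicCookieStore();
        for (Cookie c : driver.manage().getCookies()) {
            BasicClientCookie copy = new BasicClientCookie(c.getName(), c.getValue());
            copy.setDomain(c.getDomain());
            copy.setPath(c.getPath());
            cookieStore.addCookie(copy);
        }
        driver.quit();

        try (CloseableHttpClient http = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build()) {
            // Step 2: loop over the rows pulled from the vendor's own
            // report, building each PDF's URL from its ID (the URL
            // pattern is a placeholder).
            for (ReportListing.ReportRow row :
                    ReportListing.load(Paths.get("vendor-report.csv"))) {
                HttpGet get = new HttpGet(
                        "https://vendor.example.com/reports/" + row.id() + ".pdf");
                try (CloseableHttpResponse resp = http.execute(get);
                     InputStream body = resp.getEntity().getContent()) {
                    // Step 3: file each PDF under customer/date so future
                    // runs land new reports alongside old ones.
                    Path dir = Paths.get("backups", row.customer(), row.date());
                    Files.createDirectories(dir);
                    Files.copy(body, dir.resolve(row.id() + ".pdf"),
                            StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

The REPLACE_EXISTING option is what makes re-runs safe: an updated run refreshes files already on disk, while reports with new IDs or dates simply land in new folders alongside the old ones.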
Three functions. Half a day. Repeatable backups. Have at it, historians!