Coding

What if collecting data center telemetry was a snap?


Believe it or not, this is actually the first true big blog post since I announced I was joining Intel. That is 1.5 years of being buried over my head in all the cool things Intel does in the data center. I could write a whole different post just on how differently I see the world of computing now. But I will save that for another day.

Today is a special day, because in the time I have been at Intel I have been busy. Busy learning, building a team, and working on the problem space around the next generation of cloud computing within our data centers.

I am part of a team at Intel that has a specific vision for where we believe the next level of cloud evolution emerges. We call this Intelligent Resource Orchestration (IRO).

IRO is the idea that the cloud components running workloads and consuming hardware can do so in a highly automated way. Less human interaction, more use of modern patterns to achieve higher density, scale, and agility. This model is made up of a few primary domains that interact with each other:

  • Watching – The collector of information. The idea that all of the hardware and software state of the components and services can be consumed. This is not necessarily a single thing, but a practice of making data about the resources (like servers) available for wide consumption.
  • Deciding – The decision maker. This domain can have a multitude of purposes but serves them in a specific pattern. Something happens and it decides if something else needs to happen. If a resource is requested, a system dies, or the load pattern on one cluster changes, then something may need to happen. It is the idea that a computer has the data provided by the Watcher and the context to make a decision to change the state of the systems. This is where things like schedulers, decision engines, orchestration policies, and more live.
  • Acting – The doer. This is the concept that any state that can change should be changeable in an automated fashion. Exposing good APIs with good patterns is key here. The idea is to reduce human intervention and directly expose the ability to change things so the Decider can choose what happens next.
  • Learning – The iterator. This is a scary new concept to some. But with the great expanse of computing power and modern innovations in both machine and deep learning, we are approaching a period of time where the computer is a useful tool for recognizing patterns in a large set of data. We need a specific domain wired to the Watch, Decide, and Act domains that looks for opportunities to evolve this loop the next time around. This keeps with the concept of removing the human from the loop. What if the computer can recognize that things are changing and make recommendations to improve the next decision, the telemetry to watch, or which APIs should be called upon for action?

Here at Intel we are looking for ways to push each of the domains above closer to the reality of today’s data centers.

As we did this we recognized one key thing. All of the above sits on the foundation of data. You cannot make good decisions on where to place the next workload if the Decider is blind to the workloads running. Or maybe it is blind to the truth of how those workloads interact and just makes poorer scheduling decisions. Valuable, accurate, and consumable data from the hardware and software is the key first step to the decision, iteration, and change needed for IRO.

Internally we looked around at the data Intel Architecture provides and it is vast. We also recognized that we should expand this even more and look for ways to extend with more data into new areas. But there are better questions for right now. We have a lot of telemetry: specific, contextual data about our systems. But is it easy to consume for an IRO model? Could consuming this telemetry from devices, servers, and more be even easier? Would that help in building better decisioning systems for the cloud?

We looked at a lot of existing open source tools and we loved some of them for specific things. We looked at a lot of internal Intel telemetry tools and loved them too (did you know Intel runs some of the largest engineering clusters in the world?). But each tool was focused much closer to either a subset of the problem space or utilized mechanisms that, while friendly on a few nodes, were less friendly on thousands of nodes. None of them seemed to answer the question of what we wanted to achieve for IRO. Which brings up why I came to Intel in the first place: cloud innovation.

The big dirty secret to innovation is failure. It always starts with a question like “What if?” or “Is it possible?”. Sometimes it just ends with “No, I can’t” or “This is not important right now”. But every once in a while it ends with “Yes, we can”.

My job at Intel is running a smart and collaborative group of engineers working on the emerging edge of IRO. Specifically, we are working on how to push the needle on cloud orchestration, scheduling, and telemetry. A part of the way we run this team is the idea of “fail fast”. Things in software move so quickly now that sometimes it is difficult to build successful solutions if it takes too long to emerge from incubation. Inside Intel SDI we decided that a percentage of our work should be trying something with the idea that failure is ok. Instead of working 5-7 years to research, design, build, and release, we would look to try smaller things quickly and get them in front of people that would care quickly. If they work, great! If they don’t, start over! This is not something you want to do on a large scale. But examples of this are out there, from Google to Netflix to GitHub. The idea that creativity sometimes spawns better ideas than heavy planning is important. And making smaller risky bets alongside the big ones might be a good idea.

Based on the idea of failing fast, the question the team asked ourselves was: “What would an operational framework focused on making the consumption of telemetry much easier look like?”

This question was the genesis of the project we took on in the first part of this year. We set about the path of building a new telemetry framework with the purpose of making this problem easier for a model like IRO. It was a hard road and one that we were blazing for the first time within our organization. At the end of our alpha we took this project and made it open to Intel internal departments. We call this step an “internal open sourcing”. Here was our moment of truth. Did this approach make sense or was it a “fail fast” project?

Well, I am sure you can guess I would not be writing a blog post about a failed project. Our internal “open sourcing” resulted in internal projects and teams talking to us about integrating it into their solutions, an Intel Labs group contributing to it and using it in their research, and some very positive responses from a few customers. So we moved a little further along and decided that this little innovation might be worth putting out into the open for everyone. Because of this, it is my privilege to introduce you to a new software framework from Intel we call snap.

Snap is a telemetry framework written in Golang for the purpose of making the consumption of data center telemetry easier. Today, Intel is open sourcing snap under an Apache 2 License for everyone. You can find snap on Github: https://github.com/intelsdi-x/snap

Let’s get to the good part: what snap does.
Snap provides the ability to do a few key things:
  • Define telemetry workflows and run them on a schedule
  • Provide an open plugin model decoupling actions in the workflow from running the workflow
  • Several operational improvements inspired by modern DevOps tool sets
  • A strong focus on exposing all state and commands via API

The point of snap is to get something out of a system and sink that data somewhere it is needed. A key concept to this is the idea that the telemetry is often reused. Obtaining telemetry like VM saturation or CPU usage is valuable to operational teams, systems concerned with accounting and chargeback, and schedulers looking to place the next VM. This idea of reuse was influential in how we implemented the snap plugin model.

In snap every important system action is performed by a plugin. You have three types of plugins:

  1. Collector – This collects telemetry from *something* and forwards it on.
  2. Processor – This transforms the telemetry in some way. Encrypting, changing an object model, serializing, machine learning at a node level, or even allowing a policy engine to be injected.
  3. Publisher – This is the plugin that sinks the data into another system that consumes the telemetry. This can be a common messaging system or database like RabbitMQ, Kafka, MySQL, or InfluxDB. This could also be things like email, files, or custom publishing to a private API.

What is important in the snap plugin model is that each plugin operates independently and that the snap framework allows you to wire these together in multiple ways. You can use collector plugins to grab specific sets of telemetry and forward them through a processor that *learns* what normal is and filters out noise, which in turn publishes the filtered data into RabbitMQ for pickup by another system. And at the same time you can forward the same collector telemetry directly to InfluxDB to populate an operational dashboard. The goal with snap is to make the description of all of this declarative.
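
To make that wiring concrete, here is a rough sketch in Go. To be clear, these interfaces and types are mine for illustration only, not snap’s actual plugin API (see the plugin authoring guide for the real thing):

package main

import "fmt"

// Metric is one telemetry item, addressed by a namespace.
// (Illustrative only; snap's real types live in the snap repo.)
type Metric struct {
    Namespace string // e.g. "/intel/psutil/load/load1"
    Value     float64
}

// The three plugin roles, reduced to their essence.
type Collector interface{ Collect() ([]Metric, error) }
type Processor interface{ Process([]Metric) ([]Metric, error) }
type Publisher interface{ Publish([]Metric) error }

// fakeCPU stands in for a real collector plugin.
type fakeCPU struct{}

func (fakeCPU) Collect() ([]Metric, error) {
    return []Metric{{Namespace: "/intel/cpu/usage", Value: 42}}, nil
}

// noiseFilter stands in for a processor that "learns" what normal is;
// here it just drops samples below a threshold.
type noiseFilter struct{ threshold float64 }

func (f noiseFilter) Process(in []Metric) ([]Metric, error) {
    var out []Metric
    for _, m := range in {
        if m.Value >= f.threshold {
            out = append(out, m)
        }
    }
    return out, nil
}

// printSink stands in for publisher plugins like RabbitMQ or InfluxDB.
type printSink struct{ name string }

func (p printSink) Publish(in []Metric) error {
    for _, m := range in {
        fmt.Printf("[%s] %s = %v\n", p.name, m.Namespace, m.Value)
    }
    return nil
}

func main() {
    metrics, _ := fakeCPU{}.Collect()

    // Branch 1: filter noise, then publish to a queue for another system.
    filtered, _ := noiseFilter{threshold: 10}.Process(metrics)
    printSink{name: "rabbitmq"}.Publish(filtered)

    // Branch 2: the same collected telemetry goes straight to a dashboard DB.
    printSink{name: "influxdb"}.Publish(metrics)
}

In real snap the equivalent of this main function is a task: a declarative description that selects metrics from the catalog and names the processors and publishers to route them through.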

Plugins in snap are also loaded dynamically at runtime. This was a big requirement we wanted to meet to make snap more operationally friendly and to enable the clustering I write about a bit further down. In snap all operations for plugins are completely dynamic. You can load new collectors, processors, or publishers at any time. Likewise, you can unload any of these at runtime. No restarting the service or having to implement configuration management on top of your telemetry daemon.

A part of the way we accomplish this is with the concept of the metric catalog. When you load a collector into snap, its metrics (unique telemetry items) are added to a single catalog. It is this catalog that is selected against when you create a task and run a workflow. This abstraction means that you are not selecting the “Intel CPU” plugin but instead the specific Intel metrics you want to collect. This is important because we support the ability to dynamically upgrade plugins while a cluster of snap daemons is running. Metric selections in your workflow will automatically use the newest plugin version implementing that metric. This means that as plugin creators (Intel is one, and hopefully others will join) release new plugin versions, the customers using them can pull them downstream and upgrade without service disruption. This goes for processor and publisher plugins also. And we make this even easier with the tribe management I walk through below.
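
The version-resolution rule is easy to picture. A hedged Go sketch of the idea (the real metric catalog tracks far more than a name and a version):

package main

import "fmt"

// pluginRef identifies one loaded version of a collector plugin.
type pluginRef struct {
    Name    string
    Version int
}

// catalog maps a metric namespace to every loaded plugin version that
// can collect it. Illustrative only.
var catalog = map[string][]pluginRef{
    "/intel/cpu/usage": {
        {Name: "snap-collector-cpu", Version: 1},
        {Name: "snap-collector-cpu", Version: 3}, // loaded later, no restart needed
        {Name: "snap-collector-cpu", Version: 2},
    },
}

// resolve picks the highest loaded version that implements a metric,
// which is how a running task silently picks up upgraded plugins.
func resolve(ns string) (pluginRef, bool) {
    refs, ok := catalog[ns]
    if !ok || len(refs) == 0 {
        return pluginRef{}, false
    }
    best := refs[0]
    for _, r := range refs[1:] {
        if r.Version > best.Version {
            best = r
        }
    }
    return best, true
}

func main() {
    if ref, ok := resolve("/intel/cpu/usage"); ok {
        fmt.Printf("collect via %s v%d\n", ref.Name, ref.Version) // v3 wins
    }
}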

Snap already has a host of plugins available with today’s release including:

Collector:

Ceph
Docker
Facter
Libvirt
Intel NodeManager
Intel PCM
Linux Perfevents
PSUtil
Intel SMART (disk)

Processor:

Movingaverage

Publisher:

SAP HANA
InfluxDB
Kafka
MySQL
OpenTSDB
PostgreSQL
RabbitMQ
Riemann

And we have plugins in flight right now for Ethtool, IOstat, Nova, Open vSwitch, and OSv. For a complete list see our Plugin Catalog, which we will keep updated as things develop.

Plugin authoring itself is built to be easy to accomplish with some Golang savvy. See our authoring guide and our best practices document.

Right now plugins are normally written in native Golang. But we also support a JSON-RPC interface for writing plugins in any language. We have plans for writing plugin client libraries in Java, Python, Ruby, and C++ soon. Plugins are written and compiled separately from snap itself. This means you can choose your own license or even keep your plugins private or proprietary if you prefer – we prefer open sourcing :)
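
If you have never touched JSON-RPC, the mechanism fits in a few lines of Go. Treat this as the general shape only; the exact wire contract snap expects from plugins is defined in the plugin docs, and the service name here is hypothetical:

package main

import (
    "fmt"
    "net"
    "net/rpc"
    "net/rpc/jsonrpc"
)

// CollectorRPC is a hypothetical service a plugin might expose.
type CollectorRPC struct{}

type CollectArgs struct{ Namespace string }
type CollectReply struct{ Value float64 }

func (CollectorRPC) Collect(args CollectArgs, reply *CollectReply) error {
    reply.Value = 42.0 // a real plugin would read from the system here
    return nil
}

func main() {
    srv := rpc.NewServer()
    srv.Register(CollectorRPC{})

    // net.Pipe keeps the demo self-contained; a plugin would listen on
    // a socket the framework hands it.
    cliConn, srvConn := net.Pipe()
    go srv.ServeCodec(jsonrpc.NewServerCodec(srvConn))

    client := rpc.NewClientWithCodec(jsonrpc.NewClientCodec(cliConn))
    var reply CollectReply
    if err := client.Call("CollectorRPC.Collect", CollectArgs{Namespace: "/intel/cpu/usage"}, &reply); err != nil {
        panic(err)
    }
    fmt.Println("collected:", reply.Value)
}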

Controlling snap is just as important as what it can do, if not more so. To start off, we decided that all operations and data from the snap daemon would be exposed over a REST API. Anything snap can do can be controlled via this API. We provide a CLI tool called snapctl for driving the snapd REST API from the command line. The choice to use REST was important, as we want snap to be something another control system can manage or integrate easily into existing customer solutions. Snap does not require complex configuration management work to control service restarts or changes. Everything is dynamic and available over the API.
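
That design means anything that speaks HTTP can drive snap. A minimal sketch in Go; the port and route below are assumptions on my part, so check the REST API documentation in the repo:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Assumed defaults: snapd listening locally, v1 REST routes.
    // Verify against the API docs in the snap repo.
    resp, err := http.Get("http://localhost:8181/v1/plugins")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // JSON listing the loaded plugins
}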

We also made a strong effort to secure snap for this first release. We provide the ability to cryptographically sign compiled plugins and verify the signatures on the snap daemon before loading or running. We encrypt the communication channel between plugins and the daemon. And we provide the option to secure the REST API endpoint for snap.
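
The signing flow boils down to verify before load, refuse on mismatch. Here is a toy version of that pattern using stdlib ed25519. This is not snap’s actual scheme or key format, just the shape of the check:

package main

import (
    "crypto/ed25519"
    "fmt"
    "os"
)

// loadPluginIfSigned refuses to load a plugin binary whose detached
// signature does not verify against a trusted key.
func loadPluginIfSigned(path string, pub ed25519.PublicKey, sig []byte) error {
    bin, err := os.ReadFile(path)
    if err != nil {
        return err
    }
    if !ed25519.Verify(pub, bin, sig) {
        return fmt.Errorf("refusing to load %s: bad signature", path)
    }
    // ...hand the verified binary to the plugin runner here...
    return nil
}

func main() {
    pub, priv, _ := ed25519.GenerateKey(nil) // nil means crypto/rand
    os.WriteFile("demo-plugin", []byte("pretend plugin binary"), 0o755)
    sig := ed25519.Sign(priv, []byte("pretend plugin binary"))
    fmt.Println(loadPluginIfSigned("demo-plugin", pub, sig)) // <nil>
}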

But the CLI and API are not the only tricks in the bag when it comes to snap. One of the key needs of the IRO model is that as the pool of resources grows, the work to manage and maintain it does not become too cumbersome. To this end we planned from day one to use novel ways to control snap and make management easier.

Within snap we implement this operational automation using a feature we call tribe. Tribe allows you to cluster a group of snap nodes into a “tribe”. A tribe can then implement a feature we call an “agreement”: specific behavior the tribe agrees on, like running the same plugins or running the same set of tasks. This allows an end user to take an entire compute farm of snap-enabled nodes, group them into a tribe, and implement agreements that they will all run the same tasks and plugins. If a user were to load a new plugin into any member of the tribe, the other members would recognize that they need to load this plugin also and begin to share the plugin around among themselves. The same goes for running a task. Creating a task to run a specific workflow against any member of the tribe implements that task against all of them. There is no master, so requests can go to any node in the tribe.
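
To give a feel for the behavior, here is a toy model of an agreement spreading a plugin load with no master. The real tribe feature is built on proper membership gossip and handles failure and task state; this only shows why any node can take the request:

package main

import "fmt"

type member struct {
    name    string
    plugins map[string]bool
    peers   []*member
}

// loadPlugin loads locally, then propagates to peers that lack it.
// No master node is involved.
func (m *member) loadPlugin(p string) {
    if m.plugins[p] {
        return // already loaded; stops the propagation
    }
    m.plugins[p] = true
    fmt.Printf("%s loaded %s\n", m.name, p)
    for _, peer := range m.peers {
        peer.loadPlugin(p)
    }
}

func main() {
    a := &member{name: "node-a", plugins: map[string]bool{}}
    b := &member{name: "node-b", plugins: map[string]bool{}}
    c := &member{name: "node-c", plugins: map[string]bool{}}
    a.peers = []*member{b, c}
    b.peers = []*member{a, c}
    c.peers = []*member{a, b}

    // Hitting any one member is enough; the agreement spreads it.
    b.loadPlugin("snap-collector-psutil")
}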

The end result is that the operational cost of loading a plugin or creating a workflow task for one snap node is close to the same as it is for 100 or 1,000 snap nodes. For more info we have specific tribe examples and information here. We are looking to expand tribe agreements to also cover things like configuration, logging, and more.

There is no possible way I could go over all the features of snap in one blog post. I could try to cover stuff like the extensible scheduling, workflow routing, metric query support, and more. But the team has done an amazing job trying to capture all of this in the documentation for snap.

I will mention that this is just the start for us. There are several core features that have not been implemented yet in this beta but are on the roadmap for the next big release. These include:

  • Windows Support
    • A big reason to choose Golang was its cross-compilation abilities. We want to extend all the goodness of snap to the Windows world and not just Linux and OS X.
  • Distributed Workflows
    • Right now workflows are performed on a single node at a time (collect => process => publish). We have the ability with tribe to discover snap nodes and allow workflows to operate across them. When we built each component of snap we heavily decoupled the core modules, allowing us to enable specific roles later. We can eventually stand up clusters of just collectors that send their information to a cluster of processors, which in turn send to a small group of nodes that publish into a system. This flexibility means we can reduce the impact on workload nodes and fully utilize specific hardware for things like encryption or machine learning.
  • Event Subscription
    • Right now the telemetry collected by plugins is gathered on a schedule. We want to add the ability for events to trigger the same workflows rather than having them scheduled. This is an important feature for more performant monitoring.
  • Routing expansion
    • Under the covers of snap is the ability to load balance across multiple plugins. We can use this to enable greater scale for future snap plugins.

In addition we will be releasing a host of plugins for exposing Intel Architecture telemetry. We have our sights on powerful CPU, memory, networking, and specific workload metrics. Intel and the SDI team are committed to exposing as much as we can in 2016. We already have internal customers at Intel looking to utilize it for their own needs.

With this open source release snap is in beta. We are looking for a few things now that we have it in the open:

  1. Comments/issues/feedback/bugs – If you find a bug we will fix it. If you want a feature we will look into it.
  2. Maintainers – Long term we would prefer this project be driven by a mix of people trying to solve the problems in this space. We want you to help. If you are interested in becoming a maintainer and have the chops, reach out to one of the current ones on the README.
  3. Plugins – Build something. If it works, tell us about it and we can add your repo to our Plugin Catalog. What if snap could collect from storage arrays, VMware clusters, or sink data into New Relic or others? We can do a lot. But the ecosystem can do so much more.
  4. Examples/Blog Posts/Demos – If you do something cool, we will link to it.

So that is it. It is out there and ready for you to play with at home or work :)

We are excited to try and be a part of enabling the Intelligent Resource Orchestration model for our customers. And this is just the beginning. We are already down the path on some new questions around the IRO problem space so stay tuned for more things in 2016. And of course we are hiring. If this project or something like this would be interesting to you and you like working in healthy collaborative teams of good people, give us a ping at sdirecruiting@intel.com.

.nick
Life/Work

I’d far rather be happy than right any day

2014 was a great year. Hell, 2013 was too.

I worked with the best people I have met in my life. Some were friends before I worked with them. Some were friends after.

I wrote near a hundred thousand lines of code. I argued in front of whiteboards of detail 40 feet long. I beat myself up about times I was too harsh to those that were just earnest in their effort.

I spoke with passion about things I cared about in front of thousands of people. They listened and responded in kind. I found kindred spirits.

I met women and men that do the daily work. That make the magic happen though they don’t get the Twitter followers or the acclaim to show it. I personally dealt with trying to resolve how to make things fair to them. Why do people follow or care about me when I am nothing compared to them?

I reconnected with my heroes. The men and women who built the foundation on which I stand. I got reinvigorated by the passion that we can be better than we were. The way we work on things, the way we communicate, the very way we look at the world of technology can be better. I am inspired.

I met great engineers. People who are leaders and innovators. Selfless, sacrificing, dedicated, and honest. People that get less credit but are valued above the gold of a good marketing campaign.

I work for a company that is a group of patient and persistent thinkers. I miss the company full of brothers and sisters at the same time.

I discovered that in all the long hours of work and effort my daughters and son grew older. My world changed around me. I did good and bad at the same time.

2014 was a good year.

2015 will be better.

Cloud, Fixes, Tools

Docker & Vagrant & VMware Fusion bug

I was playing around with my new Mac this week since I started at Intel on Monday. I was starting to play with the Docker provider for Vagrant when I ran into some issues with it.

For those that aren’t up to speed: Vagrant now supports Docker as a provider via plugin. If you are running on a system without native Docker support, Vagrant will bootstrap a box that is native and proxy Docker commands into it. Rather slick if you ask me.

The problem is there appears to be a bug in a recent release of the vmware_fusion provider plugin which is breaking this functionality with VMware Fusion. By default the Docker provider uses a box hosted on Vagrant Cloud called mitchellh/boot2docker. This box is a lightweight VM with an ISO mounted that is used to live boot. The bug has something to do with mucking up the IDE/CDROM binding in VMware Fusion VMX files. I didn’t dive into the bug for details.

I did search for workarounds and found some. But they were all a bit specific and fragile. Instead of using them I decided to use the built-in magic that Vagrant already provides with their Docker provider that allows you to select the box you want to use. So instead of using the currently broken mitchellh/boot2docker (on vmware_desktop) we will select a different box.

We need a box with native Docker support. An Ubuntu box serves nicely, but you can use your flavor of distro. There is the option to choose a box with a built-in Docker stack. I chose to use Vagrant’s Docker installer functionality to automatically install the latest Docker. This takes a little longer upfront but ensures I have the new hotness.

I uploaded my setup to Github here: https://github.com/lynxbat/boot2docker-fixie-example

The instructions from the README:

git clone git@github.com:lynxbat/boot2docker-fixie-example.git
cd boot2docker-fixie-example/docker_container
vagrant up

This will bring up the Vagrant host VM (box-cutter/ubuntu1404), build a Docker image, and then run a container exposing port 80 and mapping it to port 8080 on the host VM. Then after the container comes up just run:

rake demo

This opens the Apache2 default page in your browser (host_vm:8080). This is Apache2 running in a container on top of your host VM. Hopefully this example helps someone else get Vagrant and VMware Fusion working with Docker again while the bug is being fixed.

PS – there is a weird bug where you get this:

Stderr: 2014/06/12 13:49:01 Get http:///var/run/docker.sock/v1.12/containers/json?all=1: dial unix /var/run/docker.sock: permission denied

If you run ‘vagrant up’ a second time it usually fixes itself.

Cheers,

.nick


Commentary

Sell, Execute, Grow, Iterate

Sell the vision well and the doors will open. Very few good ideas get to happen without someone selling the idea to someone. It is the vision of it. The story that is told. The belief in something better that greases wheels. As a recent boss loved to say:

You can do anything you can justify

One of my favorite things to say is: execution is all that actually matters. Say what you will do (sell it), then do it. An inability to move from point #1 to point #2 is, in my personal belief, the greatest skill gap in technology today. It is one thing to think of great ideas. It is more important to think of realistic ways to make that idea become reality. It is interesting to see how experience matters here. Sometimes those with experience in the pain of execution not happening are wise enough to be ready the next time around. Specialized skills in organizing others, communicating across disciplines, translating the vision to implementors, and the ability to persistently push to deliver are what make amazing executors.

The tail end of sell and execute is ownership. Breaking the ice on a new idea and making the impossible possible comes down to awareness that everything changes. The same skills that are awesome for selling and execution may not translate to supporting and growing. In the execution phase the pain of missing skills or missed concerns can be smoothed over with super-heroic effort. But long-term growth and stability come from feeling pain. You cannot properly fill a hole in the armor if you do not know it exists. Growth is the addition of people, pieces, and process that make something small and shiny evolve into something large and powerful. Growth happens out of pain. In sell and execute we focus on what is possible and ignore the pain. With growth we focus on the reality of now and the steps needed to permanently solve pain and allow for so much more.

Growth can lead to stagnation. Laying down roots and becoming a tree people can depend on is important. But a tree cannot change location easily. Changes will be required and some of these changes require large effort. They will look scary to the big tree that has grown up. Which means someone needs to sell the reason why change is important and set a vision for where to go next.

Do not attempt to execute before you sell. Do not grow before you execute. The grower looks to the seller. The seller to the executor. The executor to the grower. And so we iterate towards the impossible again.

.nick

Life/Work

Goodbye Zombies

It has been amazing. That is the only way to describe it.

Project Zombie was an experience filled with amazing challenges and people. And for me this season is over.

I will be leaving VMware soon to take a position at Intel as a Principal Cloud Architect. I will be focusing on helping them develop some amazing new things in distributed automation, DevOps, and more.
This is not just a big deal for me. I will also be moving the Weaver clan from our home here in Texas. We will be settling into the Portland, Oregon area and starting a brand new adventure. We recently took a trip to Oregon over Spring Break and fell in love with the state. My family is excited to start exploring a new part of the United States.

The team will go on without me. I am proud to have been involved during this period of building out vCHS. And I feel lucky I was able to help assemble this mutant army of coding cloud zombies. The greatest part of this transition is knowing that the team will easily fill the small void I leave behind because of how awesome they are at what they do. The team culture of strong collaboration, ownership in what you do, and thinking out of the box was something brought by each member and it will live on.

I have been in the EMC/VMware family for over four years. And it is not without some sadness that I move over to Intel. I have nothing but good memories and wicked skills thanks to the amazing opportunities I had as a part of this family.

The only thing I can promise is that you will see more from Nick Weaver.

Stay tuned

.nick

Commentary

The problem with musical chairs

You walk into a room. In that room are your coworkers. Some are good friends. Most are not.


You look around the room and see chairs all around. And in the middle is a table with a set of speakers. A person walks in and announces to the whole group that there is a critical need for everyone to sit down in the room. ‘But first some music will play and everyone will circle the room,’ they say kindly. ‘Then when the music stops, everyone in the room is personally responsible for finding themselves a seat. It is most important that everyone has a seat, and through cooperation we will accomplish that.’ And with that last statement they exit the room.

Somehow this sounds vaguely familiar to you. But you shrug your shoulders and wait.

The music begins, a bright marching chorus filled with horns and stringed instruments. You and the rest of the group begin awkwardly circling the room.

About a minute after you start circling you suddenly notice something odd. There appears to be fewer chairs than people. You count in your head again. No, you are sure this time. There are 10 people but only 7 chairs.

You tap the shoulder of the person in front of you, James from the Widget Office. ‘James. James, there are not enough chairs…’ James seems to ignore or not hear you. So you tap harder, visibly making James flinch under your stiff finger. ‘JAMES,’ you say with a strong voice over the music. ‘There are only 7 CHAIRS. But 10 of US.’ James quickly glances back with a fearful look on his face and waves his hand at you dismissively.

You feel confused. James seemed irritated and maybe a little scared by your information. You look back at Selma from the Counters Team. She is looking at you with a confident smirk. She heard your interaction with James. ‘Well, I don’t know about you. But I will have no problem finding MY seat. Maybe you should focus on your seat rather than worrying about others,’ she snaps at you. Then she looks off into the distance with her confident smirk.

Now you are really confused. Why would we be doing this? Why is everyone ignoring the problem in the room? If it is important that everyone finds a seat and there are not enough seats, then we should be focusing on getting more chairs rather than marching to the music. What happens to those left standing?

As you look around the room you realize what is happening. Some people look you in the eyes with concern. They see you understand and their eyes are almost a warning. Some look at you with coldness. You can almost feel them counting you as one of the unlucky 3. And the rest stare straight ahead looking at no one. They are focusing on being ready for when the music stops.

The music builds and builds. Its volume increases and you feel the climax coming. Then suddenly, it stops…

Failure is not always a surprise. Sometimes it is seen well ahead of time. Integrity is calling out risks even if you may not be the one left standing. The difference between a leader and a follower is the ability to ask for more chairs when the time comes. Even if it means pausing the music. The leader is concerned with the group success and not just the odds of personal failure.

And always count the chairs.

Commentary

Effort vs Credit

In life, your effort will be judged by your peers, superiors, and dependents.

Sometimes there is effort you can make which is somewhat impossible for your superiors to fully grasp. Maybe it is too technical. Perhaps they are too busy. Or it may just be your inability to communicate the value. But you know why the extra effort is valuable. You know that someday someone is not going to run into a problem because you decided to do things right. You know that you might not even be around when that time comes. Which presents every technical contributor with a problem. Do I only commit the effort that will be recognized? Or do I hold myself to a higher standard and judge myself first?

Your boss is just a single factor in what affects your life and you will have many bosses in this modern world. My opinion is you always treat your effort in life as a reflection of how you want to be seen. Even if there is no guarantee anyone will ever see it. Not because you get credit from those above you. You do it because that is the person you want to be.

The funny thing about integrity in your work is that it does not go completely unnoticed. Your peers will see it. If you have subordinates they will pick up on it. And as it bleeds into your personal life, your family will see it as well.

There are a lot of things that can rot away at our integrity. Unrecognized work, wasted effort from poor communication, random events, and much much more. Sometimes it is hard to remember why you should do the right thing when the wrong thing wins.

In life, your effort can win more than credit.

It can make a better you.

NOTE: It seems some see this post as a rant about my current position. It is not in the slightest. In fact this post is all about letting go of trying to get recognition for things you can’t and instead focusing on integrity first. Just to be clear, this post is about my developing philosophy and not me complaining about the wonderful role I am in currently.