if !blogClogged

Software development and other stuff.

TFS Work Item Type Definition Field Name, RefName, Label Extract with Regex


I’m no whiz at Regular Expressions (regex), which you will see later in this entry where I post some regex commands. A number of times I have wanted to extract field information from the TFS Work Item Type Definition (WITD), which can be exported as XML. Usually I am with a client and want a list of custom or standard field names, and perhaps also a list of the same fields, or a subset of them, used in the form. I like to compare the latter with the former to see which fields may be defined but not exposed.

More often than not my WITD file layout is nicely consistent with every other time I have exported a WITD. Naturally I’d like to use an XSLT or a regex to extract the data I want. Actually, I’d really like to use an XSLT so I could reformat the values I want into a tab-delimited file that I could easily load into Excel or other table-like structures. My XSLT is even worse than my regex. So each time I end up with regex commands, and the next time I can’t recall where I put the file I stored them in for future use, so I recreate them. I think I get better at regex every time I have to recreate them. When you look at the regex below you’ll get an idea of how horrible I was if these are better. Maximum performance is not a concern; I’m running each regex on demand against a handful of files.

With all that said, the caveat is that, due to the use of lookahead and lookbehind constructs (and likely some other constructs), these regexes will not work with every editor or regex engine. I have not verified them in C#, although the .NET engine is one of the few that accepts variable-length lookbehind. I have not tried them in Cygwin, but I believe I tried them with the Windows findstr command and they failed. Notepad++’s regex engine does not accept variable-length lookbehind, so they most certainly don’t work there.

They did work superbly well in EditPad Pro 7. This product comes from Just Great Software, which is owned by Jan Goyvaerts. Just Great Software makes not only EditPad Pro but also RegexBuddy, PowerGREP, RegexMagic and some other tools I am not as familiar with. ***This is just a personal comment on EditPad Pro: I get nothing for saying this.*** If you want to use an editor with regex you should really check out EditPad Pro.

Not only does it handle regex, it handles crazy regex like mine, which include lookbehinds containing multiple 0-n character wildcards. Most engines don’t like indeterminate-length constructs such as wildcard quantifiers inside a lookbehind, and many don’t like much more than the most rudimentary expression there. The engine inside EditPad Pro handled lookbehinds with a non-capturing group containing text values separated by a logical or, e.g. (?:Microsoft|System), plus multiple wildcard references in the same lookbehind, and didn’t complain a bit.

When EditPad Pro has highlighted the matching portions of the lines you can cut the highlighted parts from the document or copy the results. When copying the results you get a nice list with a line of results for each line of the parsed document. The copy pastes perfectly into Excel.
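As a quick illustration of the engine caveat, here is a minimal Python sketch. It assumes Python’s built-in re module, which is one of the engines that insists on fixed-width lookbehind; the helper name compiles is made up for this example.

```python
import re

# One of the patterns from this post. The lookbehind contains .*? so its
# width is not fixed.
PATTERN = r'(?<=<Control.*? FieldName=")(?!System|Microsoft)[^"]+'


def compiles(pattern):
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(PATTERN))                    # variable-width lookbehind
print(compiles(r'(?<= FieldName=")[^"]+'))  # fixed-width lookbehind
```

The first call reports False and the second True, so in an engine like this one the lookbehind has to be shortened to a fixed-width token or replaced with a capture group.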

Now for the regex. This is as much for me to be able to remember as it is for you to reference.

#extract custom field refnames

(?<=<Field.*?name=".*?refname=")(?!System|Microsoft)[^"]+

This regex gets the refname for each defined custom field in a work item type definition. It looks for <Field (with the search set to case-insensitive), some characters followed by name=", then some more characters followed by refname=". All of that must come before (the lookbehind) a run of characters that cannot start with the word System or Microsoft (the refname we are interested in on that line). Those characters of interest end at a ", but the ending quotation mark is not selected. This description is not exactly how the regex engine finds the text; for a good understanding check out Regular-Expressions.info. I won’t explain it correctly if I try, and the explanations at Regular-Expressions.info are much better. That is also a website by Jan Goyvaerts.
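For engines without variable-length lookbehind, the same idea can be sketched with a capture group. This is only a rough Python port, not what I ran in EditPad Pro: the text that sat in the lookbehind is consumed normally and the refname lands in group 1. The sample lines and the Contoso.* refnames are invented for illustration.

```python
import re

# Hypothetical WITD fragment; the Contoso.* fields are made up.
SAMPLE = '''\
<FIELD name="Title" refname="System.Title" type="String" />
<FIELD name="Priority" refname="Microsoft.VSTS.Common.Priority" type="Integer" />
<FIELD name="Customer Ref" refname="Contoso.CustomerRef" type="String" />
<FIELD name="Root Cause" refname="Contoso.RootCause" type="String" />
'''

# The lookbehind text is matched as ordinary text and the refname is
# captured in group 1. The negative lookahead is unchanged.
CUSTOM_REFNAME = re.compile(
    r'<Field.*?name=".*?refname="((?!System|Microsoft)[^"]+)',
    re.IGNORECASE,
)


def custom_refnames(witd_text):
    """Return the refname of every custom field definition, in document order."""
    return CUSTOM_REFNAME.findall(witd_text)


print(custom_refnames(SAMPLE))
# ['Contoso.CustomerRef', 'Contoso.RootCause']
```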

The above regex can break down in any number of ways. The most notable is that it relies on the token System or Microsoft, as used at the start of the refname of out-of-the-box fields. If you happened to use System or Microsoft as the leading part of the refname of a custom field, that field will be excluded. The fix is to stop starting your refnames with one of those words. Frankly, I can’t imagine anyone hitting that issue.

Note that instead of a lookbehind of just (?<=<Field.*?refname=") I included name=" as part of the lookbehind. The name token differentiates the field nodes that make up the basic field definitions from the field nodes that may appear under state nodes, transition nodes and link column definitions. Without the name token as a criterion for a match, duplicate field names will appear as they are picked up from those other nodes.
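To see why the name token matters, here is a small Python sketch using capture groups in place of the lookbehind. The sample is a made-up fragment in which the same hypothetical custom field is defined once and then referenced again under a state node.

```python
import re

# Hypothetical fragment: one defining FIELD node plus a reference to the
# same field under a STATE node (no name attribute on the second one).
SAMPLE = '''\
<FIELD name="Customer Ref" refname="Contoso.CustomerRef" type="String" />
<STATE value="Resolved">
<FIELD refname="Contoso.CustomerRef"><REQUIRED /></FIELD>
</STATE>
'''

WITHOUT_NAME = re.compile(
    r'<Field.*?refname="((?!System|Microsoft)[^"]+)', re.IGNORECASE)
WITH_NAME = re.compile(
    r'<Field.*?name=".*?refname="((?!System|Microsoft)[^"]+)', re.IGNORECASE)

print(WITHOUT_NAME.findall(SAMPLE))  # picks up the state-level node as well
print(WITH_NAME.findall(SAMPLE))     # only the defining node
```

Without the name token the refname comes back twice; with it, only the defining node is returned.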

#extract only custom field names

(?!.*refname="(?:System|Microsoft))(?<=<Field.*?\sname=")[^"]+

In the regex above I am extracting the field names for the custom field entries. This is the value of the name attribute in the defining Field nodes, and the results should match the custom field refname extract in both order and count. This time, to determine whether a position is a potential match, the regex uses a lookahead. The lookahead searches for 0-n characters followed by refname=" and then either System or Microsoft. It is evaluated from the position where the candidate match would begin, not necessarily from the beginning of the line. If the lookahead passes at a position, the engine then tries the lookbehind: the position must be immediately preceded by the name=" token, preceded by whitespace (\s), preceded by 0-n characters, preceded by the text <Field. Once both the lookahead and lookbehind criteria are met, the text from that position up to the following quotation mark is selected for the result set. (My standard caveats apply to my description of how it resolves the matches.)

The big thing to notice in this instance is that the lookahead is a negative one. (?= would indicate a positive lookahead, which succeeds when its expression matches. Here (?! indicates a negative lookahead: if the text ahead matches its expression, the lookahead fails and the candidate doesn’t count. That is how fields whose refnames start with System or Microsoft are excluded from the results, and the lookahead lets us check values that come after the actual text we wish to match.
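The same negative lookahead can be tried in Python with a capture group standing in for the lookbehind. Treat this as a sketch under assumptions rather than the exact expression I used: the sample lines and Contoso.* names are invented, and it assumes each Field node sits on one line.

```python
import re

# Hypothetical WITD fragment; the Contoso.* fields are made up.
SAMPLE = '''\
<FIELD name="Title" refname="System.Title" type="String" />
<FIELD name="Priority" refname="Microsoft.VSTS.Common.Priority" type="Integer" />
<FIELD name="Customer Ref" refname="Contoso.CustomerRef" type="String" />
<FIELD name="Root Cause" refname="Contoso.RootCause" type="String" />
'''

# After name=" has been consumed, the negative lookahead scans the rest of
# the line and rejects the node if refname="System or refname="Microsoft
# appears later on. Group 1 then captures the field name.
CUSTOM_NAME = re.compile(
    r'<Field.*?\sname="(?!.*refname="(?:System|Microsoft))([^"]+)',
    re.IGNORECASE,
)


def custom_field_names(witd_text):
    """Return the name attribute of every custom field definition."""
    return CUSTOM_NAME.findall(witd_text)


print(custom_field_names(SAMPLE))
# ['Customer Ref', 'Root Cause']
```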

One thing to note on the regex to extract custom field names: in the lookbehind group (?<= I used .*?\sname=" instead of .*?name=". Rather than letting the expected space just before the name attribute be absorbed by the wildcard token .*?, an explicit whitespace \s is required. Without it, the name=" portion would also match inside the attribute refname=". Word boundaries are not used in the matching, so a search succeeds any time it finds the correct sequence of characters, and the trailing name=" of refname=" satisfies .*?name=" just as well as [space]name=" does. The explicit \s is there so that the lookbehind can no longer match refname=".
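The refname=" overlap is easy to reproduce even in an engine limited to fixed-width lookbehind. The following Python sketch trims the lookbehind down to just the name token so that its width is fixed; the sample line is hypothetical.

```python
import re

# Hypothetical field definition line.
LINE = '<FIELD name="Customer Ref" refname="Contoso.CustomerRef" type="String" />'

# Without the leading whitespace the lookbehind is also satisfied by the
# tail end of refname=", so the refname value is returned as well.
loose = re.findall(r'(?<=name=")[^"]+', LINE)

# Requiring \s before name rules out refname=".
strict = re.findall(r'(?<=\sname=")[^"]+', LINE)

print(loose)   # ['Customer Ref', 'Contoso.CustomerRef']
print(strict)  # ['Customer Ref']
```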

#extract control custom field names

(?<=<Control.*? FieldName=")(?!System|Microsoft)[^"]+

I often like to extract the FieldNames for the form control entries out of the WITD. I bump these up against the refnames of the fields extracted by the prior regex examples so that I can see whether any custom fields are declared but not surfaced on the form. If I find such fields, most likely either the field is used for integration or some backend process, or the field is not valid. In this case the regex is fairly straightforward. It has a lookbehind (?<= and a negative lookahead (?!. The match itself is a run of characters ending just before a " that cannot begin with System or Microsoft (so Microsoft123 or System123 would be excluded too). If that criterion is met, the lookbehind requires the run to be immediately preceded by FieldName=", with a leading space, and somewhere before that on the line by <Control.
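The comparison of defined fields against form controls can be sketched in Python as a simple set lookup. This is an illustration only: the patterns are capture-group ports of the lookbehind versions, the helper name unsurfaced_fields is made up, and the sample fragment with its Contoso.* fields is hypothetical.

```python
import re

# Hypothetical fragment: three field definitions, two form controls.
SAMPLE = '''\
<FIELD name="Title" refname="System.Title" type="String" />
<FIELD name="Customer Ref" refname="Contoso.CustomerRef" type="String" />
<FIELD name="Sync Token" refname="Contoso.SyncToken" type="String" />
<Control FieldName="System.Title" Type="FieldControl" Label="Title" />
<Control FieldName="Contoso.CustomerRef" Type="FieldControl" Label="Customer reference" />
'''

DEFINED = re.compile(
    r'<Field.*?name=".*?refname="((?!System|Microsoft)[^"]+)', re.IGNORECASE)
ON_FORM = re.compile(r'<Control.*? FieldName="((?!System|Microsoft)[^"]+)')


def unsurfaced_fields(witd_text):
    """Return custom refnames that are defined but have no form control."""
    on_form = set(ON_FORM.findall(witd_text))
    return [ref for ref in DEFINED.findall(witd_text) if ref not in on_form]


print(unsurfaced_fields(SAMPLE))
# ['Contoso.SyncToken']
```

Anything in that list is a candidate for an integration field, a backend-only field, or a leftover.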

#extract control custom field labels

(?<=<Control.*?FieldName.*?Label=")(?<!FieldName="(?:System|Microsoft).+?)[^"]+

I often like to pull the custom field label used on the control, as it is often more informative about the meaning of the field than the field’s name or refname. The regex just above extracts the label. The result order and row count should match the output from the regex that extracts the control custom field names. In this instance the regex uses two lookbehind groups. One is a normal positive lookbehind, which succeeds on a match, while the other is a negative lookbehind, which succeeds when there is no match. The negative lookbehind (?<! tries to match FieldName="System or FieldName="Microsoft plus some additional characters. If it matches, the lookbehind fails, which is what restricts the regex to the labels of custom fields. The positive lookbehind (?<= defines the expression that must be found just prior to the text we want. That text is matched by [^"]+, which selects everything up to, but not including, the closing ".
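A rough Python equivalent folds both lookbehinds into one forward pattern: the negative lookbehind becomes a negative lookahead placed right after FieldName=", and the label is captured in group 1. The sample controls and labels are invented, and the sketch assumes FieldName comes before Label on the same line, as the original regex does.

```python
import re

# Hypothetical form controls; the Contoso.* entries are made up.
SAMPLE = '''\
<Control FieldName="System.Title" Type="FieldControl" Label="Title" LabelPosition="Left" />
<Control FieldName="Microsoft.VSTS.Common.Priority" Type="FieldControl" Label="Priority" LabelPosition="Left" />
<Control FieldName="Contoso.CustomerRef" Type="FieldControl" Label="Customer reference" LabelPosition="Left" />
<Control FieldName="Contoso.RootCause" Type="FieldControl" Label="Root cause" LabelPosition="Left" />
'''

# Reject the control if its FieldName value starts with System or
# Microsoft, otherwise capture the Label value that follows.
CUSTOM_LABEL = re.compile(
    r'<Control.*?FieldName="(?!System|Microsoft).*?Label="([^"]+)')


def custom_control_labels(witd_text):
    """Return the Label of every control bound to a custom field."""
    return CUSTOM_LABEL.findall(witd_text)


print(custom_control_labels(SAMPLE))
# ['Customer reference', 'Root cause']
```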

Without doubt these regular expressions can be improved upon. I’m lucky to get them built at all it often seems. This may be of value to some folks like myself that have a desire to extract out specific information from the TFS work item type definition file without a lot of hassle. Oh and yes there are other ways to do these things without using regex but none are likely as fun.
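For completeness, here is one of those other ways, sketched in Python with the standard library’s xml.etree.ElementTree instead of regex or XSLT. It writes tab-delimited rows of name, refname and label for the custom fields. The sample document is a simplified, made-up WITD (a real export has more attributes and nodes), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical WITD; the Contoso.* fields are made up.
SAMPLE = '''\
<WITD application="Work item type editor" version="1.0">
  <WORKITEMTYPE name="Bug">
    <FIELDS>
      <FIELD name="Title" refname="System.Title" type="String" />
      <FIELD name="Customer Ref" refname="Contoso.CustomerRef" type="String" />
      <FIELD name="Sync Token" refname="Contoso.SyncToken" type="String" />
    </FIELDS>
    <WORKFLOW>
      <STATES>
        <STATE value="Resolved">
          <FIELDS>
            <FIELD refname="Contoso.CustomerRef"><REQUIRED /></FIELD>
          </FIELDS>
        </STATE>
      </STATES>
    </WORKFLOW>
    <FORM>
      <Layout>
        <Control FieldName="System.Title" Type="FieldControl" Label="Title" />
        <Control FieldName="Contoso.CustomerRef" Type="FieldControl" Label="Customer reference" />
      </Layout>
    </FORM>
  </WORKITEMTYPE>
</WITD>
'''


def is_custom(refname):
    """Same rule as the regex: custom means not starting with System/Microsoft."""
    return not refname.startswith(("System", "Microsoft"))


def custom_field_rows(witd_xml):
    """Return tab-delimited 'name, refname, label' rows for custom fields."""
    root = ET.fromstring(witd_xml)
    # Map each control's FieldName to its Label (empty if no Label is set).
    labels = {c.get("FieldName"): c.get("Label", "")
              for c in root.iter("Control") if c.get("FieldName")}
    rows = []
    # This path selects only the defining FIELD nodes, not the ones that
    # appear under states or transitions.
    for field in root.findall("./WORKITEMTYPE/FIELDS/FIELD"):
        refname = field.get("refname", "")
        if is_custom(refname):
            rows.append("\t".join(
                [field.get("name", ""), refname, labels.get(refname, "")]))
    return rows


print("\n".join(custom_field_rows(SAMPLE)))
```

A field with an empty third column is defined but not on the form, and the output pastes into Excel the same way the EditPad Pro results do.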


Written by Michael Ruminer

June 28, 2013 at 2:06 am
