Monday, June 30, 2014

Troubleshooting: How to Spot Configuration Issues

Configuration issues can be hard to find.  The issue could be as simple as a missing item, as challenging as an issue introduced during the troubleshooting process or as complex as a configuration item that was set up and conflicts with the current objective.

Spotting Issues with Identities, Role Definitions and Assignments


The most common identity configuration issues have to do with:
About Identity Overrides
Identity overrides can be used to assign an AD user more than one UNIX identity.  Overrides can happen at the zone, child zone or computer level.  Although identity overrides showcase the flexibility of the Centrify approach to UNIX identity management, they are generally a very bad idea.  Overrides promote identity fragmentation within the solution, and in specific use cases (like filers) they can be a challenge.  Ultimately, identity overrides promote what's known as a centralized mess: administration of identities is centralized in AD, but the UNIX namespace is not rationalized.

Nonetheless, overrides can be good for: migrations while rationalizing a namespace, legacy systems, mergers and acquisitions between different UNIX environments and even one-offs.

Classic zones traditionally used the nss.passwd.override parameter.  This troubleshooting series purposely ignores that parameter, since classic zones are not covered in this blog.

Determining Identity Overrides
The agent uses the PAMGetUnixName call to determine the user's UNIX login name.  The key questions to ask here are:

a) Is this change effective?
b) Is this a system local override (bad idea)?  If it is, has the centrifydc daemon been restarted?

The best way to make sure that system local overrides are effective is to restart the agent.  This is the only identity-related situation in which the agent has to be stopped and started.  The behavior varies in the latest versions of the agent, but I'd do a restart to be sure.

Spotting Authorization Issues


For RBAC to work, roles have to be effective, properly created, properly assigned and properly scoped.
The most common authorization issues have to do with:
  • Misconfigured roles
  • No rights or missing rights in the role
  • Improperly defined roles (another topic)
  • Unassigned roles
  • Poor or wrong scoping
Role configuration issues

1. Role is time-bound: this happens when the role has a time restriction and the user complains that they can't log in at certain times.

2. Role is not configured properly (System Rights):   this is very common. A role is created and properly assigned, but the role creator forgets to set the attributes for the role.  This pic illustrates the main issues:

3. Audit is being enforced (required):  the end user can log in to systems with standard edition, but not enterprise edition.

4. Roles that do not contain any rights:  this is very common.  A role is properly set up and properly assigned, but the creator forgets to add the proper PAM login or commands.  From UNIX, you can run an elevated dzinfo on the target user that is having the issue with the role.

$ dzdo dzinfo yash
...
  PAM Application  Avail Source Roles
  ---------------  ----- --------------------


Privileged commands:
  Name             Avail Command               Source Roles
  ---------------  ----- --------------------  --------------------
  (yash has no privileged command rights)

This excerpt of dzinfo indicates an empty role (no PAM or privileged commands); a role needs at least a PAM right to allow the assignee of the role to log in.  You can also verify this in Access Manager by clicking on the role in the left pane and looking at the detail in the right pane.
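If you want to automate this check, the dzinfo text can be scanned for the two symptoms of an empty role. The following is a minimal Python sketch, not a Centrify tool: it assumes the dzinfo output is laid out like the excerpt above, and the function name summarize_dzinfo is made up for illustration.

```python
def summarize_dzinfo(text):
    """Return (pam_rights, no_commands) from dzinfo-style text.

    Assumption: PAM rights are listed one per row under the
    "PAM Application" header, as in the excerpt above.
    """
    pam_rights = []
    in_pam_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("PAM Application"):
            in_pam_table = True
        elif in_pam_table:
            if not stripped or stripped.endswith(":"):
                # A blank line or the next section header ends the table.
                in_pam_table = False
            elif not set(stripped) <= set("- "):
                # Skip the dashed ruler; the first column is the right's name.
                pam_rights.append(stripped.split()[0])
    no_commands = "has no privileged command rights" in text
    return pam_rights, no_commands


SAMPLE = """\
  PAM Application  Avail Source Roles
  ---------------  ----- --------------------


Privileged commands:
  Name             Avail Command               Source Roles
  ---------------  ----- --------------------  --------------------
  (yash has no privileged command rights)
"""

rights, no_commands = summarize_dzinfo(SAMPLE)
if not rights and no_commands:
    print("empty role: no PAM rights and no privileged commands")
```

An empty list of PAM rights is the part that matters for logins; the missing privileged commands only affect elevation.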

5. Unassigned Roles

The RBAC cadence in Access Manager is as follows:
  1. Create the PAM rights and Commands
  2. Create and configure the role
  3. Assign the rights to the role
  4. Assign the role to an AD principal (preferably a group) at the correct scope  (zone, child zone, computer role or computer (override))
  5. Populate the group and make sure the users have UNIX identities in the zone.
Unfortunately, a lot of end users think that the role is already assigned once they have performed steps 1 to 3.  To make matters worse, the design itself can make problems harder or easier to troubleshoot.  My recommendation is:
  • Identities are assigned at the highest levels with overrides used as a last resort.
  • Role assignments are performed at the computer role or child zone (with high-level roles assigned for a very small set of trusted administrators)
  • Very rarely use system overrides.
The tools available to perform high-level troubleshooting are:

a) The user's effective rights utility of Access Manager:
Do not attempt to troubleshoot at the system level if things don't look correct here (it may be a time issue).

b) The dzinfo command
c) The Centrify Report Center

6.  Role assigned, but expired

With Centrify, UNIX (or Windows) roles can be assigned during a specific window of time (break/fix, change control, etc).  An expired role can stop the user from logging in or performing admin duties.
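The logic behind an expired (or not yet active) assignment is a simple window check. Here is a small Python sketch of that reasoning; the function and its arguments are hypothetical and do not correspond to any Centrify API. In practice you would read the start and end times from the role assignment in Access Manager.

```python
from datetime import datetime


def assignment_is_active(now, start=None, end=None):
    """True when `now` falls inside the assignment window.

    `start` or `end` set to None means that side of the window is open.
    """
    if start is not None and now < start:
        return False  # the assignment has not started yet
    if end is not None and now >= end:
        return False  # the assignment has expired
    return True


# Hypothetical break/fix window: a role granted for one evening only.
window_start = datetime(2014, 6, 30, 18, 0)
window_end = datetime(2014, 6, 30, 22, 0)
print(assignment_is_active(datetime(2014, 6, 30, 23, 15), window_start, window_end))
```

If the check fails for the time the user tried to log in, the problem is the assignment window and not the role definition.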


Bottom line: understanding why you're doing what you're doing is crucial, along with a configuration that makes sense.

Tuesday, June 24, 2014

Utilities: addebug

Background

The addebug command controls the debug logging feature of the Centrify client for UNIX, Linux and Mac OS X.  Logs are written to /var/log/centrifydc.log; on HP-UX, however, the location is /var/admin/syslog.  Only turn on debugging while you're troubleshooting a problem.  You'll have to elevate (with sudo or dzdo) to use addebug and to review the log.

Location

The utility is located in the /usr/share/centrifydc/bin folder.

Basic Usage
  • Use /usr/share/centrifydc/bin/addebug on to start debugging
  • Use /usr/share/centrifydc/bin/addebug off to stop debugging 
  • Use /usr/share/centrifydc/bin/addebug clear to clear the logs
For more information, read the manual page for addebug  (man addebug).

What to look for

Modules
Centrify implements directory lookups with the Name Service Switch (NSS) and authentication with Pluggable Authentication Modules (PAM).  This means that you need to become familiar with some of these calls:
  • NSS calls:  These are name service switch function calls.  For example (oversimplifying), an application may use a call to determine the user's UID from the login name.  These calls start with "NSS".
  • PAM calls:  These function calls implement the account, authentication, session and password modules of the solution.  These calls start with "pam_".
File Descriptors
FDs identify the transactions; they make it easy for the log reader to follow the same transaction.  They are labeled with "fd:nn"  (nn is the descriptor number).

Keywords, Phrases and Functions
During the troubleshooting process, you'll become familiar with several keywords that will help you determine what happened during the transaction.  For example, here are a few:
  • pam_sm_authenticate:  search for this call to determine the beginning of PAM authentications.
  • "User is ours" / "User is not ours":  This phrase appears in a file descriptor when the function PAMUserIsOurResponsibility determines that the user is indeed an AD user that needs to be processed or not.
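To follow one transaction through a busy log, it can help to group lines by their "fd:nn" label and then look for the keywords above. The Python sketch below is only an illustration: the sample lines are synthetic placeholders (the real centrifydc.log layout will differ), and the only things taken from this post are the "fd:nn" marker and the keyword strings.

```python
import re
from collections import defaultdict

FD_MARKER = re.compile(r"fd:(\d+)")
KEYWORDS = ("pam_sm_authenticate", "User is ours", "User is not ours")


def group_by_fd(lines):
    """Group log lines by the descriptor number found in their fd:nn label."""
    groups = defaultdict(list)
    for line in lines:
        match = FD_MARKER.search(line)
        if match:
            groups[int(match.group(1))].append(line.rstrip("\n"))
    return dict(groups)


def keyword_hits(lines):
    """Return (keyword, line) pairs for every known keyword found."""
    return [(kw, line) for line in lines for kw in KEYWORDS if kw in line]


# Synthetic placeholder lines, NOT real agent output.
sample = [
    "(synthetic) fd:21 pam_sm_authenticate (details)",
    "(synthetic) fd:22 unrelated lookup",
    "(synthetic) fd:21 User is ours",
]

for keyword, line in keyword_hits(group_by_fd(sample)[21]):
    print(keyword, "->", line)
```

Against a real system you would pass an open file handle for the log to group_by_fd instead of the sample list; remember that reading the log requires elevation.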

Monday, June 23, 2014

Troubleshooting: Ruling-out issues with the user's AD Account

Verifying that the user's AD account is OK

To verify that the user's account is OK you'll need access to any of these tools:

  • Active Directory Users and Computers
  • Centrify Access Manager
  • The Unix/Linux command line of a Centrified system

You MUST know how provisioning happens in your organization.  Otherwise you're trying to troubleshoot something that you DON'T understand.  How smart is that?

What you're looking for to rule out issues with the AD account

The user account has to be enabled, not locked and not expired.  If it has restrictions (logon hours/logon to), those must make sense.  The user also needs a UNIX identity (login, UID, GID, etc.) and, if you are granting roles based on group membership, has to belong to the right group.
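The conditions above can be written down as a checklist that reports every failing item at once. This is a minimal Python sketch; the AccountStatus fields are hypothetical names that you would fill in by hand from ADUC, Access Manager or adquery, not values returned by any Centrify or AD API.

```python
from dataclasses import dataclass


@dataclass
class AccountStatus:
    # Hypothetical fields, filled in manually from the tools listed above.
    enabled: bool = True
    locked: bool = False
    expired: bool = False
    restrictions_allow_logon: bool = True
    has_unix_identity: bool = True
    in_role_group: bool = True


def account_problems(status):
    """Return a list of human-readable problems; empty means the account is OK."""
    checks = [
        (status.enabled, "account is disabled"),
        (not status.locked, "account is locked"),
        (not status.expired, "account is expired"),
        (status.restrictions_allow_logon, "logon hours / logon-to restrictions block this logon"),
        (status.has_unix_identity, "no UNIX identity in the zone"),
        (status.in_role_group, "not a member of the role-granting group"),
    ]
    return [message for ok, message in checks if not ok]


print(account_problems(AccountStatus(has_unix_identity=False)))
```

Reporting all failures together, rather than stopping at the first one, matches how these issues tend to show up: a deprovisioned user is often missing both the identity and the group membership.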

From Windows with Active Directory Users and Computers

The account tab of the AD user contains most of the relevant info.


The Centrify Profile tab can expose the user's UNIX identities.

Finally, the "Member Of" tab shows the groups the user belongs to, which matters if any of those groups is used for role-based access.  I know to look in here because I understand the process.



From Windows using Centrify Access Manager 

Determining if the user has an identity

Under the Zone/Unix Data/Users node, you can see a list of users.  Double-click the user to see the identity. If the user is not on the list, you may be facing time issues or provisioning issues.

Determining if the user has a role
The "Show Effective UNIX rights" utility is the best tool to determine who has access to which system and at which level.  Just right-click the zone or system and select the function.


From Unix/Linux using CLI tools
You can use adquery user <user> with the -A switch.  Depending on how it is run, it may produce different results.  If run elevated (with dzdo in this case), it provides all the information required about the account.
This output tells me that the user has a Unix Identity (because unixname, uid, gid, etc are present), the account is neither expired nor locked, and judging from the output of dzinfo, the user does have a role that allows him to log in and even some privileged commands.

Compare with this output. Here I deprovisioned the user, note the output of a simple adquery user <user> that states that the user is not a zone user.  In this case the user is missing the UNIX identity, so we used the AD user logon name to get the information about the user with the -A switch.
This looks like a good AD account that does not have a Unix Identity. 

Let's put all this together in this video:


Using Switch User (su) or Kerberos tools
If you have a privileged account, switch user (su) can be used to isolate a particular application (like SSH) from a troubleshooting situation.  Assuming the user does have an identity and a role, but can't log in via SSH, you can try to elevate.

[george@cen1 ~]$ dzdo su cosmo
AD Password:
Created home directory
[cosmo@cen1 george]$ id
uid=1149240408(cosmo) gid=1149240408(cosmo) groups=1149240408(cosmo),10024(webadmin) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[cosmo@cen1 george]$

In this sequence, George elevated with Centrify sudo to switch to Kramer. If Kramer is having issues with SSH login, this eliminates his Centrify identity and role from the troubleshooting table.

If you want to take things further and eliminate both the logon application (that may use PAM) and NSS and test the AD account directly, you can use kinit and klist from  /usr/share/centrifydc/kerberos/bin.

[george@cen1 ~]$ /usr/share/centrifydc/kerberos/bin/kinit cosmo.kramer
Password for cosmo.kramer@CORP.CONTOSO.COM:
[george@cen1 ~]$ /usr/share/centrifydc/kerberos/bin/klist
Ticket cache: FILE:/tmp/krb5cc_cdc1149240406_EINhqm
Default principal: cosmo.kramer@CORP.CONTOSO.COM

Valid starting     Expires            Service principal
06/22/14 14:11:36  06/23/14 00:11:40  krbtgt/CORP.CONTOSO.COM@CORP.CONTOSO.COM
        renew until 06/23/14 14:11:36

I would argue that using kinit/klist is the fastest way to determine if any AD account (if the credentials are known) is working correctly. It also ensures that the communication path between the Centrify agent and AD is working as expected. Notice that I did not use the user's unix name (cosmo) but their AD account (cosmo.kramer).
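If you run this check often, the klist output can be read by a script. The Python sketch below parses the transcript shown above and reports whether a ticket-granting ticket is valid at a given time. It assumes the date format of that transcript (MM/DD/YY HH:MM:SS); other klist builds or locales may print dates differently, and the function names are made up for illustration.

```python
import re
from datetime import datetime

TIME_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP = r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}"
TICKET_LINE = re.compile(rf"^({STAMP})\s+({STAMP})\s+(\S+)$")

SAMPLE_KLIST = """\
Ticket cache: FILE:/tmp/krb5cc_cdc1149240406_EINhqm
Default principal: cosmo.kramer@CORP.CONTOSO.COM

Valid starting     Expires            Service principal
06/22/14 14:11:36  06/23/14 00:11:40  krbtgt/CORP.CONTOSO.COM@CORP.CONTOSO.COM
        renew until 06/23/14 14:11:36
"""


def parse_klist(text):
    """Return (default_principal, [(start, expires, service), ...])."""
    principal = None
    tickets = []
    for line in text.splitlines():
        if line.startswith("Default principal:"):
            principal = line.split(":", 1)[1].strip()
            continue
        match = TICKET_LINE.match(line.strip())
        if match:
            start, expires, service = match.groups()
            tickets.append((
                datetime.strptime(start, TIME_FORMAT),
                datetime.strptime(expires, TIME_FORMAT),
                service,
            ))
    return principal, tickets


def has_valid_tgt(text, now):
    """True if a krbtgt ticket in the listing is valid at `now`."""
    _, tickets = parse_klist(text)
    return any(
        service.startswith("krbtgt/") and start <= now < expires
        for start, expires, service in tickets
    )


principal, _ = parse_klist(SAMPLE_KLIST)
print(principal, has_valid_tgt(SAMPLE_KLIST, datetime(2014, 6, 22, 15, 0, 0)))
```

A valid krbtgt entry for the AD principal means the credentials and the path to the domain controller are fine, so any remaining problem is on the identity, role or application side.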

In the next posting, we'll continue our troubleshooting and will look at what happens in the logs.

Troubleshooting: Understanding How Time Affects When Changes are Effective

The best way to summarize how time works in an AD environment that leverages Centrify agents for Unix, Linux and Mac OS X is to look at this formula:

Effective Changes = Σ(Provisioning, AD Replication, Cache Flush Interval)

Changes are additions, deletions or modifications of AD objects (LDAP).  This excludes real-time Kerberos transactions like authentication or password changes.

Provisioning


In Centrify for Unix/Linux, we call provisioning the action of assigning an existing AD principal (user or group) a UNIX identity (login,UID,GID,Home,GECOS,Shell).  In the case of users they also need a role to be able to log into systems; both actions can happen manually or automatically (via an Identity Management Solution that leverages the Centrify APIs, via the  Zone Provisioning Agent utility, or programmatically via Centrify PowerShell or Centrify adedit).
For example, in my environment I use ZPA and I assign the roles to AD groups so the management is simplified.  I've also nested my role-granting-group into the provisioning group.  This means that my modified equation is:

Effective Changes @ Contoso = AD Replication of Group Membership + ZPA Polling Interval + AD Replication of ZPA Provisioning + Cache Flush Interval

Notice how my provisioning design has an impact on time.  I have potentially two actions that require AD replication and an agent that I've set up to poll every 15 minutes.


This means that a provisioning action can take as long as 15:30 from a provisioning + AD perspective in an intra-site scenario assuming (in the case of a user) that both the identity and the role were granted at the same time.

AD Replication

If you're relatively new to Active Directory and don't have an idea on how AD replication works, read these links and come back:

Basic Concepts: http://technet.microsoft.com/en-us/library/cc731537%28v=ws.10%29.aspx
How it works:  http://technet.microsoft.com/en-us/library/cc772726(v=ws.10).aspx 

In case you did not read the links above, the basic problem is that when you have a replicated database, changes take time to propagate.  AD sites (groups of well-connected subnets) have internal (intra-site, shorter) and external (inter-site, longer) replication periods.

Intra-site replication is somewhat predictable.  A DC will notify its nearest replication partner of a change within 15 seconds, and this will cascade within the site.  Older versions (like Windows 2000) were set at 5 minutes.

Unfortunately, in larger global environments AD inter-site replication times vary.  It all depends on how the AD team has tuned the environment based on the inter-network topology.  This is why, to be an effective Centrify administrator, you need to be in constant communication with your AD team.  Replication affects availability for users and definitely affects your SLAs.  The best solution at a higher level is for the AD team to follow Microsoft's recommendations for AD replication in large environments and to maintain current subnet, site and domain controller information.  This is very important.


That being said, there are things that you can do to make sure things happen faster. 
For example, if there is a new add, move or change and the target is a key server in a specific location, you can log into the server and find out which domain controller it is currently talking to with the adinfo command (or adinfo --server).  If you're provisioning via ADUC, adedit or Access Manager, make sure you're talking to the same DC.  At that point you have basically eliminated AD replication time from the equation, and you can issue an adflush once the changes are made.


If you're using the Centrify PowerShell cmdlets, you can run the echo %logonserver% command in a Windows command prompt to find out which domain controller you're currently talking to.

Centrify Agent Cache

We've talked about the cache in previous posts; all you need to know here is that, to improve performance and provide high availability, the Centrify agent for Unix, Linux and Mac does not continuously query AD for changes; by default, it refreshes every hour.

Putting it all together

Looking back at my example, this means that in an intra-site scenario with two domain controllers, like my Contoso lab environment, the length of time for a user change to become effective can be as long as 75:30 minutes: up to 15 seconds for replication of the provisioning/role-granting action, up to 15 minutes for ZPA to poll, 15 additional seconds for the ZPA change to propagate, and up to 60 minutes for a Centrified system to update its cache.
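The worst case is simply the sum of the upper bounds of each step, so it is easy to recompute for your own design. Here is a small Python sketch that adds up the four intervals listed in this paragraph; the constant and function names are mine, and the values are the ones from my lab, so substitute your own ZPA polling interval and cache flush interval.

```python
# Upper bounds, in seconds, for each step of the lab example.
AD_REPLICATION_S = 15        # intra-site change notification
ZPA_POLL_S = 15 * 60         # ZPA polling interval in this lab
CACHE_FLUSH_S = 60 * 60      # default agent cache flush interval


def worst_case_seconds(steps):
    """Worst case: every step takes its maximum, one after the other."""
    return sum(steps)


def as_minutes_seconds(total_seconds):
    """Format a number of seconds as MM:SS."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


contoso_steps = [
    AD_REPLICATION_S,  # group membership replicates
    ZPA_POLL_S,        # ZPA notices the new member
    AD_REPLICATION_S,  # ZPA's provisioning change replicates
    CACHE_FLUSH_S,     # the agent refreshes its cache
]

print(as_minutes_seconds(worst_case_seconds(contoso_steps)))
```

Dropping the last entry from the list gives the provisioning plus AD portion on its own, which is the part you can shorten by talking to the same DC as the target system and issuing an adflush.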

How to perform manual adds/moves/changes in an effective manner

  1. Determine the key system, and issue an adinfo command to determine the domain controller the agent is talking to.  (adinfo --server)
  2. With that information, connect your ADUC or Access Manager consoles to the target DC.  (on AM, use the "Connect to remote forest" option; on ADUC use the "Change Domain Controller" option).
  3. Perform your changes in AD (add/moves/changes)
  4. Perform an adflush in the target system  (if it's a local override, you need to restart the agent)
  5. Verify the changes with adquery, dzinfo, etc.
Note:  Flushing the cache (by interval or manually with adflush) is an expensive operation.  I recommend that you keep the default cache flush interval of 3600 seconds (one hour) and try to establish a proper Service Level Agreement for these operations.
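The agent-side part of the procedure above (steps 1, 4 and 5) can be written down as a repeatable command list. The Python sketch below only builds and prints the command lines; it does not run them. The commands themselves are the ones used in this post, but the helper name is made up, and running adflush elevated with dzdo is my assumption. Steps 2 and 3 still happen in the Windows consoles.

```python
import shlex


def manual_change_commands(user, elevate="dzdo"):
    """Return the agent-side commands for steps 1, 4 and 5 as argv lists."""
    return [
        ["adinfo", "--server"],                    # step 1: which DC is the agent using?
        [elevate, "adflush"],                      # step 4: flush the cache (elevation assumed)
        [elevate, "adquery", "user", "-A", user],  # step 5: verify the identity
        [elevate, "dzinfo", user],                 # step 5: verify the roles
    ]


for argv in manual_change_commands("cosmo"):
    print(shlex.join(argv))
```

If the change was a local override, remember that the flush is not enough and the agent has to be restarted, as noted in step 4.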

Sunday, June 22, 2014

Troubleshooting: My brand new users aren't available or can't log in

Background

Probably the most common troubleshooting task is a user's availability in a system: you grant a brand new user (or users) access to a system, and the user is either not available to the system or can't log in.
In this initial article, we will cover the mechanisms to minimize replication and techniques on how to rule out issues with the user's AD account.

What you've probably done so far:
  1. You set up a Centrify zone or child zone with the corresponding user and group defaults
  2. You've configured authorization (computer roles, UNIX rights, roles and role assignments)
  3. Installed the Centrify agent and joined one or two Unix or Linux systems
  4. You're trying to test a user or two but have no luck
Symptoms:
Unfortunately, your user does not show up in the system (adquery user does not show the user or users).  Possibly the user shows up but can't log in, or the user shows up and can log on only after a long (undetermined) time.

Remember:
By default in zone mode (unlike in Express mode) no users have access to the Unix, Linux (or Windows) systems that belong to the zone.  

Troubleshooting Checklist
We can use a two-category technique to troubleshoot this issue.  We can divide it into time and configuration items.  Time can be proactively managed and configurations can be verified and ruled out.
  • Enough time should have passed for
    • Any Identity Management tool to work (IdM solution, ZPA or program)
    • AD replication to complete
    • The cache flush interval to elapse (or the adflush command to be issued on the target system)
  • In order to access a Unix/Linux system, a user needs an identity and a role
    • To get an identity: the AD user has to be added to the zone to get a login, UID, GID, GECOS, Home and Shell based on the zone defaults
    • To get a role:  the role has to be assigned to a user directly or to a group that the user is a member of.
  • The role needs to have at least the intended PAM logon right for the user to be allowed to log in.
  • The role assignment has to be properly scoped (Zone, Child Zone, Computer Role or System level)
  • The role has to be properly created:  logon options, hours, auditing, etc.
  • The user's AD account should not be subject to any restrictions (logon, computer)
  • The user's AD account should be usable (not expired, disabled or locked)
  • The user is not listed in the /etc/centrifydc/user.ignore file
  • The user principal has to be readable on the correct side of the one-way trust
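One item on the checklist above, the user.ignore file, is easy to verify from a script. The Python sketch below checks whether a login name is listed. It assumes the file holds one name per line and that blank lines and lines starting with "#" can be skipped; both are assumptions about the file format, and the function names are made up for illustration.

```python
from pathlib import Path

IGNORE_FILE = "/etc/centrifydc/user.ignore"


def parse_ignore_list(text):
    """Return the set of names in user.ignore-style text (one per line, assumed)."""
    names = set()
    for raw in text.splitlines():
        name = raw.strip()
        if name and not name.startswith("#"):
            names.add(name)
    return names


def is_ignored(user, path=IGNORE_FILE):
    """True if `user` is listed in the ignore file at `path`."""
    return user in parse_ignore_list(Path(path).read_text())


# Example with inline text instead of the real file:
print("cosmo" in parse_ignore_list("root\nbin\ndaemon\n"))
```

A user who appears in this file is never looked up in AD by the agent, so no amount of provisioning or cache flushing will make that user show up.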

Where to go now?

Introduction to Troubleshooting

Troubleshooting

This is my attempt to start a troubleshooting topic for Centrify Server Suite.  I use the word 'attempt' because people have two attitudes when it comes to problem solving.
  • The wrong way (e.g. brute force):  I was tasked to do something (get from point A to point B) and suddenly I had an issue, so I use anything I can (help, Google, etc) to find references to the problem, attempt many approaches until I fix it and I move on.
    We've all used the wrong way.  The problem lies in that the issue I had wasn't relevant to the bigger goal I want to accomplish, so I pay very little or no attention.  All I want is to get from A to B.
  • The right way (e.g. understand the issue, find the root cause, solve it so it does not happen again):  Here you go to the documentation, try to understand how things work, determine the issue and implement a mechanism that stops it from happening again (even if it means rethinking how to get from A to B, or maybe determining that the journey is not from A to B, but from A to D).
    The right way is typically used when there's value to what we're doing or that we'll be stuck doing this work, so we might as well learn to do it well.

Troubleshooting Ground Rules

  • We will try to use "the right way"
  • We will summarize the problem just in case you want to solve it the "wrong way"
  • There is no structure; the troubleshooting topics will be selected at random.
  • We will try to keep things simple, but sometimes we'll go deep and show you some logs :)
  • Same rules of the blog:  no Express, no Deployment Manager, no Classic zones.

What do you need to know to be an effective Centrify SME 

  1. You need to know the basic security principles and security controls:  Centrify for Servers is a security product; it deals with Authentication, Authorization and Auditing.  You need to understand why you're doing what you're doing.
  2. You need to understand Active Directory:   I can't stress this enough.  If you only know about UNIX, Linux or Mac OS X, you are missing 60 to 70% of the knowledge needed to be an effective Centrify subject-matter expert.  You should have no doubts about these terms:  LDAP, Kerberos, Group Policy, Sites and Services, Domain Controllers, DNS, Global Catalog, FSMO Roles, SRV Records, Replication, UPN, SPN, sAMAccountName, PKI, SysVol, Built-in groups, Domain Trusts, Forest, Domain, Site, etc.
  3. Basic understanding of TCP/IP including DNS, TCP/UDP ports, Ephemeral Ports
  4. Understand Kerberos (you don't need to become an expert).
  5. Do not pretend that Windows does not exist.
  6. If you are supporting Linux/UNIX, know at least a base level of what you're doing.
  7. Know what the UNIX Name Service Switch (NSS) and Pluggable Authentication Modules (PAM) frameworks are about.
  8. For User Suite:  Read a Mac Admin Book, Read about federation standards (federation <> authentication), Office 365, etc.

My Advice

The best advice I can give you when it comes to troubleshooting is to always step back and ask yourself (without any implementation details) - "what is it that I want to accomplish?" "does it make sense?"
If you are trying to do something and you don't know WHY you're doing it - you are in trouble.
Also, recognizing competency gaps is important; if you don't have a grasp on the knowledge outlined above you may be in over your head;  the good thing is that we can always learn.

To find troubleshooting topics, follow the category bar.

Good Luck!!