Motivated by insights into the human teaching process, we introduce a method for incorporating unstructured natural language into imitation learning. At training time, the expert can provide demonstrations along with verbal descriptions that convey the underlying intent, e.g., ``Go to the large green bowl''. The training process then relates these modalities to one another, encoding the correlations between language, perception, and motion. The resulting language-conditioned visuomotor policies accept new human commands and instructions at run time, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity. We demonstrate in a set of simulation experiments how the introduced approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compare the results to a variety of alternative methods.