This paper describes a framework that enables robots to efficiently learn human-centric models of their environment from natural language descriptions. Typical semantic mapping approaches are limited to augmenting metric maps with higher-level properties of the robot’s surroundings (e.g. place type, object locations) that can be inferred from the robot’s sensor data, but do not use this information to improve the metric map. The novelty of our algorithm lies in fusing high-level knowledge that people can uniquely provide through speech with metric information from the robot’s low-level sensor streams. Our method jointly estimates a hybrid metric, topological, and semantic representation of the environment. This semantic graph provides a common framework in which we integrate information that the user communicates (e.g. labels and spatial relations) with metric observations from low-level sensors. Our algorithm efficiently maintains a factored distribution over semantic graphs based upon the stream of natural language and low-level sensor information. We detail the means by which the framework incorporates knowledge conveyed by the user’s descriptions, including the ability to reason over expressions that reference yet unknown regions in the environment. We evaluate the algorithm’s ability to learn human-centric maps of several different environments and analyze the knowledge inferred from language and the utility of the learned maps. The results demonstrate that the incorporation of information from free-form descriptions increases the metric, topological, and semantic accuracy of the recovered environment model.