I need to crawl a site periodically to check whether its URLs are available. For this I am using crawler4j.
My problem is with pages that disable robots via <meta name="robots" content="noindex,nofollow" />, which makes sense for keeping them out of search engine indexes given their content.
However, crawler4j does not follow the links on these pages either, even though I have disabled the RobotstxtServer in its configuration. This should be as simple as robotstxtConfig.setEnabled(false):
// Robots handling explicitly disabled
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);

RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...
But the pages in question are still not crawled. I have read the crawler4j source, and this should be enough to disable the robots directives, yet it does not work as expected. Am I missing something? I have tested versions 3.5 and 3.6-SNAPSHOT with identical results.
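
For completeness, here is a minimal, self-contained sketch of the kind of setup I am running. The seed URL, storage folder, user agent, and crawler class names are placeholders, and this version uses crawler4j's stock CrawlController rather than my own controller class:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class AvailabilityCheck {

    // Crawler that simply reports every URL it manages to visit.
    public static class StatusCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(WebURL url) {
            // Stay on the site being checked (host is a placeholder).
            return url.getURL().startsWith("http://www.example.com/");
        }

        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder folder

        PageFetcher pageFetcher = new PageFetcher(config);

        // Robots handling disabled, exactly as in the snippet above.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("availability-checker"); // placeholder agent
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/"); // placeholder seed
        controller.start(StatusCrawler.class, 1);      // single crawler thread
    }
}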