How Many Times Must a Tech Provider Download the Same Listings?

I received a call recently from an MLS administrator who wanted to talk about a RETS issue that had been bothering him. His MLS charges a small fee to subscribers for a RETS feed; the fee covers the costs related to the feed, including compliance audits. He was noticing that many of the RETS credentials that subscribers were paying for weren’t being used and thought this was a bit of a mystery. Should he disable the unused RETS credentials and stop charging the subscribers? That course of action would make sense if his subscribers truly no longer needed the data. But there was a more likely culprit behind most of his mystery.

Quite often a subscriber’s RETS feed isn’t associated just with that subscriber, but with a third-party vendor providing IDX, VOW, CMA, statistics, and/or broker back-office systems to multiple MLS subscribers. Let’s say the vendor has downloaded the IDX data on behalf of one broker. If the vendor has 19 more customers associated with that MLS, does it really make sense for the vendor to download and store the data 19 more times, using the additional 19 RETS credentials? That seems like a real waste of server, bandwidth, and storage resources. On the other hand, suppose the vendor re-uses one subscriber’s credentials. Further suppose that the MLS administrator turns off unused credentials, and that the subscriber whose credentials the vendor has been using to download data goes inactive, so the MLS disables his or her credentials. The flow of data to the other 19 websites will be cut off. That’s not good!

Besides the potential for data disruption, there are other reasons why an MLS administrator may not like re-use of credentials:

1.    Credential re-use takes authorization control out of the hands of the MLS. If the vendor doesn’t realize that a subscriber whose credentials they aren’t using has gone inactive, the vendor may accidentally continue servicing him or her with data obtained under another subscriber’s credentials.
2.    Similarly, re-use may defeat opt-outs for individual uses.
3.    The problem is even more complex if the vendor has multiple products. The vendor may download a superset of all the data it needs for a broker back-office product. Then, by re-using a subset of that data for an IDX site, the vendor may accidentally include fields, or listings in certain statuses, that would not normally be available in the IDX feed – using the data inappropriately, even if inadvertently.
4.    Credential re-use partially defeats data seeding, i.e., planting identifiable records in a feed so that the source of a data leak can be traced.

Having unpacked some of the issues, we can see there are actually two questions regarding RETS credential re-use that need to be considered:

1.     Is it okay to re-use data feed credentials for multiple parties with the same use?
2.     Is it okay to re-use data feed credentials for one or more parties with different uses?

So, re-stating the conundrum simply: it’s terribly inefficient for all parties when vendors download and store multiple copies of the data, one for each customer and credential, but there are valid reasons why MLSs have viewed the practice of credential re-use negatively. How do we solve this for everyone?

There are several possible ways to address the authorization control and opt-out issues, including (but surely not limited to) the following:

The vendor can log in using every MLS-provided credential at least once per day and use RETS login failures to determine which subscribers no longer have rights to the data. The vendor wouldn’t download data with each login – just with one of them – but the MLS would have a record that the vendor checked whether each login (and its associated use) was still active on the RETS server, and the vendor could then stop using data on behalf of any subscriber whose login failed.
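
To make the mechanics concrete, here is a minimal sketch of what such a daily check might look like, assuming the RETS server uses HTTP digest authentication. The login URL, headers, and credential list are placeholders, and a real client would also handle any additional authentication headers the MLS requires and inspect the RETS reply code in the response body:

```python
# Minimal sketch of a nightly job that checks which MLS-issued RETS credentials
# still authenticate. The login URL, headers, and credential list are placeholders;
# a production client would also inspect the RETS reply code in the response body.
import requests
from requests.auth import HTTPDigestAuth

LOGIN_URL = "https://rets.example-mls.com/rets/login"   # hypothetical Login transaction URL
HEADERS = {"User-Agent": "ExampleVendor/1.0", "RETS-Version": "RETS/1.7.2"}

CREDENTIALS = [
    ("subscriber001", "secret1"),
    ("subscriber002", "secret2"),
    # ... one entry per customer's MLS-issued credential
]

def failed_credentials(credentials):
    """Return the credentials that no longer log in (likely inactive subscribers)."""
    failed = []
    for username, password in credentials:
        resp = requests.get(LOGIN_URL, headers=HEADERS,
                            auth=HTTPDigestAuth(username, password), timeout=30)
        if resp.status_code != 200:
            failed.append(username)
    return failed

if __name__ == "__main__":
    for username in failed_credentials(CREDENTIALS):
        print(f"{username} failed to log in - stop servicing this subscriber's use")
```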

The vendor can be given a RETS login by the MLS that provides access to the roster, limited to a subscriber identifier and status (active or inactive). The vendor can use this to check whether it needs to stop re-using credentials on behalf of a specific customer.
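
For illustration, that roster check might look something like the following sketch, which uses the standard RETS Search transaction; the search URL, the roster resource and class names, and the field names are hypothetical and would vary from MLS to MLS:

```python
# Sketch of a roster-status check via the standard RETS Search transaction.
# The URL, resource/class names, and DMQL field names are hypothetical and vary by MLS.
import requests
from requests.auth import HTTPDigestAuth

SEARCH_URL = "https://rets.example-mls.com/rets/search"   # hypothetical Search transaction URL
AUTH = HTTPDigestAuth("vendor_roster_login", "secret")    # roster-only credential from the MLS
HEADERS = {"User-Agent": "ExampleVendor/1.0", "RETS-Version": "RETS/1.7.2"}

def subscriber_is_active(member_id: str) -> bool:
    """Return True if the roster shows this subscriber as active."""
    params = {
        "SearchType": "Member",                  # hypothetical roster resource name
        "Class": "Member",
        "QueryType": "DMQL2",
        "Query": f"(MemberID={member_id})",      # hypothetical field names
        "Select": "MemberID,MemberStatus",
        "Format": "COMPACT-DECODED",
        "Limit": "1",
    }
    resp = requests.get(SEARCH_URL, params=params, headers=HEADERS, auth=AUTH, timeout=30)
    resp.raise_for_status()
    # Crude check; a real client would parse the COMPACT response properly.
    return "Active" in resp.text

# Before refreshing data on behalf of a customer:
# if not subscriber_is_active("subscriber001"):
#     stop_servicing("subscriber001")   # hypothetical vendor-side routine
```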

The RETS standard and server functions could be designed so that a single login credential returns validation codes for all of the specific MLS users and uses authorized under it, and returns data based on that information. This would directly reflect the kind of master agreements and addenda that many MLSs already have with these vendors. If no active MLS users are associated with a vendor credential, that credential would not provide data access.

The inappropriate data use issue is a bit trickier. It can be mitigated today to some degree via very clear license agreements, vendors being careful to use only the data subsets specified by those agreements, and MLSs auditing the end uses of the data (i.e., the IDX websites and VOWs) – something they should be doing anyway. Additional mitigations may require enhancements to the RETS standard and to server-side functions. For example, additional usage opt-out information could be passed to vendors where relevant. A server-side function could also be created to efficiently determine whether several credentials would return different data for a given query – without the vendor having to download and compare the data on the client side. Knowing whether different credentials would yield different data would make it easier for a vendor to judge whether re-use is appropriate.
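
As a purely illustrative sketch of that idea (no such transaction exists in RETS today), a server could fingerprint the result set each credential would receive for a query and let the vendor compare fingerprints rather than payloads:

```python
# Purely illustrative: a hypothetical server-side helper that reports whether two
# credentials would receive identical data for the same query, so the vendor never
# has to download and diff the payloads. Nothing like this exists in RETS today;
# fetch_rows_for stands in for the server's own data-access layer.
import hashlib

def result_fingerprint(credential, query, fetch_rows_for):
    """Hash the fields and values this credential would be served for the query."""
    digest = hashlib.sha256()
    for row in fetch_rows_for(credential, query):   # rows assumed to be dicts, in a stable order
        digest.update(repr(sorted(row.items())).encode("utf-8"))
    return digest.hexdigest()

def same_data_for(cred_a, cred_b, query, fetch_rows_for):
    """True if both credentials would be served exactly the same result set."""
    return (result_fingerprint(cred_a, query, fetch_rows_for) ==
            result_fingerprint(cred_b, query, fetch_rows_for))
```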

I don’t think there’s a way to fully resolve the data seeding issue while allowing credential re-use, but tracking a leak down to who received the feed is still possible. Vendors just need to cooperate with any seeding investigation to help figure out which specific usage is involved. In any case, data seeding is useful in only a very limited subset of illegitimate-use detections.

There are more conversations to have on this subject, looking at additional business and legal issues as well as technical reflections of those issues, but this is a starting point. Let’s figure this out, so that RETS service can be efficiently provided to stakeholders while addressing legitimate issues that arise with that efficiency. What’s next? Let’s discuss these and other ideas for solving the issue here on this blog, on Facebook, and perhaps at the upcoming RESO meeting and see if some consensus can be reached among both vendors and MLSs. If changes to RETS are desired, this can be dealt with in RESO workgroups and implemented by vendors as need be.

I know many vendors that simply must engage in credential re-use so they don’t overwhelm MLS RETS servers and don’t needlessly increase the cost of servicing multiple customers – but they don’t like being in violation of some of their license agreements with regard to credential use. I’ve even had clients fine such vendors – and while this is in accord with the letter of some current license agreements, it’s really not fair. These are not “bad vendors.” By not defining our standards, processes, and legal agreements to reflect the technical reality of data aggregation and use, we’ve created this ugly issue together. But together we can solve it, and we should do so as quickly as possible.


In 2011 and 2012, Realtor.com was under the gun to solve its problem with screen scrapers – sites that were “scraping” data from Realtor.com and using it in unauthorized contexts. For those who haven’t been following the industry news coverage of screen scraping: scraping is when someone copies large amounts of data from a website, manually or with a script or program. There are two kinds of scraping: “legitimate scraping,” such as search engine robots that index the web, and “malicious scraping,” where someone engages in systematic theft of intellectual property in the form of data accessible on a website. Realtor.com spent hundreds of thousands of dollars to thwart malicious scraping and spoke about the screen-scraping challenge our industry faces at a variety of industry conferences during that period, starting with Clareity’s own MLS Executive Workshop. The takeaways from the Realtor.com presentations were as follows:

1.    The scrapers are moving from Realtor.com toward easier targets … to YOUR markets.
2.    The basic protections that used to work are no longer sufficient to protect against today’s sophisticated scrapers.
3.    It’s time to take some preventative steps at the local level – and at the national/regional portal and franchise levels.

Clareity Consulting had wanted to address the scraping problem for a long time, but before Realtor.com brought it up there hadn’t been much evidence that the issue was serious – or any evidence of demand for a solution. Late last year, Clareity Consulting surveyed MLS executives, many of whom had seen the Realtor.com presentation, and 93% expressed interest in a solution. Some industry leaders also stepped up with strong opinions advocating steps to stop content theft:

“It is not so much about protecting the data itself but protecting the copyright to the data. If you don’t enforce it, the copyright does not exist.”
– Russ Bergeron

“I am opposed to anybody taking, just independently, scraping data or removing data without permission…. We have spent millions of dollars and an exorbitant amount of effort to get that data on to our sites.”
– Don Lawby, Century 21 Canada CEO

The problem didn’t seem to be stopping – in 2012 (and still in 2013), people continued to advertise for freelancers to create NEW real estate screen-scrapers on sites like elance.com and freelancer.com. And we know that some scrapers are smart enough not to advertise their illegal activities. So Clareity began working to figure out the answer.

There were six main criteria on which Clareity evaluated the many solutions on the market. We needed to find a solution that:

1.    is sophisticated enough to stop today’s scrapers,
2.    scales both “up” to the biggest sites and “down” to the very smallest sites,
3.    is very inexpensive, especially for the smallest sites – if there is any hope of an MLS “mandate”,
4.    is easy to implement and provision for all websites,
5.    is incredibly reliable and high-performing, and
6.    is part of an industry-wide intelligence network.

Most of those criteria should be self-explanatory, with the exception of the last one. The idea of an “industry-wide intelligence network” is that once a scraper is identified by one website, that information is shared so the scraper can’t simply move on to the next website, forcing each site to spend additional time detecting and blocking it, and so on.
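
As a simple illustration of the concept (not a description of any particular vendor’s implementation), participating sites could report confirmed scrapers to a shared service and consult it before serving requests; the service URL and API below are entirely hypothetical:

```python
# Conceptual sketch of an industry-wide intelligence network: sites report confirmed
# scrapers to a shared service and check visitors against that shared list.
# The service URL and API are entirely hypothetical, not any vendor's actual product.
import requests

INTEL_URL = "https://intel.example-network.com/scrapers"   # hypothetical shared service

def report_scraper(client_ip: str, signature: str) -> None:
    """One site reports a confirmed scraper so every participating site can block it."""
    requests.post(INTEL_URL, json={"ip": client_ip, "signature": signature}, timeout=10)

def is_known_scraper(client_ip: str) -> bool:
    """Any participating site checks a visitor against the shared list before serving data."""
    resp = requests.get(f"{INTEL_URL}/{client_ip}", timeout=10)
    return resp.status_code == 200 and resp.json().get("blocked", False)
```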

Clareity evaluated many solutions. We looked at software solutions that can’t be integrated the same way into all sites; these wouldn’t work because the customization cost and effort would be untenable. We looked at hardware solutions that require rack space, installation, and different integration into different firewalls, servers, etc. – these won’t work either, at least for most website owners and hosting scenarios. We looked at tools some sites already had in place – software that did basic rate limiting and similar detections, as well as some “IDS” systems – but none could reliably detect today’s sophisticated scrapers or adapt to their evolution. The biggest problem we found was COST: we knew that for most website owners even TWO figures per month would be untenable, and all of the qualified solutions on the market ranged from three to five figures per month.

Finally, we had a long conversation with Rami Essaid, the CEO of Distil Networks. Distil Networks met many of our criteria. They are a U.S. company with a highly redundant U.S. infrastructure (think 15+ data centers and several different cloud providers), allowing for not only high reliability but also an improvement in website speed. What they provide is a “CDN” (Content Delivery Network), just like the ones most large sites on the Internet use to improve performance – but this one also monitors for scraping. We think of it as a “Content Protection Network,” or “CPN.” Implementation is as easy as re-pointing the domain name’s DNS to the service. They also have a “behind the firewall” server solution for the largest sites – more like what Realtor.com uses. Most importantly, once Clareity Consulting described the challenge and opportunity, they worked to tailor both a solution and pricing to our industry’s unique challenge. If adopted, this custom solution will let Clareity monitor industry trends and help the industry take action against the worst bad actors.

Some MLSs have already successfully completed a “beta” and seen the benefits – both blocking scraper “bots” from their websites and gaining performance – and more than a dozen other MLSs have started free trials and will be considering the best way to have all subscribers enroll their websites as a reasonable step toward protecting the content.

If organized real estate actually organizes around this solution, allowing us to collect the data to stop the scrapers and go after the worst offenders, we will be able to get our arms around this problem once and for all.

For more information:
http://realestate.distilnetworks.com

UPDATE: Distil Networks has become a client of Clareity Consulting. Clareity is assisting Distil Networks in reaching out to the real estate industry with its solution for our industry’s critical problem.

UPDATE 2: Clareity's project for Distil has been completed – we hope we have provided them sufficient guidance to make an impact on our industry's security issue.
