Rax E. Dillon ([personal profile] rax) wrote in [site community profile] dw_dev, 2014-05-19 08:32 am

crawling comments on a community?

Hi!

I'm trying to crawl and parse comments on a community for a fandom event (http://hs-worldcup.dreamwidth.org , if you're curious). I've run into a bunch of issues and lack of API documentation, and talked to DW Support a couple of times, and feel like I am further away from successfully doing anything than when I started. Before I say anything else, here is What I Am Really Trying To Do:
  • take an individual community post (example: http://hs-worldcup.dreamwidth.org/3493.html#comments) 
  • download all of the comments with some sort of threading information --- the data I need in particular is comment subject, comment author, comment content, whether or not it's a reply and if so to what
  • parse out that data and do transformations to it and add it to a database (which is not super relevant to this question I don't think but I can go into more detail if necessary)
I looked into the API for getting comments which led me in a roundabout way to www.livejournal.com/developer/exporting.bml . I'm probably missing something obvious here, but I don't actually see how this tells me how to make an API call? It gives me a GET request, but not what to send the GET request to? Also, support told me the only action DW supports here is "Get all comments for a community," not "Get all comments for a page," and I should probably just crawl the pages. Is that what other folks have done when doing this?

If that is what I should do, how do I get around the adult content warning? Is there a flag I can pass with the URL or something? Do I need to do something more complicated than just using curl to grab the pages? Is there something I can pass to say "just give me one piece of HTML with all 5000 comments on it it will be easier for both of us probably?"
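For concreteness, a page-crawl along these lines might look like the Python sketch below. The `?view=flat&page=N` parameters are borrowed from LiveJournal-style URL conventions and are an assumption, not confirmed Dreamwidth behavior; likewise, the sketch assumes the adult-content interstitial can be passed by carrying session cookies between requests.

```python
# Sketch: crawl an entry's comment pages in flat view.
# ASSUMPTIONS: "?view=flat&page=N" paginates comments (a LiveJournal-style
# convention, unverified for Dreamwidth), and the adult-content interstitial
# can be satisfied by a logged-in session cookie carried across requests.
import http.cookiejar
import urllib.request

def flat_page_url(entry_url, page):
    """Build the URL for one flat-view comment page of an entry."""
    return f"{entry_url}?view=flat&page={page}"

def make_opener(session_cookie=None):
    """Opener with a cookie jar so any interstitial/login cookies persist."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    if session_cookie:  # e.g. a logged-in session cookie string (assumption)
        opener.addheaders = [("Cookie", session_cookie)]
    return opener

if __name__ == "__main__":
    entry = "http://hs-worldcup.dreamwidth.org/3493.html"
    opener = make_opener()
    for page in range(1, 3):
        url = flat_page_url(entry, page)
        print(url)
        # html = opener.open(url).read()  # actual fetch (network) left commented out
```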

Thank you for any suggestions or advice you might have.


[staff profile] mark 2014-05-19 04:37 pm (UTC)
There is no API for "give me comments from this entry", there is just "give me all comments on X account" and that is the one you linked to. The docs are on LJ's web site, our API is just the same. You send a GET request to dreamwidth.org/export_comments and you can get back XML data of comments.

This API is a very proto-REST style API except it uses XML instead of JSON. You just send HTTP requests like a browser (with the same cookie authentication). There is a reference implementation available:

https://github.com/markpasc/jbackup/blob/master/jbackup.pl
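In rough Python terms, the flow described above (send a GET to the export endpoint, get XML back, pull out the threading fields) might look like the sketch below. The query parameters and the exact XML element/attribute names follow LiveJournal's export_comments documentation and should be treated as assumptions until checked against a real Dreamwidth response.

```python
# Sketch of the export_comments flow: fetch XML, extract the fields the OP
# needs (id, parent, poster, subject, body). Element and attribute names are
# assumed from LiveJournal's export docs, not verified against Dreamwidth.
import xml.etree.ElementTree as ET

# Assumed endpoint path; authentication (session cookie) omitted here.
EXPORT_URL = "https://www.dreamwidth.org/export_comments.bml?get=comment_body&startid=0"

def parse_comments(xml_text):
    """Extract id/parentid/posterid/subject/body from an export response."""
    root = ET.fromstring(xml_text)
    out = []
    for c in root.iter("comment"):
        out.append({
            "id": int(c.get("id")),
            "parentid": int(c.get("parentid", "0")),  # 0 = top-level comment
            "posterid": int(c.get("posterid", "0")),
            "subject": c.findtext("subject", ""),
            "body": c.findtext("body", ""),
        })
    return out

# Hand-written sample in the assumed shape, just to exercise the parser.
sample = """<livejournal><comments>
  <comment id='1' posterid='7'><subject>hi</subject><body>first!</body></comment>
  <comment id='2' parentid='1' posterid='9'><body>a reply</body></comment>
</comments></livejournal>"""
print(parse_comments(sample))
```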

You could also look at the Dreamwidth implementation of the content importer:

https://github.com/dreamwidth/dw-free/blob/develop/cgi-bin/DW/Worker/ContentImporter/LiveJournal/Comments.pm

I hope this is helpful!

[personal profile] foxfirefey 2014-05-19 05:05 pm (UTC)
I made a Python script once, if that's any help:

https://bitbucket.org/foxfirefey/dwump/src

[personal profile] qem_chibati 2014-05-19 09:23 pm (UTC)
Just FYI on alternative solutions - the way I've seen people approach similar things (but not quite the same) is to have a sock / bot / mod account that tracks the entries, and then do the processing on the notification emails for comments that go to the bot's posts / replies to its comments.

An example of that setup is this writing motivation comm: http://verbosity.dreamwidth.org/profile where graphs are automatically posted for their word goal.
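The email-driven approach could be prototyped with Python's stdlib email module, along these lines. The header values and body layout in the sample are invented placeholders for illustration; a real Dreamwidth comment notification would need to be inspected to get the actual format.

```python
# Sketch: pull sender, subject, and body text out of one notification email.
# ASSUMPTION: the sample's headers and body layout are placeholders, not
# Dreamwidth's actual notification format.
from email import message_from_string

def handle_notification(raw_email):
    """Parse one notification message into a dict of the fields we care about."""
    msg = message_from_string(raw_email)
    return {
        "from": msg["From"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }

sample = """\
From: dw_null@dreamwidth.org
Subject: Reply to your entry

someuser replied to your Dreamwidth entry:
word count: 1234
"""
print(handle_notification(sample))
```

A bot account would poll its inbox (e.g. over IMAP), run each message through a parser like this, and post the resulting stats back to the community.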

And in case it's handy, here is a script snakeling set up to have a web server automatically post bookmarks from Pinboard to Dreamwidth: https://github.com/snakeling/PinToDW/blob/master/config.php

(And I've used gmail delay send to have comments / posts go out at specific times, as practice for an across-multiple-time-zones bingo sort of game.)

So that might be an alternate path to go down.
Edited (correcting autocorrect) 2014-05-19 23:11 (UTC)