dw_dev | crawling comments on a community?

Hi!

I'm trying to crawl and parse comments on a community for a fandom event (http://hs-worldcup.dreamwidth.org , if you're curious). I've run into a bunch of issues and lack of API documentation, and talked to DW Support a couple of times, and feel like I am further away from successfully doing anything than when I started. Before I say anything else, here is What I Am Really Trying To Do:

take an individual community post (example: http://hs-worldcup.dreamwidth.org/3493.html#comments)
download all of the comments with some sort of threading information --- the data I need in particular is comment subject, comment author, comment content, whether or not it's a reply and if so to what
parse out that data and do transformations to it and add it to a database (which is not super relevant to this question I don't think but I can go into more detail if necessary)

I looked into the API for getting comments which led me in a roundabout way to www.livejournal.com/developer/exporting.bml . I'm probably missing something obvious here, but I don't actually see how this tells me how to make an API call? It gives me a GET request, but not what to send the GET request to? Also, support told me the only action DW supports here is "Get all comments for a community," not "Get all comments for a page," and I should probably just crawl the pages. Is that what other folks have done when doing this?

If that is what I should do, how do I get around the adult content warning? Is there a flag I can pass with the URL or something? Do I need to do something more complicated than just using curl to grab the pages? Is there something I can pass to say "just give me one piece of HTML with all 5000 comments on it it will be easier for both of us probably?"

Thank you for any suggestions or advice you might have.

Flat | Top-Level Comments Only

There is no API for "give me comments from this entry", there is just "give me all comments on X account" and that is the one you linked to. The docs are on LJ's web site, our API is just the same. You send a GET request to dreamwidth.org/export_comments and you can get back XML data of comments.

This API is a very proto-REST style API except it uses XML instead of JSON. You just send HTTP requests like a browser (with the same cookie authentication). There is a reference implementation available:

https://github.com/markpasc/jbackup/blob/master/jbackup.pl

You could also look at the Dreamwidth implementation of the content importer:

https://github.com/dreamwidth/dw-free/blob/develop/cgi-bin/DW/Worker/ContentImporter/LiveJournal/Comments.pm

I hope this is helpful!

Thank you for providing me that URL --- that's got me on the right track. Serious question: How was I supposed to figure that URL out from the documentation? Would it be obvious if I thought like a programmer? Because it was not. :)

The code doesn't make sense to me at a glance but if I run into problems I appreciate having it to go and pore through. Thanks again!

I wrote those docs originally, and they're pretty much unmodified in the 10 years since I wrote them. Heh.

The docs were written to target existing API developers who had worked with the flat/XML-RPC API that pre-existed, so it assumed you knew which domain to use etc. It's definitely not clear. I think most people end up using the reference implementation.

Actually, when I moved from testing as a user (which worked fine) to testing as a community, it seems like this API just doesn't let you get comments for a community at all. Is that true? :(

It should work fine for communities. You have to be an admin of the community though; you can only export content that you control. Basically, if you can delete it you can export it.

Oh! You have to use &auth_as=rax_test_comm but be authenticated elsewhere in your session as rax. That's... is there anywhere someone is working on re-documenting this? Still, that helps a lot, thank you!

Check recent entries in this comm -- we're working on deprecating the existing API and replacing it with one that was, uh, conceived in the 21st century.

I saw that. It looks awesome, but it's general enough that I have no idea how I would contribute other than to gesticulate wildly and say "the current state is really confusing and bad, and there is sure a lot of text about this new thing here, yup!" What I could do (and will if I have time) after figuring all of this belunkus out is write instructions on how to use what is currently there based on having wrestled with it.

Well, my point is that with a bit of luck, the existing APIs will be deprecated within the next year or so and people who want to do things like what you're doing will be able to use the new API. So, while documentation is always a good thing, documentation of the XML-RPC API might be a bit of wasted effort!

Which is not to say "don't do it", just that it may be made obsolete fairly quickly.

I'm happy for potential-future-me to have a potential-future-solution, but I want to make sure current-me has something they can use, too. :)

I'm not aware of anybody actively working on documenting this better/at all. It's relatively little used; admittedly, possibly because it's mostly undocumented.

We're always happy to have people work on the docs. We've generally been using the wiki for documentation like this, and if you're interested in helping, I can point you places (or rather, I can ask some awesome folks to step in and help point you, since they are awesome and know the current taxonomy etc).

Thank you for answering my question at http://dw-dev-training.dreamwidth.org/65408.html . I have questions about http://www.dreamwidth.org/export_comments?get=comment_body&startid=0 . Is there any way to get poster's name from posterid that is returned by export_comments? How do I determine which post the following comment belongs to?

<livejournal>
    <comments>
        <comment id="1" jitemid="0" posterid="123456" postid???>
            <body>Some Text.</body>
            <date>2004-05-21T22:29:57Z</date>
        </comment>
        ...
    </comments>
</livejournal>

If there's no way to get poster's name and post id for a given comment, is there a way to programmatically download a post page with all comments on it as if I'm logged in to DW? Suppose this is a private post: https://ari-linn.dreamwidth.org/535040.html?format=light&expand_all=1 If I were to download it, how do I authenticate?

GET https://ari-linn.dreamwidth.org/535040.html?format=light&expand_all=1 HTTP/1.1
Host: ari-linn.dreamwidth.org
Authorization: Basic base64_encode("user:password") ??? or do I somehow get a cookie from DW?

Showing up very late to the party... I took a look at DW's importer code for LJ comments (ctrl+F do_authed_comment_fetch) and discovered the same endpoint will give you a map of user id's to usernames, as well as deleted/screened/frozen status for each comment. Just replace 'get=comment_body' with 'get=comment_meta'.

Don't know if that's going to be of any use to you 2 years after the fact. But I figure if anyone else arrives here by googling stuff about Dreamwidth comment export, it'll be a handy tip. :)

I made a Python script once if that's any help:

https://bitbucket.org/foxfirefey/dwump/src

That could save me a lot of trouble if it works the way it looks like it works. Thank you!

Ah, unfortunately it looks like this can only do users and not communities. I'll look through it and see if there's anything I can work with, though, thank you!

Just FYI on alternative solutions - the way I've seen people approach similar things (but not quite the same) is the have a sock / bot / mod account that has the entries tracked and then have the processing occur on the email for comments that go to the bot's posts / reply to comment.

An example of that set up is this writing motivation comm: http://verbosity.dreamwidth.org/profile where graphs are automatically posted for their word goal.

And in case it's handy here is a script snakeling set up for having a web server automatically post to dreamwidth bookmarks from pinboard https://github.com/snakeling/PinToDW/blob/master/config.php

(And I've used gmail delay send to have comments / posts go out at specific times as practice for at across multiple time zones bingo sort of game.)

So that might be an alternate path to go down.

Edited (correcting autocorrect) 2014-05-19 23:11 (UTC)

crawling comments on a community?

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject