Rax E. Dillon ([personal profile] rax) wrote in [site community profile] dw_dev, 2014-05-19 08:32 am

crawling comments on a community?

Hi!

I'm trying to crawl and parse comments on a community for a fandom event (http://hs-worldcup.dreamwidth.org , if you're curious). I've run into a bunch of issues and a lack of API documentation, talked to DW Support a couple of times, and I feel like I'm further away from successfully doing anything than when I started. Before I say anything else, here is What I Am Really Trying To Do:
  • take an individual community post (example: http://hs-worldcup.dreamwidth.org/3493.html#comments) 
  • download all of the comments with some sort of threading information --- the data I need in particular is comment subject, comment author, comment content, and whether or not it's a reply (and if so, to what)
  • parse out that data, do transformations on it, and add it to a database (not super relevant to this question, I don't think, but I can go into more detail if necessary; a rough sketch of the sort of table I mean is below)
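For concreteness, here's the sort of table I mean --- nothing Dreamwidth-specific, just where the parsed fields would land (Python/SQLite only because that's what I reach for):

import sqlite3

# Illustrative schema only: one row per comment, with parent_id carrying the
# threading information ("is it a reply, and if so to what").
conn = sqlite3.connect("comments.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS comments (
        comment_id INTEGER PRIMARY KEY,
        post_url   TEXT,     -- which community post the comment came from
        subject    TEXT,
        author     TEXT,
        body       TEXT,
        parent_id  INTEGER   -- NULL for top-level comments
    )
""")
conn.commit()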
I looked into the API for getting comments, which led me in a roundabout way to www.livejournal.com/developer/exporting.bml . I'm probably missing something obvious here, but I don't actually see how this tells me how to make an API call? It gives me a GET request, but not what to send the GET request to? Also, support told me the only action DW supports here is "Get all comments for a community," not "Get all comments for a page," and that I should probably just crawl the pages. Is that what other folks have done when doing this?
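(For reference, here's my best guess at the shape of that request, using the GET parameters from the exporting doc pointed at Dreamwidth --- the parts I can't figure out are what authentication it wants and whether it can be aimed at a community at all:)

import requests

# Best guess at the call described in exporting.bml, aimed at Dreamwidth.
# The missing pieces for me: what authentication this needs, and how (or
# whether) it can be pointed at a community instead of my own journal.
resp = requests.get(
    "https://www.dreamwidth.org/export_comments",
    params={"get": "comment_meta", "startid": 0},
)
print(resp.status_code)
print(resp.text[:500])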

If that is what I should do, how do I get around the adult content warning? Is there a flag I can pass with the URL or something? Do I need to do something more complicated than just using curl to grab the pages? Is there something I can pass to say "just give me one piece of HTML with all 5000 comments on it; it will be easier for both of us, probably"?
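(Right now I'm doing the equivalent of plain curl with no cookies, which is presumably where the warning comes from; my guess at the fix is to send the session cookie of a logged-in account that has adult content enabled, but the cookie name below is a guess too:)

import requests

POST_URL = "http://hs-worldcup.dreamwidth.org/3493.html"

# The naive fetch: no cookies, and this is where I hit the adult content warning.
page = requests.get(POST_URL)

# My guess at the workaround: replay the session cookie of a logged-in account
# that has adult content enabled. "ljsession" as the cookie name is a guess ---
# check what your browser actually sends when you're logged in.
SESSION_COOKIE = "value copied from a logged-in browser session"
page = requests.get(POST_URL, cookies={"ljsession": SESSION_COOKIE})
print(page.status_code, len(page.text))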

Thank you for any suggestions or advice you might have.


[personal profile] ari_linn 2016-12-24 07:16 am (UTC)(link)
Thank you for answering my question at http://dw-dev-training.dreamwidth.org/65408.html . I have questions about http://www.dreamwidth.org/export_comments?get=comment_body&startid=0 . Is there any way to get the poster's name from the posterid that export_comments returns? And how do I determine which post the following comment belongs to?

<livejournal>
    <comments>
        <comment id="1" jitemid="0" posterid="123456" postid???>
            <body>Some Text.</body>
            <date>2004-05-21T22:29:57Z</date>
        </comment>
        ...
    </comments>
</livejournal>
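
Here's roughly how I'm reading that response so far (saved to comment_body.xml) --- jitemid is right there in the attributes, I just don't know what it corresponds to:

import xml.etree.ElementTree as ET

# Parse the saved comment_body response and dump the fields I can see.
root = ET.fromstring(open("comment_body.xml", encoding="utf-8").read())
for c in root.iter("comment"):
    print(
        c.get("id"),
        c.get("posterid"),
        c.get("jitemid"),   # present, but I don't know how it maps to a post
        c.findtext("date"),
        (c.findtext("body") or "")[:40],
    )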

If there's no way to get the poster's name and post id for a given comment, is there a way to programmatically download a post page with all comments on it, as if I'm logged in to DW? Suppose this is a private post: https://ari-linn.dreamwidth.org/535040.html?format=light&expand_all=1 --- if I were to download it, how would I authenticate?

GET https://ari-linn.dreamwidth.org/535040.html?format=light&expand_all=1 HTTP/1.1
Host: ari-linn.dreamwidth.org
Authorization: Basic base64_encode("user:password") ??? or do I somehow get a cookie from DW?
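
(Concretely, this is the cookie version I would try if Basic auth isn't the right mechanism --- the cookie name is a guess, and I'd copy the value out of a browser where I'm logged in:)

import requests

# Guess: authenticate by replaying a browser session cookie rather than Basic auth.
# "ljsession" as the cookie name is a guess; the value would be copied from a
# logged-in browser, and a short or login-looking response would mean it failed.
SESSION_COOKIE = "value copied from a logged-in browser"

resp = requests.get(
    "https://ari-linn.dreamwidth.org/535040.html",
    params={"format": "light", "expand_all": 1},
    cookies={"ljsession": SESSION_COOKIE},
)
print(resp.status_code, len(resp.text))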

[personal profile] discurse 2018-08-15 02:32 am (UTC)(link)
Showing up very late to the party... I took a look at DW's importer code for LJ comments (ctrl+F do_authed_comment_fetch) and discovered that the same endpoint will give you a map of user ids to usernames, as well as the deleted/screened/frozen status for each comment. Just replace 'get=comment_body' with 'get=comment_meta'.
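
Roughly the sort of thing I mean (the element names below are from memory of the importer code and the old LJ export docs, so double-check them against a real response):

import xml.etree.ElementTree as ET
import requests

# Same endpoint as comment_body, just with get=comment_meta. Assumes you already
# have an authenticated session; "ljsession" as the cookie name is a guess at the
# LJ-style session cookie --- verify what your browser sends when logged in.
resp = requests.get(
    "https://www.dreamwidth.org/export_comments",
    params={"get": "comment_meta", "startid": 0},
    cookies={"ljsession": "your session cookie here"},
)
root = ET.fromstring(resp.content)

# usermap entries give you posterid -> username
usernames = {u.get("id"): u.get("user") for u in root.iter("usermap")}

# each comment row also carries a state attribute (deleted/screened/frozen)
for c in root.iter("comment"):
    print(c.get("id"), usernames.get(c.get("posterid")), c.get("state"))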

Don't know if that's going to be of any use to you 2 years after the fact. But I figure if anyone else arrives here by googling stuff about Dreamwidth comment export, it'll be a handy tip. :)