dw_dev | Named Markup Formats

Named Markup Formats

I've been working on a thing, and I could use some feedback on the implementation. It might take a little explaining (because there's a fair amount of backstory), but I'll try and be as concise as possible.

Backstory: Raw and transformed text

DW stores the text of entries and comments raw, exactly as the user entered it. Then, whenever we need to display that text, we transform it to produce nice legible HTML. Those transformations include:

Turning <user> tags (which aren't real HTML) into a rendered user link, which is really a span plus an image plus a link.
Handling <cut> tags.
Several other things, etc., not important right now.

Most of those get applied to everything we display. But there are also some OPTIONAL transformations:

Turning normal line breaks into HTML <br> tags.
Turning bare URLs into clickable links.

Those get applied by default, but we've always had a "don't autoformat" checkbox (inherited from LJ) that could be used to disable them for an entry or comment. (BTW, under the hood the RTE saves entries as "don't autoformat.")

Then, later, DW added some other optional transforms, which had their own special enabling conditions:

Turning Markdown into HTML. (For entries that start with a special !markdown glyph, or comments submitted by email.)
Turning @mentions into user tags. (Currently applies to everything except "don't autoformat," but gets suppressed within certain HTML elements or their Markdown equivalents.)

ALL of these transforms get handled by something called the "html cleaner," at LJ::CleanHTML. At this point "cleaner" is kind of a misnomer; in actual fact, it's the central place where we handle all transformations of raw user-entered text into a fragment of display HTML.
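
To make the "optional transforms" idea concrete, here is a minimal Python sketch of the two classic optional steps hanging off a single "don't autoformat" switch. It is not the actual LJ::CleanHTML code; the function names and the URL pattern are assumptions made up for illustration, and the real cleaner handles far more cases.

```python
import re

# Assumed pattern: a bare http(s) URL that is not already inside an attribute value.
URL_RE = re.compile(r"(?<![\"'>=])\bhttps?://[^\s<>\"']+")


def autolink(text):
    """Wrap bare URLs in anchor tags."""
    return URL_RE.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)


def linebreaks_to_br(text):
    """Turn every newline into a <br> tag followed by the newline."""
    return text.replace("\n", "<br>\n")


def render(raw, autoformat=True):
    """Apply the optional transforms unless the post is 'don't autoformat'."""
    out = raw
    if autoformat:
        out = autolink(out)
        out = linebreaks_to_br(out)
    return out


print(render("see https://example.com\nbye"))
print(render("see https://example.com\nbye", autoformat=False))
```

The point of the sketch is only that both optional steps are gated by one boolean, which is roughly the situation before Markdown and @mentions arrived with their own enabling conditions.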

The problems

In my understanding, the current state of affairs has two main problems:

The interface for choosing text transforms is incoherent. That happened gradually; we've added new transformations over time, and changed the interactions between them, and now it's weird:
- Half of the interfaces for entering entry/comment text don't even include the "don't autoformat" checkbox anymore.
- The way you enable Markdown has always been a mystery. For example, I want to use Markdown in comments (because typing html angle brackets on a telephone is bullshit), but currently it's impossible except when responding via email.
Introducing new text transforms is dangerous and chaotic. In mid-2019, we enabled @mentions for HTML-formatted content (previously they only worked in Markdown content), and about 40% of hell broke loose:
- Current content suffered because we didn't have a good way to beta-test @mentions, so we didn't have a chance to learn about bugs and edge cases from our users (who are more ingenious at doing weird textual shit than we are) before enabling them globally.
- Old posts weren't written to expect @mentions, so we ended up totally vandalizing any historical post that ever discussed CSS code (@media (min-width: etc...)), Perl or Ruby or Objective-C code, or a wide variety of other things that involve @ signs.
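
The second failure mode is easy to reproduce with a toy transform. The sketch below is a deliberately naive Python version of an @mention rewrite, not the real implementation; the regex and the `<user name="...">` output form are assumptions for illustration.

```python
import re

# Naive assumption: any "@word" is a mention.
MENTION_RE = re.compile(r"@([A-Za-z0-9_]+)")


def naive_mentions(text):
    """Rewrite every @word as a <user> tag, with no awareness of context."""
    return MENTION_RE.sub(r'<user name="\1">', text)


# An old post discussing CSS was never written with @mentions in mind.
old_post = "@media (min-width: 600px)"
print(naive_mentions(old_post))
```

Applied retroactively, a rule like this turns `@media` into a link to a user named "media", which is the kind of damage old posts suffered.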

Questions: Does anyone disagree with those two problems or my characterization of them? Does anyone see any closely related problems that I'm not recognizing here?

The solution, maybe

I've got a pull request up for discussion right now that tries to solve these immediate problems, as well as several other problems I can see coming down the road at us.

In brief (lol, sorry, I'm trying!!), here's what I'm proposing we do:

Instead of a set of disconnected text transform options that can be independently twiddled, we codify a set of named markup formats. Each entry or comment will specify exactly one format.
- To start with, this will include all of the valid combinations of the current formatting options, which ends up being fewer than you might think:
 - markdown0 - Markdown via the Text::Markdown module, plus our own @mentions.
 - html_casual1 - Classic HTML-but-it-respects-your-linebreaks, plus @mentions.
 - html_casual0 - Like casual 1, except without @mentions. Old content was written to expect this, but we don't respect that anymore. Imported and syndicated content still uses this, though.
 - html_raw0 - "Don't autoformat." No @mentions.
 - html_extra_raw - Syndicated content that we know will never use special DW tags like <user> or <cut>.
- (For existing posts that don't specify a format, we use the existing independent options to guess a format. We also take the post date into account, so we can restore casual 0 on posts that were originally written in it.)
When formatting text for display, the specified format determines all of the transforms we use.
- No more independent twiddlies. The format captures the entirety of the user's intent for how their markup should be handled.
Formats are versioned.
- If we want to introduce new formatting behavior (like @mentions) or make changes that might have non-linear consequences (like switching from Text::Markdown to a different Markdown processor), we have to implement it as a new format (e.g. markdown1).
- Old versions stick around in the HTML cleaner FOR ALL ETERNITY, so that we can continue to display content as it was originally intended.
- This doesn't preclude making small changes for safety or consistency -- for example, putting a user tag inside a link always used to generate illegal link-within-a-link markup that we'd just ignore and let your browser sort out, but I recently made it so we strip the inner link. But anything that changes what a user would expect to get from their markup should cut a new format.
When posting or editing a comment or entry in the web UI, the form includes a drop-down for choosing which format you want to use.
- (With a descriptive name, not weird IDs like html_casual1.)
- We only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be markdown0, html_casual1, and html_raw0.
  - The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change.
- ~~We remember the last format you chose when making a new post, and use that as your default.~~
- (new detail) If you deliberately change the format to something other than your current default, we give you the option to set that as your new default after saving your post/edit.
- If we're introducing a new format, we might set it as active but mark it as beta, and keep the format it replaces for a while. When posting with a beta format, you're accepting the possibility that the markup behavior might retroactively change for that post, until the beta ends.
The selected format in a comment/entry form might change other behavior of the form.
- Specifically, I'm thinking of markup helper buttons for the most common text styles, which is another back-burner project I'm working on -- the bold/italic/link buttons would add HTML tags for the HTML formats, and Markdown styles for Markdown.
- Also thinking of the future replacement RTE here -- RTE would get its own named format (even though practically speaking it will probably emit something that looks like html_raw0), and switching your format to "Rich" would initialize the editor and its controls. Switching your format to something non-rich would serialize the editor's buffer as HTML and take you back to a plain textarea; we wouldn't need a separate control for switching to the RTE beyond the normal format drop-down that everything else uses.
- ...I guess we could also add syntax highlighting or something for HTML and Markdown, but I don't think anyone's planning on it and I doubt anyone's desperate for it.
(new detail) When posting via email, we'll add a new post header for formatting. You can specify either a format ID (like html_casual0), or a shortcut ID to choose the most recent versions of casual HTML or Markdown. If you omit the header, it posts with the current global default for that type of content, ignoring your user default.
- Why: Email is sort of halfway between a UI and an API. People who are automating email posts want to choose a stable and predictable format (so use IDs), but people who are writing in their real mail client probably want their posts to act like the version of the web form they're accustomed to. We'll ignore user defaults because your email client doesn't give you any feedback about what's going to happen, so we don't want it to depend on hidden settings state.
(new detail) When posting via the new API, you can specify a format ID or fall back to a global default. User default is ignored.
(new detail) Email and API posts aren't limited to the currently active formats; they can specify any format they want, including obsolete ones. This should help us avoid breaking automation that people build.
(new detail) When posting via the old XMLRPC API (so, old LJ clients like Semagic), you cannot select a named format. You'll be limited to today's formatting options (the "don't autoformat" checkbox if supported by the client, and the !markdown secret glyph), which will forever behave exactly like they do in mid 2020 (resulting in either markdown0, html_casual1, or html_raw0).
- Why: Old client programs want things to keep working the way they've always worked... so that's what they'll get. And we don't want to add more secret glyphs, because they make things harder to maintain and understand.
(new detail) If a browser extension happened to mangle the web forms to re-enable obsolete formats, we wouldn't care. From our perspective, that's the same thing as an API client posting with a weird format, which is fine. (Removing obsolete formats from the menu is just about keeping things simple for most users.)
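
As a rough illustration of the list above, here is a minimal Python sketch of a format registry. The format IDs and which transforms each one enables come from the list; the class, the field names, the labels, and `menu_formats` are hypothetical and are not taken from the pull request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Format:
    id: str
    label: str                # descriptive name for the drop-down (made up here)
    markdown: bool = False    # run the Markdown processor
    linebreaks: bool = False  # respect line breaks ("casual" HTML)
    mentions: bool = False    # turn @mentions into user tags
    dw_tags: bool = True      # handle special tags like <user> and <cut>
    active: bool = False      # offered in the web UI for new content


FORMATS = {f.id: f for f in [
    Format("markdown0", "Markdown", markdown=True, mentions=True, active=True),
    Format("html_casual1", "Casual HTML", linebreaks=True, mentions=True, active=True),
    Format("html_casual0", "Casual HTML (legacy)", linebreaks=True),
    Format("html_raw0", "Raw HTML", active=True),
    Format("html_extra_raw", "Raw HTML (syndicated)", dw_tags=False),
]}


def menu_formats(existing_format=None):
    """Active formats, plus the post's own format when editing legacy content."""
    ids = [f.id for f in FORMATS.values() if f.active]
    if existing_format in FORMATS and existing_format not in ids:
        ids.append(existing_format)
    return ids


print(menu_formats())                # new post
print(menu_formats("html_casual0"))  # editing an old post
```

Because each format is an immutable record, changing behavior means adding a new entry (say, a `markdown1`) rather than editing an old one, which is the versioning rule in miniature.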

I THINK this will free us up to be much more nimble about modernizing the way we process text, and will increase user choice while ALSO making the site less confusing to use, which is a rare combo.
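
The per-channel rules (web form, email, new API, old XMLRPC) might be sketched like this in Python. Everything here is an assumption for illustration: the function names, the shortcut names, the single `GLOBAL_DEFAULT` (the proposal actually describes a default per type of content), the choice to reject unknown IDs with an error, and the choice to let the !markdown glyph win over "don't autoformat".

```python
KNOWN_FORMATS = {"markdown0", "html_casual1", "html_casual0", "html_raw0", "html_extra_raw"}
GLOBAL_DEFAULT = "html_casual1"  # assumed site-wide default
EMAIL_SHORTCUTS = {"html": "html_casual1", "markdown": "markdown0"}  # hypothetical names


def resolve_web(chosen, user_default):
    """Web form: the drop-down wins, then the user's saved default."""
    return chosen or user_default or GLOBAL_DEFAULT


def resolve_api(format_id=None):
    """New API: any known format ID (even an obsolete one), else the global default."""
    if format_id is None:
        return GLOBAL_DEFAULT
    if format_id not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {format_id}")
    return format_id


def resolve_email(header=None):
    """Email: like the API, but the header may also be a shortcut ID."""
    return resolve_api(EMAIL_SHORTCUTS.get(header, header))


def resolve_xmlrpc(text, preformatted=False):
    """Old clients: only today's options, frozen at their mid-2020 meaning."""
    if text.startswith("!markdown"):
        return "markdown0"
    return "html_raw0" if preformatted else "html_casual1"


print(resolve_email())                # no header
print(resolve_email("markdown"))      # shortcut
print(resolve_email("html_casual0"))  # obsolete but still allowed
```

Note that the user default only appears in `resolve_web`; the email and API paths never read it, so their results depend only on what the request itself says.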

Questions: Does the approach make sense? Can you think of anything it would sabotage or prevent us from doing in the future? Can you see anything excessively complicated in the code itself? (By which I mean, can we make this simpler while still solving those two problems.)

Non-linear consequences

Here are the things I can think of that might be affected in weird ways by this change:

External clients and integrations. We'll be expecting new entries and comments to arrive with a format specified (using the new-ish editor property), so anything that posts content without that will be locked into the "guessed" format that fits their other options. (that's raw 0 if they set "don't autoformat," and casual 1 if they didn't.) Over time, that'll drift more and more out of date; when we add some new @mention-like thing that forces us to cut a casual 2, we're going to keep guessing casual 1 for posts with no metadata, because all of OUR stuff properly sets the format type.
- TBH, I think this is 100% fine. Feel free to make a case for otherwise, tho.
All the documentation/FAQ pages about formatting stuff will be immediately out of date.
- On it, don't worry.
Once we finally make Markdown discoverable, people might start requesting we add other oddball lightweight markup formats, like textile or RST.
- I think the answer to that is basically "no?" It should be "no."
Another thing that might happen is that we start getting pressure to add some of the niceties of more modern Markdown implementations, like fenced code blocks and ascii-art tables.
- Well, at least we'd be able to do that safely, by cutting a markdown1 and leaving existing content on markdown0. It's just a question of what we actually want our Markdown to act like, and THAT is going to be an exciting conversation.
Doubling down on the importance and centrality of the HTML cleaner, which is a highly complex thicket of code that not a lot of people feel comfortable working with.
- I'm not aware of any plans to replace the cleaner, but this would definitely require them to be rethought. That said, we're always going to need SOME central thing that governs text transformations, and we could move the implementation to a different spot in the code if need be.
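
For the "guessed" format mentioned in the first item, a minimal Python sketch might look like the following. The cutoff date is a placeholder (the post only says @mentions reached HTML content in mid-2019), and the function name, argument names, and the precedence of !markdown over "don't autoformat" are assumptions.

```python
from datetime import date

# Placeholder: the real switch-over date for @mentions in HTML content is not given here.
MENTIONS_CUTOFF = date(2019, 7, 1)


def guess_format(text, preformatted, posted):
    """Pick a named format for content that carries no format metadata."""
    if text.startswith("!markdown"):
        return "markdown0"     # Markdown content already had @mentions
    if preformatted:
        return "html_raw0"     # "don't autoformat"
    if posted < MENTIONS_CUTOFF:
        return "html_casual0"  # written before @mentions applied to HTML
    return "html_casual1"


print(guess_format("old CSS post with @media", False, date(2015, 3, 1)))
print(guess_format("new post from an old client", False, date(2020, 6, 5)))
```

This also shows the drift described above: a client that never sends format metadata keeps landing on `html_casual1`, even after a newer casual format exists.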

Please suggest more of these if you think of them.

This sounds like it will retain the obnoxious bug-in-practice that goes like:

* User goes to start an entry
* User picks a format at semi-random
* User posts
* User discovers that it was not the intended format
* User edits to pick intended format
[some time passes]
* User makes new entry
* Format of new entry retains the original, incorrect, randomly selected format
* User believes that last entry was posted with correct format, is annoyed that this one isn't
* User edits most recent entry
* Cycle repeats

That is an excellent objection, thank you. (I always wondered what the history of that top level “always use rte” setting was.)

How about: when editing, the format switcher has an “also use this format next time” checkbox by it?

...Not yet sure that wholly solves it, also gotta find space for that checkbox and its label somewhere.

Edited 2020-06-05 23:48 (UTC)

Oh, or alternately: if you edit an entry or comment and you change the format type, we land you on an interstitial page with, like:

$thing has been edited.

- View $thing

Format was changed to $new; would you like to set it as your default format?

Editing entries always does that anyway. Comments sometimes hit you with an interstitial (if they’re screened, if you unscreened the parent, if you logged in at the same time), could just add that as a new condition.

That way we wouldn’t need extra UI clutter, at least.

Anyway, now you’ve also got me thinking about whether we should always update the default on new posts. Especially for raw, since that’s such a “sometimes food” of a format; any time you use that, you’re probably going right back to your real default next time...

I wonder if there are people who (nearly) always use it.

Stats on that sound like a Mark question.

Before I discovered the markdown tag and @ mentions, I almost always used raw html. Because I'm that kind of old, who started doing their web pages in raw html, and still writes that way even when there is a wysiwyg editor. So it might be that introducing an option that makes explicit the options for markdown might decrease the frequency with which raw html gets used?

What is "RTE"?

Rich text editor. We have one, but it’s only available on the classic (non-beta) create entries page. It’s old and difficult to maintain, ongoing project to replace it.

> ongoing project to replace it.

What do you plan to replace "Rich text editor" with?
Would it make sense to just delete it?

Another rich text editor with more modern, maintainable design.

No, going entirely without a rich text editor of some kind is a non-starter. The features of it are somewhat negotiable, the existence of it isn't.

I believe this (or perhaps a simplified version) should be posted to a user-centric community instead of/in addition to here, so that end users can opine.

Fair, but I was hoping to get some technical feedback first, and also I don't have direct posting rights to the user-centric official communities (except dw_beta, which is not the place).

I would like to mention tangentially, from a relatively technical end-user perspective, that the inconsistency in places is horribly grating even when I do know how to and often prefer to write raw HTML. Specifically, though I'm not sure where this fits into the above, the amount of  in our profile text because there's no way to disable having a <br> jammed into the middle of what would otherwise be a void between list items or something.

Also, I've gotten into the irritating habit of never writing newlines in comments (I'm doing it right now, in fact), because I have an Aversion to having  be rendered as  , and the “no, don't give me this comment form, give me the one which allows me to do that but also doesn't show context in a remotely ergonomic way etc.” alternative got too grating to use over time. I take it the Markdown conversion does better at this?

+1 about the inconsistency being a pain!! That’s kind of my whole character backstory around here.

Yeah, markdown translates “2+ newlines” as “real break,” which I way prefer. The current version here always eats incidental newlines (like, within a paragraph, so poetry needs real tags), but a future version might not, tbd.

Profile text: not really included in the current PR, but definitely on the map. You should be able to select formats the same way almost everywhere. (Not subject lines, but everything else really.)

Do you know about the special <raw-code> tag that disables auto-newlines for a stretch? Works pretty much everywhere on the site, including (pretty sure) profiles. You might find it handy while we’re still figuring this all out.

agreed on wanting p tags dividing paragraphs instead of br tags

Would it be possible to delete some features before adding new features?
E.g., is it possible to delete "Disable Auto-Formatting" option?

- If you mean the checkbox labeled "don't autoformat," that's going away as part of this, yes.
- If you mean taking away the ability to do raw HTML without getting your linebreaks mangled, that can't go away. It's an important escape hatch that lets people post things that were generated by other programs. (For example, using Scrivener or Word to export a chapter of fanfic as HTML.)

> If you mean the checkbox labeled "don't autoformat,"

Where can I see "don't autoformat," checkbox?

Here:
https://www.dreamwidth.org/update.bml
I see "Disable Auto-Formatting" checkbox.

If you click reply to any comment and then click More Options, you will see “don’t auto-format.” But it really means the same thing as “disable auto-formatting.”

But to answer your original questions about if it is possible to get rid of some features and not have the RTE anymore, you might really like the Beta Create Entries page. You can toggle it on or off at dreamwidth.org/beta and if you click the Settings wheel there you can choose which options you want visible on the page when you’re making posts.

Edited 2020-06-06 14:46 (UTC)

Why does Dreamwidth use different labels for naming the same thing?
The label on such checkbox should be "Don't auto-format:" or "Disable Auto-Formatting", but not both, right?
Or better yet - do not have that checkbox at all and activate such option through some form of markup, because such option is needed only for [rare] advanced usages.

They definitely should become consistent.

I opened:
https://www.dreamwidth.org/beta
and clicked "Turn ON beta testing" button in "New Create Entries Page" section.
My first impression is that the Beta version of:
https://www.dreamwidth.org/entry/new
looks better than the standard version.
It is cleaner and customizable.
I still need to actually use it (post blog entry) to decide if it is, actually, an improvement.

"Create Entries" Beta page may lose changes if I collapse "Settings" without saving. Which is a little bit counterintuitive. But it is a nitpick at this time.

Thank you.

"Spell check" button is no longer needed, because browsers' support for spellchecking is good enough already.
If a significant number of Dreamwidth users really need "Dreamwidth spellchecking" - make the "Spell check" button "hideable"/optional.

This is not a critical change, of course, but as with any other cleanup - fewer features make a product easier to use and easier to maintain.

Spell check is planned for removal for all the good reasons you mentioned.

Cool.
I also generally prioritize features removal over features addition, because feature removal makes codebase much easier to operate.
"Delete feature A and then add feature B" is easier than "Add feature B and then delete feature A".

Edited 2020-06-06 17:31 (UTC)

From the perspective of maintaining the code, definitely.

From the perspective of having a working website, the deletion of some things can't go to production until the replacement is ready.

> until the replacement is ready

"Spellchecker replacement" was ready and deployed many years ago.

Create Entries replacement was not ready any length of time ago, because the rich text editor is not ready yet.

I meant that "Spellchecker" feature, probably, should have been deleted earlier.
I did not mean that Create Entries replacement should be implemented after deleting old version of Create Entries page.
I find current approach with having both versions active during testing and transition - reasonable.

> If you mean taking away the ability to do raw HTML without getting your linebreaks mangled

I do not mean that.
Advanced formatting (including command preserving raw HTML) could be done with advanced syntax in the posting text.

This is thorough and thought provoking.

As the person primarily responsible for much of the mess we're in today, the driving goal behind Markdown and @-mentions is to make Dreamwidth easier to use and more accessible (as a technology). Having to learn to write HTML is not something one should _have_ to do in order to express oneself, and so I strongly feel that we should be moving in directions to make it easier.

That said, we are of course built on the idea of customization and being able to really go deep in what kinds of things you can do to your content. That has many positive things, but ultimately, LJ::CleanHTML is the kind of thing you end up with when you want to make real ultimate cosmic power but also combine it with a reading page that has to combine _my_ journal customization insanity with _your_ entry customization insanity. God help us all.

So here we are. You're untangling my hacks, and I appreciate that.

I think basically I agree with your proposal and what you're thinking here, and I concur that we should draw the line at Markdown (and possibly some additions to it, since the Perl Markdown renderer is pretty minimal).

Ultimately tho, it doesn't look like this will really _change_ user behavior/experience? They will see a new dropdown/UI change that lets them explicitly select their formatting, which will save itself (but I saw the comments above, which is a good point to try to make sure we don't confuse/annoy users by changing their defaults unexpectedly).

One thing that wasn't mentioned, probably because it's obvious, but I'll say it anyway so it's documented: crossposting doesn't change here, right? Since we just render that out as HTML and pass it raw to the target site and ask them not to mangle it. That should work just fine.

(As an aside, something I've been pondering as I've looked at LJ::CleanHTML in the past is if we could drastically simplify it with some modern technologies. It does a lot of work to try to block things like scripts, but possibly we can just use CSP or such to block those by policy and then simplify? I suppose this doesn't fix a lot of the "HTML fixing" we do though, so ultimately it probably doesn't actually simplify the cleaner by any material amount. Oh well.)

Edited (yes I got savaged by my own feature why) 2020-06-06 07:15 (UTC)

> ...crossposting doesn't change here right? Since we just render that out as HTML and pass it raw to the target site and ask them not to mangle it.

Yes, exactly!! Now that we're pre-rendering everything and force-setting opt_preformatted at the destination (instead of trying to work within a foreign version of casual html), we're home-free no matter WHAT we do. As long as we're:

- Able to render non-broken HTML for our own site
- Translating things that DO directly correspond to magic LJ markup (right now that's just @mentions, but if we ever made a shorthand syntax for e.g. cut tags we'd want to handle that too)

...then crossposting stays fine.

> Ultimately tho, it doesn't look like this will really _change_ user behavior/experience? They will see a new dropdown/UI change that lets them explicitly select their formatting...

That's my hope, yeah — the user side of this should just feel like we replaced the "don't autoformat" checkbox with an easier-to-understand dropdown. The only complex bits from the user perspective are:

- How we set user defaults (cf. above).

- The combination of "current format affects editor controls" and "RTE is just another format". This makes perfect sense to me, but if we, e.g., ended up with an RTE that had its own integrated "view source" mode, we could end up in a nested hierarchy, where I chose rich formatting but I'm viewing/editing HTML code as part of the implementation of "rich." Which I think I don't actually have a problem with, but it might be worth thinking through those contradictions.

- How we handle obsolescence of formats. (The "but I liked the old way" effect -- would we want to offer a backdoor way to keep your default locked to a hidden format? So far the consensus of staff and volunteers sounds like "no." I'm inclined to agree.)

> ...possibly we can just use CSP or such to block [scripts] by policy and then simplify? I suppose this doesn't fix a lot of the "HTML fixing" we do though...

Word. IMO the security part of CleanHTML contributes a lot to its gnarliness, so even if the text transformation part is always going to have to happen somewhere, it'd still be super cool to make the security part more modern and concise. But when it comes to XSS security, gotta admit that I'm Baby, so I don't have a useful contribution there yet.

> if we, e.g., ended up with an RTE that had its own integrated "view source" mode, we could end up in a nested hierarchy, where I chose rich formatting but I'm viewing/editing HTML code as part of the implementation of "rich."

Why so? Like, on Tumblr, I can compose a post in rich text mode complete with (say) italics, switch to HTML mode and find the tags already there, add (say) blockquote tags around a couple paragraphs, and switch back to rich text, and the blockquotes will persist. Might be mangled some, especially if I switch back and forth several times. (no, Tumblr, I want a single blockquote containing multiple paragraphs! this is a thing you have done before! why are we breaking the blockquote up by paragraph with <div> tags??) And I don't have the first idea how any such thing would be implemented on Dreamwidth, or to what extent its implementation on Tumblr might rely on other Tumblr functionality that Dreamwidth doesn't and doesn't plan to have. But it can be done.

Yeah, it can totally be done, it's just a question of how the other constraints balance out. Like you've noticed, most RTEs do some normalization of the HTML when they deserialize it into their internal models, and that's a lossy transformation. Depending on HOW lossy, it might make sense to use an RTE's custom "html edit" mode for tweaking the underlying HTML, because the editor would be able to show you syntax problems while you're still tweaking instead of just throwing away what it didn't recognize when you switch back to rich mode.

wohali was working to get CKEditor 4 integrated at one point, and it sounded like the lossiness was possibly extreme enough to consider that? I don't remember the full details.

On a site that's more aggressive about limiting the subset of HTML it lets through, these tradeoffs work differently, but DW's anything-goes approach adds some complexity.

I agree, roadrunnertwice has done a really nice job thinking this through.

> [LJ::CleanHTML] does a lot of work to try to block things like scripts, but possibly we can just use CSP or such to block those by policy and then simplify?

It might just be my age and remembering a time before CSP, but I'd pretty strongly recommend against relying on CSP alone. CSP is great as a backup for when something gets through, but there are a number of cases where even a perfectly-specified CSP is not going to work very well:

Any browser that doesn't understand CSP. (Older browsers, at least.)
Cross-posting. Gotta clean stuff before shipping it off, both so DW doesn't look like an attacker to any automated systems, and because relying on the other end to be secure seems like a not-great idea.
Non-HTML contexts. Presumably you'd still want to clean things before they end up in RSS, for instance, so JS doesn't end up in somebody's feed reader.

More generally, I'd argue that relying on CSP would move security issues from DW's control to third-party control (browser vendors, cross-post targets, e-mail clients, feed readers), which is a lot of faith to put in other people getting security right. Yes, use CSP! But in conjunction with existing security measures, not as a replacement.

It might be possible to simplify things by layering them—one layer to do input transformations, and another for security transformations, say—but that would require very careful design if input transformations need to produce things that would be rejected by the security layer (so no reserved HTML classes or ids, at least without shenanigans like using a random nonce or signed token that gets replaced with the reserved value).
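
A toy Python sketch of that nonce idea, with heavy caveats: the reserved class name `dw-reserved` and all function names are hypothetical, and a real security layer would use a proper HTML parser rather than regexes. It only shows the ordering of the three steps.

```python
import re
import secrets

RESERVED_CLASS = "dw-reserved"  # hypothetical class only the site may emit


def transform_layer(text, nonce):
    """Input transform: mark generated markup with the nonce, not the reserved class."""
    return re.sub(r"@(\w+)", lambda m: f'<span class="{nonce}">{m.group(1)}</span>', text)


def security_layer(markup):
    """Security transform: drop any class attribute that mentions the reserved class."""
    return re.sub(r'\sclass="[^"]*\bdw-reserved\b[^"]*"', "", markup)


def finalize(markup, nonce):
    """Swap the nonce for the reserved value once the security layer has run."""
    return markup.replace(nonce, RESERVED_CLASS)


def render(raw):
    nonce = "n" + secrets.token_hex(16)  # fresh per render, so users cannot guess it
    return finalize(security_layer(transform_layer(raw, nonce)), nonce)


print(render('<span class="dw-reserved">spoof</span> hi @alice'))
```

The user-supplied `dw-reserved` class is stripped, while the span generated for `@alice` ends up carrying it, because the security layer only ever saw the nonce.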

i took a look at LJ::CleanHTML and the code was way nicer than i was expecting from these comments :) then again, i just read it rather than trying to modify it - are there specific things that people find hard to work with in practice?

(as an observer only)

Part of the cognitive load is the knowledge that if you screw up, there's the potential for both known and as yet unknown nastiness to get through.

When it sets your default for you, would that apply to posts or comments you email in, too? Or would you still have to identify if you want markdown at the top of your email?

I think editing a post to a different format should also set that format as your default, if we’re going with using a format sets it as default. I think if anything there should be a ticky (possibly on an interstitial page) asking to check if the user doesn’t want it set as the new default.

Otherwise just don’t set it automatically at all anywhere, and always include a ticky for making it your default. Those are the two options I see. I don’t see a compelling reason for the editing page to behave differently than post or comment.

> I don’t see a compelling reason for the editing page to behave differently than post or comment.

It's mostly in case of editing entries that are in an obsolete format, or an active format you were using before you realized you preferred a different one -- the old clothes might not fit the new you.

Anyway, I'm leaning toward being more explicit; most appealing thing I've thought of so far is: if you save a post in something that's NOT your default format, you end up on a "success" page that includes a "set this format as my default" button.

Edit: Oh, just realized I didn't answer your first question. 😑 For email and API posts, the user's default format shouldn't apply; we'll use the specified format if there is one, and fall back to a global default otherwise.

Reason being, email and API don't give you any feedback about what your settings are before you hit send. So it'd be like, "hope you remember exactly what you set previously," which, bad! So for automated or semi-automated stuff like that, deterministic consistency is way more important than convenience.

Edited 2020-06-06 22:13 (UTC)

As for "any closely related problems", I'm wondering how would it behave when you switch between formats? Currently, when switching between RTE and raw HTML, it sometimes "bakes in" line breaks, adding <br> tags where you didn't want them, even repeatedly. (I don't have specific reproducible steps but I'm fairly certain it wasn't just a user error.)

Otherwise, I think this is a very neat solution to the problem and can't see any obvious issues with it.

(Also, yay for markdown in comment forms! I'm not very taken with the current DW markdown flavor, but it still beats writing HTML markup by hand.)

For plain formats like Markdown and the HTMLs, we wouldn't transform the source at all when switching between formats; the poster's in charge of knowing what they're writing.

For rich text, it gets squirrelly, like you've noticed and like alexseanchai and I were talking about above. Current thinking goes something like:

- Switching to rich text if you have some current text entered will try to translate it to fit the RTE's internal model. We'll want to be able to deserialize both HTML and markdown, and we can probably do something to distinguish between raw and casual HTML to preserve most of the user's intent.
- But that's a lossy transformation, and there's no way around that.
- So it should warn you about that, and give you a chance to cancel out.
- It should also warn you when switching OFF the RTE -- the transformation to plain HTML would be non-lossy, but if you're about to try and tweak something and then back to rich mode, you'll hit a lossy transformation step again, so you should be aware of that beforehand.

I might be the only one affected by this, but ... "only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be markdown0, html_casual1, and html_raw0." -- I use a shell script to crosspost between DJ, IJ, & DW (I was using a command-line client for all three but it stopped working with DW so I switched to the script using post-by-mail for here), which means I'm basically using coding without @mentions, and have been using @ to mark Twitter usernames in attributions for quotations. So the ability to select html_casual0, especially in post-by-email, would be useful to me. (I've currently got the usual way Twitter IDs show up in my entries special-cased in the script

:%s/(@\([^ ]*\))/(@\1.twitter)/g

but sometimes they're not in quite that context and I have to hand-edit my entry after it's posted.)

Basically, I'm trying to post as though it were syndicated, despite not using a syndication API. So it'd be nice if I had access to a syndication-friendly input format.

(Also, is there a way to specify "do not autoformat" in post-by-email? My script currently tries to reformat from html_casual0 into something suitable for allow-autoformat.)

The other reason I'd like to keep an html_casual0 option is that it's easier for me to edit in HTML with line breaks in my source, but specify presentation linebreaks with <p> and <br>, which is what I used to get with "clive -p". So my script'd be more reliable if I can strip out the bit where it attempts to remove existing linebreaks and replace <p>..</p> and <br> with new linebreaks.

I have no idea how many other people are still using command-line clients to post entries though. (BTW, if anybody's hacked Clive to work with DW again like it used to, and are willing to share your diffs, I'd be grateful. I was using Clive 0.4.9 when the DW API changed just enough to break it, but looking at the Changelog for Clive 0.4.10 it doesn't look like any DW support was added to it.)

Thanks for sharing that use case, it's an interesting one!

Couple of things:

- Today, email posts can't set formatting options. (Comments are locked to Markdown, and posts are locked to default behavior i.e. "casual 1".) But tbh, why not? If we go forward with this, I'd be in favor of a `post-editor` header that lets you choose your own format.

- Same thing for posts via the new API; we'd let you set the format of your choice.

- About obsolete or weird formats: I actually don't care whether API clients (or email) use these; in fact, there's very good arguments for letting semi-automated stuff like that use every valid format. So my current implementation does nothing to block it, and you'd be able to use html_casual0 just fine. (The thing I DO care about is keeping the post/reply web UI clean.)

Although: it kind of sounds like you'd rather post in html_raw0, since you're entering your own <p> and <br> tags anyway.

I read this, and understood some. Here's my contribution: Because I am a monster, and because DW allows it, I'll often specify !markdown in a post and then go ahead and use an unholy mix of HTML and Markdown. Would that mixed usage--perhaps for older posts only--be covered by one of your format modules?

Also I use Semagic to post to DW, so I wouldn't be able to select a format for each post. Instead of having the site guess what format to use, would it be possible to create a "!markdown" like command to put at the head of a post to indicate the format? Or maybe have an option in settings where a user can choose a default format for new posts.

Mixing in HTML at whim is officially part of the Markdown language, so that stays, even if we switch to a different type of Markdown. 👍🏼

Old clients like Semagic are gonna be out of luck on this one, I think -- they'll be locked to mid-2020 behavior (that is: default html_casual1, optional html_raw0 if they support a "don't autoformat" checkbox, markdown0 if you start the post with !markdown) for as long as they keep working. Reasons being:

- I'm trying to get AWAY from weird secret glyphs like !markdown, so I don't want to add more of them.

- Adding new global settings is bad.

- Being able to set a "hard default" like that for API posts would introduce really confusing and unexpected behaviors, I think — instead of "no format specified" having a single reliable effect, it could have any number of effects depending on remote settings. In fact, that could cause existing clients to break even faster than they're already breaking. Better to have API calls be predictable.
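For reference, the mid-2020 behavior that old clients stay locked to can be written down as one small function. This is a sketch in Python, not DW's actual Perl; the function name is hypothetical, and the thread doesn't say which wins when a post both starts with !markdown and has "don't autoformat" set, so checking the glyph first is an arbitrary assumption:

```python
def legacy_client_format(body: str, dont_autoformat: bool = False) -> str:
    """Format an old client's post ends up with under mid-2020 rules."""
    # Assumption: the !markdown glyph is checked before the checkbox.
    if body.startswith("!markdown"):
        return "markdown0"
    if dont_autoformat:
        return "html_raw0"
    return "html_casual1"
```

The point is that the result depends only on what the client sends, never on a remote setting.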

Got it! Thanks for the response!

I've read this (and bits of the git repository you linked to) and while I have not noticed any issues, I'm not likely to, because I don't understand enough of the code I'm reading! I am, however, very willing to do some testing and see if I can break things, if there is a way for me to do so.

In fact, there is! This stuff is live on http://rr-thrice.hack.dreamwidth.net (my dev server), and you can shake it down there if you’re interested.

- For just commenting, you can log in with your dreamwidth openID without having to sign up.
- If you want to post entries too, I think the invite code “5GNA9R3SWM656AAAAAAN” is still active.
- If anyone else reading this wants an invite code, just say so.

Cool! Thanks. I've managed to at least get the site to load (mumble mumble safari mumble won't open http mumble mumble); will have to investigate further when I find a spoon or two.

I don't know the code (haven't touched Perl seriously in 15 years), but as a techie user who wrestles with similar problems on my own site: LGTM. In particular, it sounds like it gets me what I *personally* want, which is to be able to use Markdown consistently without needing the glyph...

hard same.

This looks great! Couple of questions:

----

1) What happens if you find a security issue in an older formatter and need to overhaul or replace the implementation? (Perhaps the implementation is so broken it *can't* be fixed, and needs replacement.)

Can DW reasonably do a search to sample for posts and comments where the change makes a meaningful difference in output? And what happens if there's a discrepancy?

Technically, your proposal wouldn't change this situation all that much, since DW *already* has this problem. But it might offer more solutions. For example, you could "clone" markdown1 as markdown1_pre_20210101, set old markdown1 posts to use it, and fix markdown1 in place. If new posts are *forbidden* from using markdown1_pre_20210101, and editing an old post bearing that formatter switches it to markdown1 so that people can't use them to host exploits, then you could have your cake and eat it too...

This would be more complexity, and I'm not suggesting it's a great idea, but I think it's worth considering how that might play out. Old libraries sometimes have to be entirely replaced, and those corner cases can be *hard* to replicate.
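Roughly, the clone-and-lock idea could look like this. It's a sketch in Python with hypothetical names (`LOCKED_SUCCESSOR`, `format_for_save`); nothing like it exists in the PR as far as I know:

```python
# Hypothetical: locked legacy clones, mapped to the fixed format that
# replaces them whenever a post is edited.
LOCKED_SUCCESSOR = {"markdown1_pre_20210101": "markdown1"}


def format_for_save(requested: str, is_new: bool) -> str:
    """Return the format to store when a post or comment is saved."""
    successor = LOCKED_SUCCESSOR.get(requested)
    if successor is None:
        return requested  # ordinary format, store as requested
    if is_new:
        # New content may never use the locked clone.
        raise ValueError(f"{requested} is locked and cannot be used for new posts")
    # Editing old content moves it onto the fixed formatter.
    return successor
```

Untouched old posts keep rendering through the locked clone; only a save goes through this check.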

(Another possibility is rendering all the old content to HTML in order to freeze it, and storing that alongside the source markup.)

----

2) You write about the scenario where someone edits an old post:

« The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change. »

What happens if I edit an old post and change the format selector to a newer format, and realize I've made a mistake and actually need it to go back to the deprecated format? Will the selector still show the *previous/original* format?

2) [...] What happens if I edit an old post and change the format selector to a newer format, and realize I've made a mistake and actually need it to go back to the deprecated format? Will the selector still show the *previous/original* format?

As written (in the newly updated version of this PR), no, you'd be outta luck. (Unless you used a userscript/browser extension to send an old format type, which we wouldn't do anything to prevent.) My hope is that:

1. People mostly won't do this.
2. The consequences of oopsing it on one post won't be very large. For example, if you accidentally lost html_casual0 on a thing, you'd just have to make sure that post wasn't using @ signs in ways that look like user mentions, and escape them if so. The real hell of format updating happens when you need to update EVERYTHING -- updating ONE thing should be fairly mild.

If that DOES turn out to be a major annoyance, I bet we can think of some ways to deal with it. First one that occurs to me: store the "official" date when each format went obsolete, and when editing, add in every format that was still active at the time the post was originally made. Anyway, I think I feel fairly comfortable leaving that as a problem to solve if-and-when.
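A sketch of how the editor's format menu could be built under both rules (active formats plus the post's existing format, and the "formats still active at posting time" idea). It's Python rather than Perl, the registry and function names are hypothetical, and the obsolescence date is a placeholder, not a real one:

```python
from datetime import date
from typing import Dict, List, Optional

# Hypothetical registry: None means "still active". The date is made up.
OBSOLETE_SINCE: Dict[str, Optional[date]] = {
    "markdown0": None,
    "html_casual1": None,
    "html_raw0": None,
    "html_casual0": date(2019, 1, 1),  # placeholder date
}


def menu_formats(current_format: Optional[str] = None,
                 posted_on: Optional[date] = None) -> List[str]:
    """Formats to offer in the post/edit form's selector."""
    options = [name for name, gone in OBSOLETE_SINCE.items() if gone is None]
    # Editing legacy content: keep its existing format available.
    if current_format is not None and current_format not in options:
        options.append(current_format)
    # Possible extension: also offer formats that were still active
    # when the post was originally made.
    if posted_on is not None:
        for name, gone in OBSOLETE_SINCE.items():
            if gone is not None and posted_on < gone and name not in options:
                options.append(name)
    return options
```

With `posted_on` left out, this is the behavior as written in the PR: once you switch an old post away from its legacy format and save, the legacy format is gone from the menu.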

1) What happens if you find a security issue in an older formatter and need to overhaul or replace the implementation?

Are you trying to curdle all the milk in my fridge, or what. 😅

The idea of (best-effort secure clone) + (locked version of insecure original) is really interesting, thank you for thinking through that! (Also glad I'm not the only one who finds this an interesting family of problems to think through. 😄)

But I think the most likely scenario would be that, as a matter of ops policy, mark and co would want to get vulnerable code OUT of the site in as complete a way as possible, even if we were pretty sure there were no sploits hiding in old posts. If that's how it shook out, I think we'd see something like:

- Replace vulnerable format with a best-effort secure clone, which probably has some edge-case inaccuracies.
- Fix those edge-case bugs as they're found, and count our blessings that it was a security bug whose long-term fallout was just "some old posts might have rare display glitches for a couple months."

Doing a broad sample from the database to check the quality of a clone seems pretty trivial, probably. "Baking" old posts into HTML... that actually sounds pretty gnarly. Actually, I was just reading about StackExchange's big Markdown processor changeover the other day -- did you know they dealt with that by running a 💀world-ending database migration that did sight-unseen automatic edits on every piece of user content ever submitted?💀 I think DW probably doesn't have the resources to attempt something like that, and a big HTML bake-fest is sort of a milder version of that same idea.

Anyway, good thinks, thanks for raising em.

As written (in the newly updated version of this PR), no, you'd be outta luck. [...] The consequences of oopsing it on one post won't be very large. [...] Anyway, I think I feel fairly comfortable leaving that as a problem to solve if-and-when.

Reasonable!

Are you trying to curdle all the milk in my fridge, or what. 😅

Haha, I'm adding this to the "quotes" section of my résumé. ;-)

But I think the most likely scenario would be that, as a matter of ops policy, mark and co would want to get vulnerable code OUT of the site in as complete a way as possible, even if we were pretty sure there were no sploits hiding in old posts.

And you might have to, e.g. if a software package becomes no longer feasibly available for the server environment.

Mismatch detection could be done with a dark launch. Run both formatter versions on the source, only use the output of the old one, and log the IDs of posts and comments where the mismatch occurred (and whether exceptions were caught on the new one). No need for a full DB scan, at least at first.
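Something like this, as a sketch in Python (the names are hypothetical, and a real version would write to a log rather than an in-memory list):

```python
from typing import Callable, List, Tuple

# (item id, kind), where kind is "mismatch" or "exception".
mismatch_log: List[Tuple[int, str]] = []


def render_with_dark_launch(item_id: int,
                            source: str,
                            old_render: Callable[[str], str],
                            new_render: Callable[[str], str]) -> str:
    """Render with the old formatter; run the new one only to compare."""
    old_html = old_render(source)
    try:
        new_html = new_render(source)
    except Exception:
        # The new formatter must never break the page while it's dark.
        mismatch_log.append((item_id, "exception"))
    else:
        if new_html != old_html:
            mismatch_log.append((item_id, "mismatch"))
    return old_html  # readers only ever see the old output
```

Only IDs get recorded, so the log itself doesn't contain any post content.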

Tangent: For random sampling, I'm imagining you could even do a privacy-preserving transform on source and output HTML, where you replace all letters with "a" and all digits with "1" to allow you to look at non-public content where a mismatch was detected. (Sufficient? Acceptable by DW internal standards? No idea!)
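The transform itself is tiny. A sketch in Python (function name made up):

```python
def mask_text(text: str) -> str:
    """Replace every letter with "a" and every digit with "1"."""
    return "".join(
        "a" if ch.isalpha() else "1" if ch.isdigit() else ch
        for ch in text
    )
```

One thing this makes visible: applied to output HTML, it masks tag names too, so `Hi <b>42</b>!` becomes `aa <a>11</a>!`. You can still see where tags open and close, but not which tags they are, so a real version would probably want to leave markup alone and only mask text content.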

Anyway, sounds like there are plenty of options and no decision is needed ahead of time. Agree that fix-forward is probably fine.
